-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support BM25 scoring for chunk matches #889
Conversation
213739a
to
7c931a6
Compare
// scoreChunk calculates the score for each line in the chunk based on its candidate matches, and returns the score of | ||
// the best-scoring line, along with its line number. | ||
// Invariant: there should be at least one input candidate, len(ms) > 0. | ||
func (p *contentProvider) scoreChunk(ms []*candidateMatch, language string, opts *SearchOptions) (chunkScore, []*Symbol) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we always score chunks line-by-line, in keeping with the mental model "the score of a chunk is the score of its best line". For classic Zoekt scoring, we actually just care about the best-scoring candidate match... so it's not necessary to split up the chunk by line, and this may add a little overhead.
I added BenchmarkScoreChunkMatches
below and did not see a significant difference:
Before, 3 runs
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10 182 6213881 ns/op 3204619 B/op 22198 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10 184 6318714 ns/op 3203388 B/op 22198 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10 180 6415737 ns/op 3204673 B/op 22198 allocs/op
After, 3 runs
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10 182 6451712 ns/op 3202811 B/op 22197 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10 172 6627135 ns/op 3202978 B/op 22198 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10 180 6460295 ns/op 3204055 B/op 22198 allocs/op
I also reran BenchmarkShardedSearch
and saw no difference.
// addScore increments the score of the FileMatch by the computed score. If | ||
// debugScore is true, it also adds a debug string to the FileMatch. If raw is | ||
// -1, it is ignored. Otherwise, it is added to the debug string. | ||
func (m *FileMatch) addScore(what string, computed float64, raw float64, debugScore bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this here from score.go
so all FileMatch
methods are in the same place.
|
||
// candidateMatchScore scores all candidate matches and returns the best-scoring match plus its score information. | ||
// Invariant: there should be at least one input candidate, len(ms) > 0. | ||
func (p *contentProvider) candidateMatchScore(ms []*candidateMatch, language string, debug bool) (scoredMatch, []*Symbol) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was moved to score.go
and renamed to scoreLine
.
From my testing, this really helps us select better top-chunks when BM25 is enabled. It makes a big difference in evals, bringing us from 0% on "Find error" queries to > 80%, and above 50% on challenging "Fuzzy symbol search" and "Find logic" queries. I've also played around with it a bunch locally, and the results qualitatively feel way better. I looked through the eval queries where we still fail, and we are now very often getting the right file and a "reasonable" top snippet, just not the ideal one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good! I'll defer to others with more expertise for an in-depth review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
// on it being non-nil to cache the read. | ||
if nl == nil { | ||
nl = make([]uint32, 0) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice find
Currently, BM25 scoring only applies to the overall
FileMatch
score. The algorithm gathered term frequencies from all candidate matches in the file to produce a file-level score. HoweverLineMatch
andChunkMatch
scores were still calculated using the classic Zoekt scoring algorithm.This PR implements BM25 scoring for
LineMatch
andChunkMatch
. It does so by calculating a BM25 per line. Compared to the classic Zoekt algorithm, this rewards multiple term matches on a line. Because our term frequency calculation also boosts symbol matches, the score smoothly balances between "many term matches" and "interesting term matches".Now, the code is structured as follows:
scoreChunk
: goes through each line in the chunk, calculating its score throughscoreLine
, and returns the best-scoring linescoreLine
: calculates the score for a single lineThe mental model is that "the score of a chunk is always the score of its best line".
Closes SPLF-759