Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support BM25 scoring for chunk matches #889

Merged
merged 2 commits into from
Jan 15, 2025
Merged

Support BM25 scoring for chunk matches #889

merged 2 commits into from
Jan 15, 2025

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Jan 14, 2025

Currently, BM25 scoring only applies to the overall FileMatch score. The algorithm gathered term frequencies from all candidate matches in the file to produce a file-level score. However LineMatch and ChunkMatch scores were still calculated using the classic Zoekt scoring algorithm.

This PR implements BM25 scoring for LineMatch and ChunkMatch. It does so by calculating a BM25 per line. Compared to the classic Zoekt algorithm, this rewards multiple term matches on a line. Because our term frequency calculation also boosts symbol matches, the score smoothly balances between "many term matches" and "interesting term matches".

Now, the code is structured as follows:

  • scoreChunk: goes through each line in the chunk, calculating its score through scoreLine, and returns the best-scoring line
  • scoreLine: calculates the score for a single line

The mental model is that "the score of a chunk is always the score of its best line".

Closes SPLF-759

// scoreChunk calculates the score for each line in the chunk based on its candidate matches, and returns the score of
// the best-scoring line, along with its line number.
// Invariant: there should be at least one input candidate, len(ms) > 0.
func (p *contentProvider) scoreChunk(ms []*candidateMatch, language string, opts *SearchOptions) (chunkScore, []*Symbol) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we always score chunks line-by-line, in keeping with the mental model "the score of a chunk is the score of its best line". For classic Zoekt scoring, we actually just care about the best-scoring candidate match... so it's not necessary to split up the chunk by line, and this may add a little overhead.

I added BenchmarkScoreChunkMatches below and did not see a significant difference:

Before, 3 runs
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10         	     182	   6213881 ns/op	 3204619 B/op	   22198 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10         	     184	   6318714 ns/op	 3203388 B/op	   22198 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10         	     180	   6415737 ns/op	 3204673 B/op	   22198 allocs/op
After, 3 runs
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10         	     182	   6451712 ns/op	 3202811 B/op	   22197 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10         	     172	   6627135 ns/op	 3202978 B/op	   22198 allocs/op
BenchmarkScoreChunkMatches/score_large_ChunkMatch-10         	     180	   6460295 ns/op	 3204055 B/op	   22198 allocs/op

I also reran BenchmarkShardedSearch and saw no difference.

// addScore increments the score of the FileMatch by the computed score. If
// debugScore is true, it also adds a debug string to the FileMatch. If raw is
// -1, it is ignored. Otherwise, it is added to the debug string.
func (m *FileMatch) addScore(what string, computed float64, raw float64, debugScore bool) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this here from score.go so all FileMatch methods are in the same place.


// candidateMatchScore scores all candidate matches and returns the best-scoring match plus its score information.
// Invariant: there should be at least one input candidate, len(ms) > 0.
func (p *contentProvider) candidateMatchScore(ms []*candidateMatch, language string, debug bool) (scoredMatch, []*Symbol) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved to score.go and renamed to scoreLine.

@jtibshirani
Copy link
Member Author

jtibshirani commented Jan 14, 2025

From my testing, this really helps us select better top-chunks when BM25 is enabled. It makes a big difference in evals, bringing us from 0% on "Find error" queries to > 80%, and above 50% on challenging "Fuzzy symbol search" and "Find logic" queries.

Before/After
Screenshot 2025-01-14 at 2 21 30 PMScreenshot 2025-01-14 at 2 21 30 PM
Screenshot 2025-01-14 at 2 21 30 PMScreenshot 2025-01-14 at 2 21 30 PM

I've also played around with it a bunch locally, and the results qualitatively feel way better. I looked through the eval queries where we still fail, and we are now very often getting the right file and a "reasonable" top snippet, just not the ideal one.

@jtibshirani jtibshirani requested review from camdencheek and a team January 14, 2025 22:31
Copy link

@janhartman janhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I'll defer to others with more expertise for an in-depth review.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

// on it being non-nil to cache the read.
if nl == nil {
nl = make([]uint32, 0)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find

@jtibshirani jtibshirani merged commit 4c8bb19 into main Jan 15, 2025
9 checks passed
@jtibshirani jtibshirani deleted the jtibs/line-bm25 branch January 15, 2025 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants