ranking: add phrase boosting to BM25 #917

Merged
merged 6 commits into from
Feb 21, 2025

Conversation

stefanhengl
Member

@stefanhengl stefanhengl commented Feb 19, 2025

Relates to SPLF-838

With this change we recognize boosted queries in our BM25 scoring and adjust the overall score accordingly.

We need to take care of two parts: the overall BM25 score of the document, and the line score that determines the order in which we return the chunks.

Test plan:

  • new scoring test
  • Evaluations look good. I will post them in the ticket

@stefanhengl stefanhengl marked this pull request as ready for review February 19, 2025 13:13
@jtibshirani
Member

Some thoughts on the approach:

  • In this PR, we boost the final line scores, but not the final file score. (For files we only perform the term-boosting within the BM25F calculation). This means that files with a phrase match are still not always ranked highly. See my example from the Sourcegraph PR.
  • The meaning of 'query boost' is now a bit difficult to understand, since it applies both to the final scores, and as a term boost within BM25F. Also, the term boost ignores the actual boost value.

Here is an alternate suggestion, which applies only the final boost and skips the BM25F boost. It also takes a maximum to ensure we don't apply the boost multiple times for the same match tree. The mental model: query boosting multiplies the whole BM25 score (or classic Zoekt score) by a certain value. This is how query boosting works in many other systems like Lucene: it doesn't affect the term frequency calculation itself.
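As a rough illustration of the "maximum" idea, here is a minimal Go sketch. The `node` type, its fields, and the `maxBoost` and `boostedScore` functions are hypothetical stand-ins, not Zoekt's actual match tree API. Treating a boost of 0 as "unset" and skipping unmatched subtrees are also assumptions made for the example.

```go
package main

import "fmt"

// node is a hypothetical, simplified match tree node.
// It is NOT Zoekt's real matchTree type.
type node struct {
	boost    float64 // assumption: 0 means no boost was set on this node
	matched  bool
	children []*node
}

// maxBoost returns the largest boost found on any matched node,
// or 1 (neutral) if no matched node carries a boost above 1.
func maxBoost(n *node) float64 {
	best := 1.0
	if n == nil || !n.matched {
		return best
	}
	if n.boost > best {
		best = n.boost
	}
	for _, c := range n.children {
		if b := maxBoost(c); b > best {
			best = b
		}
	}
	return best
}

// boostedScore multiplies an already computed score (BM25 or otherwise)
// by the tree's maximum boost, so nested boosts are not compounded.
func boostedScore(score float64, root *node) float64 {
	return score * maxBoost(root)
}

func main() {
	// Two nested boosted nodes: the boost is applied once, not 20 * 20.
	phrase := &node{boost: 20, matched: true}
	wrapper := &node{boost: 20, matched: true, children: []*node{phrase}}
	root := &node{matched: true, children: []*node{wrapper}}
	fmt.Println(boostedScore(3.5, root))
}
```

Taking the maximum rather than the product keeps the final multiplier equal to a single boost value even when the same boosted clause appears at several levels of the tree.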

I pushed to this branch: https://github.com/sourcegraph/zoekt/tree/jtibs/phrase-boost

Eval results look good! Here is this PR vs. my hacky branch.
[Image: Screenshot 2025-02-19 at 10 14 09 AM]
[Image: Screenshot 2025-02-19 at 10 19 27 AM]

@stefanhengl
Member Author

> Some thoughts on the approach:
>
>   • In this PR, we boost the final line scores, but not the final file score. (For files we only perform the term-boosting within the BM25F calculation). This means that files with a phrase match are still not always ranked highly. See my example from the Sourcegraph PR.

The intention was to let BM25 decide which document is most relevant and use term frequency boosting to nudge the score a bit, just like we do for symbols. Admittedly, the impact is small but it is just as big as the impact of a symbol match, which makes sense to me as a mental model.

Boosting the overall score of a document feels like a blunt-force approach, effectively overriding the BM25 calculation and almost nullifying its ranking impact. I am not as familiar with how Lucene does it so I cannot argue with that. Generally, I am in favor of copying industry standards unless we have a strong reason not to.

>   • The meaning of 'query boost' is now a bit difficult to understand, since it applies both to the final scores, and as a term boost within BM25F. Also, the term boost ignores the actual boost value.

I didn't use the actual boost value for the overall BM25 score, because (1) the score saturates quickly, which might be confusing since doubling the boost won't have double the effect, and (2) I already use it for line scoring.
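The saturation point can be illustrated with the term-frequency component of textbook BM25. This is a sketch only: the `tfScore` function is a hypothetical helper, and `k1 = 1.2` and `b = 0.75` are conventional defaults assumed here, not values taken from this PR.

```go
package main

import "fmt"

// tfScore is the term-frequency part of textbook BM25:
//   tf * (k1 + 1) / (tf + k1 * (1 - b + b * lengthRatio))
// lengthRatio is document length divided by average document length.
// Hypothetical helper for illustration, not Zoekt's implementation.
func tfScore(tf, k1, b, lengthRatio float64) float64 {
	return tf * (k1 + 1) / (tf + k1*(1-b+b*lengthRatio))
}

func main() {
	const k1, b = 1.2, 0.75 // assumed conventional defaults
	// Each doubling of tf adds less than the previous one; the value
	// approaches k1 + 1 and never exceeds it.
	for _, tf := range []float64{1, 2, 4, 8} {
		fmt.Printf("tf=%v score=%.3f\n", tf, tfScore(tf, k1, b, 1))
	}
}
```

Because the component is bounded above by `k1 + 1`, scaling the term frequency by a boost factor can only nudge the score, which matches the small impact described above.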

If I read your eval results correctly, I see your branch improves the result for one query (?), but without confidence intervals, it's hard to determine whether this is a significant improvement. I would have liked to see a bigger shift to make the decision more obvious.

@jtibshirani
Member

jtibshirani commented Feb 20, 2025

> Boosting the overall score of a document feels like a blunt-force approach, effectively overriding the BM25 calculation and almost nullifying its ranking impact.

Yes, this is actually the intention of the 'phrase boost' as we originally developed it. In most cases, it's supposed to "float" the phrase results above the keyword results. This is what query boosting typically does in search engines like Lucene: it applies a boost to an entire query clause, multiplying that clause's final score. It does not affect the inner workings of BM25.
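A small Go sketch of the "float" effect follows. The `result` type, the `rank` function, the file names, the scores, and the boost value of 20 are all made up for illustration; none of them come from Zoekt or from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical search hit with a precomputed BM25 score.
type result struct {
	name   string
	bm25   float64
	phrase bool // true if the boosted phrase clause matched
}

// rank orders results by final score, where a phrase match multiplies
// the whole BM25 score by phraseBoost. BM25 itself is left untouched.
func rank(results []result, phraseBoost float64) []string {
	final := func(r result) float64 {
		if r.phrase {
			return r.bm25 * phraseBoost
		}
		return r.bm25
	}
	sorted := append([]result(nil), results...) // copy, keep input intact
	sort.SliceStable(sorted, func(i, j int) bool {
		return final(sorted[i]) > final(sorted[j])
	})
	names := make([]string, len(sorted))
	for i, r := range sorted {
		names[i] = r.name
	}
	return names
}

func main() {
	hits := []result{
		{name: "keywords.go", bm25: 9, phrase: false},
		{name: "phrase.go", bm25: 2, phrase: true},
	}
	fmt.Println(rank(hits, 1))  // no boost: keyword-heavy file first
	fmt.Println(rank(hits, 20)) // assumed boost of 20: phrase match first
}
```

With a large enough multiplier, a phrase match outranks a file with a much higher raw BM25 score, which is the intended behavior described above.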

My main concern with the current approach is that it doesn't surface the right file for many queries that currently work with our default Zoekt scoring. These would be perceived as a regression. For example:

> If I read your eval results correctly, I see your branch improves the result for one query (?)

Sorry for the confusion, I wasn't claiming that this alternate approach performs better than this PR. Only that it has a similar positive effect over the baseline, so it also works as intended.

@jtibshirani
Member

jtibshirani commented Feb 21, 2025

@stefanhengl and I caught up offline and sorted things out. After our conversation, I did realize that our boosting approach is hacky and confusing, now that we've brought in BM25. We should eventually work to remove it (I will keep an eye on this!)

I pushed my alternate approach to this branch. I also added boosting to the debug score, so you can tell why some scores are very large:
[Image: Screenshot 2025-02-21 at 8 54 33 AM]

@jtibshirani jtibshirani merged commit 3d43fdf into main Feb 21, 2025
9 checks passed
@jtibshirani jtibshirani deleted the sh/bm25-phrase-boosting branch February 21, 2025 17:23