-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ranking: add phrase boosting to BM25 #917
Conversation
With this change we recognize boosted queries in our bm25 scoring and adjust the overall score accordingly. We need to take care of 2 parts: The overall bm25 score of the document, and the line score determining the order in which we return the chunks. Test plan: new scoring test
Some thoughts on the approach:
Here is an alternate suggestion, which only applies the final boost, skipping the BM25F boost. It also uses a maximum to ensure we don't apply the boost multiple times for the same match tree. The mental model: query boosting multiplies the whole BM25 score (or classic Zoekt score) by a certain value. This is how query boosting works in many other systems like Lucene, it doesn't affect the term frequency calculation itself. I pushed to this branch: https://github.com/sourcegraph/zoekt/tree/jtibs/phrase-boost Eval results look good! Here is this PR vs. my hacky branch. |
The intention was to let BM25 decide which document is most relevant and use term frequency boosting to nudge the score a bit, just like we do for symbols. Admittedly, the impact is small but it is just as big as the impact of a symbol match, which makes sense to me as a mental model. Boosting the overall score of a document feels like a blunt-force approach, effectively overriding the BM25 calculation and almost nullifying its ranking impact. I am not as familiar with how Lucene does it so I cannot argue with that. Generally, I am in favor of copying industry standards unless we have a strong reason not to.
I didn't use the actual boost value of the overall BM25 score, because (1) it saturates quickly and might be confusing because doubling the boost won't have double the effect and (2) I already use it for line scoring. If I read your eval results correctly, I see your branch improves the result for one query (?), but without confidence intervals, it's hard to determine whether this is a significant improvement. I would have liked to see a bigger shift to make the decision more obvious. |
Yes, this is actually the intention of the 'phrase boost' as we originally developed it. In most cases, it's supposed to "float" the phrase results above the keyword results. This is what query boosting typically does in search engines like Lucene, it applies a boost to an entire query clause that multiplies its final score. It does not affect the inner workings of BM25. My main concern with the current approach is that it doesn't surface the right file for many queries that currently work with our default Zoekt scoring. These would be perceived as a regression. For example:
Sorry for the confusion, I wasn't claiming that this alternate approach performs better than this PR. Only that it has a similar positive effect over the baseline, so it also works as intended. |
@stefanhengl and I caught up offline and sorted things out. After our conversation, I did realize that our boosting approach is hacky and confusing, now that we've brought in BM25. We should eventually work to remove it (I will keep an eye on this!) I pushed my alternate approach to this branch. I also added boosting to the debug score, so you can tell why some scores are very large: |
Relates to SPLF-838
With this change we recognize boosted queries in our BM25 scoring and adjust the overall score accordingly.
We need to take care of 2 parts: The overall BM25 score of the document, and the line score determining the order in which we return the chunks.
Test plan: