Fix BM25 score for PhraseDocIterator (#2404)
### What problem does this PR solve?

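Update the IDF formula for `PhraseDocIterator`: the phrase IDF is now the sum of the IDFs of the individual phrase terms, instead of a single IDF computed from the estimated phrase document frequency.

Reference:
https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/search/similarities/BM25Similarity.html#idfExplain(org.apache.lucene.search.CollectionStatistics,org.apache.lucene.search.TermStatistics[])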
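
Read off the diff in this commit, the change can be written roughly as follows. The symbols are notation introduced here for illustration, not names from the code base: $N$ stands for `total_df`, $\hat{n}$ for `estimate_doc_freq_`, $n_i$ for the document frequency returned by the $i$-th entry of `pos_iters_`, and $w$ for `weight_`.

```latex
% Before: one IDF from the estimated phrase document frequency
\mathrm{idf}_{\text{old}} = \ln\!\left(1 + \frac{N - \hat{n} + 0.5}{\hat{n} + 0.5}\right)

% After: sum of the per-term IDFs over the m terms of the phrase
\mathrm{idf}_{\text{new}} = \sum_{i=1}^{m} \ln\!\left(1 + \frac{N - n_i + 0.5}{n_i + 0.5}\right)

% In both versions the common score factor has the same shape
\text{bm25\_common\_score} = w \cdot \mathrm{idf} \cdot (k_1 + 1)
```

Since every summand is positive whenever $n_i \le N$, a multi-term phrase generally receives a larger IDF than any one of its terms alone, which is in line with the higher phrase scores in the updated test expectations.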

Issue link: #1320

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Test cases
yangzq50 authored Dec 24, 2024
1 parent ede4d2d commit 4973930
Showing 2 changed files with 13 additions and 9 deletions.
12 changes: 8 additions & 4 deletions src/storage/invertedindex/search/phrase_doc_iterator.cpp
@@ -42,10 +42,14 @@ void PhraseDocIterator::InitBM25Info(UniquePtr<FullTextColumnLengthReader> &&col
constexpr float b = 0.75F;

column_length_reader_ = std::move(column_length_reader);
-u64 total_df = column_length_reader_->GetTotalDF();
-float avg_column_len = column_length_reader_->GetAvgColumnLength();
-float smooth_idf = std::log1p((total_df - estimate_doc_freq_ + 0.5F) / (estimate_doc_freq_ + 0.5F));
-bm25_common_score_ = weight_ * smooth_idf * (k1 + 1.0F);
+const u64 total_df = column_length_reader_->GetTotalDF();
+const float avg_column_len = column_length_reader_->GetAvgColumnLength();
+float total_idf = 0.0f;
+for (const auto &iter : pos_iters_) {
+    const auto doc_freq = iter->GetDocFreq();
+    total_idf += std::log1p((total_df - doc_freq + 0.5F) / (doc_freq + 0.5F));
+}
+bm25_common_score_ = weight_ * total_idf * (k1 + 1.0F);
bm25_score_upper_bound_ = bm25_common_score_ / (1.0F + k1 * b / avg_column_len);
f1 = k1 * (1.0F - b);
f2 = k1 * b / avg_column_len;
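
The summed-IDF computation in this hunk can be tried in isolation with a small standalone sketch. The helpers `TermIdf` and `PhraseIdf` below are hypothetical names made up for illustration; they are not functions in the repository, and they take plain integers instead of the real iterator and length-reader types. Unlike the diff, the sketch converts the counts to `float` before subtracting, which is an illustrative choice and not a description of the project's code.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the repository): smoothed BM25 IDF of one term.
// total_docs plays the role of total_df in the diff.
inline float TermIdf(std::uint64_t total_docs, std::uint64_t doc_freq) {
    const float N = static_cast<float>(total_docs);
    const float n = static_cast<float>(doc_freq);
    return std::log1p((N - n + 0.5F) / (n + 0.5F));
}

// Hypothetical helper: phrase IDF as the sum of the per-term IDFs,
// mirroring the loop over pos_iters_ in the new code.
inline float PhraseIdf(std::uint64_t total_docs, const std::vector<std::uint64_t> &term_doc_freqs) {
    float total_idf = 0.0F;
    for (const auto doc_freq : term_doc_freqs) {
        total_idf += TermIdf(total_docs, doc_freq);
    }
    return total_idf;
}
```

For a two-term phrase such as `"social customs"`, a call like `PhraseIdf(total_docs, {df_social, df_customs})` (with placeholder frequencies) returns the sum of the two term IDFs, which is the value the new code multiplies by `weight_ * (k1 + 1.0F)`.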
10 changes: 5 additions & 5 deletions test/sql/dql/fulltext/fulltext.slt
@@ -36,26 +36,26 @@ Anarchism 30-APR-2012 03:25:17.000 0 22.299635
query TTIR rowsort
SELECT doctitle, docdate, ROW_ID(), SCORE() FROM sqllogic_test_enwiki SEARCH MATCH TEXT ('body^5', '"social customs"', 'topn=3;block_max=compare') USING INDEXES ('ft_index');
----
-Anarchism 30-APR-2012 03:25:17.000 6 20.753590
+Anarchism 30-APR-2012 03:25:17.000 6 27.133215

# only phrase
query TTIR rowsort
SELECT doctitle, docdate, ROW_ID(), SCORE() FROM sqllogic_test_enwiki SEARCH MATCH TEXT ('body^5', '"social customs"', 'topn=3;block_max=compare');
----
-Anarchism 30-APR-2012 03:25:17.000 6 20.753590
+Anarchism 30-APR-2012 03:25:17.000 6 27.133215

# phrase and term
query TTIR rowsort
SELECT doctitle, docdate, ROW_ID(), SCORE() FROM sqllogic_test_enwiki SEARCH MATCH TEXT ('body^5', '"social customs" harmful', 'topn=3');
----
Anarchism 30-APR-2012 03:25:17.000 0 22.299635
-Anarchism 30-APR-2012 03:25:17.000 6 20.753590
+Anarchism 30-APR-2012 03:25:17.000 6 27.133215

# phrase and term
query TTIR rowsort
-SELECT doctitle, docdate, ROW_ID(), SCORE() FROM sqllogic_test_enwiki SEARCH MATCH TEXT ('body^5', '"social customs" harmful', 'topn=3;threshold=21.5');
+SELECT doctitle, docdate, ROW_ID(), SCORE() FROM sqllogic_test_enwiki SEARCH MATCH TEXT ('body^5', '"social customs" harmful', 'topn=3;threshold=25');
----
-Anarchism 30-APR-2012 03:25:17.000 0 22.299635
+Anarchism 30-APR-2012 03:25:17.000 6 27.133215

# copy data from csv file
query I