This project involves fine-tuning cross-encoder re-rankers and evaluating them on the MS MARCO dataset. The models used are MiniLM, TinyBERT, and DistilRoBERTa. Additionally, the project explores ensemble methods for combining different models' ranking outputs and uses Large Language Models (LLMs) for query expansion to improve document retrieval.
Fine-tune three pre-trained cross-encoder models for the MS MARCO re-ranking task:
- cross-encoder/ms-marco-MiniLM-L-2-v2
- cross-encoder/ms-marco-TinyBERT-L-2-v2
- distilroberta-base
Each model is fine-tuned for one hour using the Adam optimizer with a learning rate of 2e-5 and a warm-up phase of 5000 steps (a minimal training sketch follows the metric list below). Performance is evaluated on the TREC DL'19 dataset using the following metrics:
- NDCG@10 (Normalized Discounted Cumulative Gain at rank 10)
- Recall@100 (proportion of relevant documents retrieved within the top 100)
- MAP@1000 (Mean Average Precision at rank 1000)
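The sketch below shows how such a fine-tuning run could look, assuming the classic sentence-transformers `CrossEncoder.fit` API; the file paths, batch size, and the two hand-written training pairs are purely illustrative and not taken from the project code.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Load one of the pre-trained cross-encoders (MiniLM shown here).
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2", num_labels=1, max_length=512)

# Training pairs: (query, passage) with label 1.0 for relevant and 0.0 for
# non-relevant passages, built from MS MARCO training triples (illustrative).
train_samples = [
    InputExample(texts=["what is a cross-encoder", "A cross-encoder scores a query-passage pair jointly."], label=1.0),
    InputExample(texts=["what is a cross-encoder", "The weather in Paris is mild in spring."], label=0.0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

# Fine-tune with Adam-style optimisation, lr=2e-5 and 5000 warm-up steps,
# matching the settings described above.
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=5000,
    optimizer_params={"lr": 2e-5},
    output_path="output/ms-marco-MiniLM-finetuned",
)
```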
Apply five ensemble methods to the ranking outputs of the fine-tuned models. These methods combine the individual rankings into a single aggregated ranking to improve retrieval effectiveness. The ensemble methods used are:
- Sum
- MNZ (CombMNZ: the summed score multiplied by the number of runs that retrieve the document)
- RRF (Reciprocal Rank Fusion)
- Max
- Min
The effectiveness of these methods is evaluated using the following metrics (a fusion-and-evaluation sketch using ranx follows this list):
- NDCG@10
- Recall@100
- MAP@1000
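A minimal sketch of how the fusion and evaluation could be wired up with the ranx library cited in the references; the run and qrels file names are illustrative assumptions.

```python
from ranx import Qrels, Run, evaluate, fuse

# TREC-format qrels and per-model run files (illustrative names).
qrels = Qrels.from_file("2019qrels-pass.txt", kind="trec")
runs = [
    Run.from_file("run_minilm.txt", kind="trec"),
    Run.from_file("run_tinybert.txt", kind="trec"),
    Run.from_file("run_distilroberta.txt", kind="trec"),
]

# Fuse the three rankings with each method and evaluate the fused run.
for method in ["sum", "mnz", "rrf", "max", "min"]:
    fused = fuse(runs=runs, norm="min-max", method=method)
    scores = evaluate(qrels, fused, ["ndcg@10", "recall@100", "map@1000"])
    print(method, scores)
```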
Select the most effective ensemble method identified in Task 2 and apply it to all possible combinations of the fine-tuned models. The combinations evaluated are:
- MiniLM + TinyBERT
- MiniLM + DistilRoBERTa
- TinyBERT + DistilRoBERTa
- MiniLM + TinyBERT + DistilRoBERTa
The performance is evaluated again using NDCG@10, Recall@100, and MAP@1000.
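Continuing the ranx-based sketch above, iterating over the model combinations could look like the following; it assumes RRF as the selected method (the best performer in the results below), and file names remain illustrative.

```python
from itertools import combinations

from ranx import Qrels, Run, evaluate, fuse

qrels = Qrels.from_file("2019qrels-pass.txt", kind="trec")
model_runs = {
    "MiniLM": Run.from_file("run_minilm.txt", kind="trec"),
    "TinyBERT": Run.from_file("run_tinybert.txt", kind="trec"),
    "DistilRoBERTa": Run.from_file("run_distilroberta.txt", kind="trec"),
}

# Apply the best-performing fusion method (RRF) to every combination of
# two or three models and report the same metrics as before.
for size in (2, 3):
    for combo in combinations(model_runs, size):
        fused = fuse(runs=[model_runs[name] for name in combo], method="rrf")
        scores = evaluate(qrels, fused, ["ndcg@10", "recall@100", "map@1000"])
        print(" + ".join(combo), scores)
```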
Implement query expansion using a Large Language Model (LLM) for the given 43 queries. Two methods are explored:
- Query Expansion without Pseudo-Relevance Feedback (PRF) using an LLM.
- Query Expansion with PRF, using the top 3 documents retrieved for the original query.
The effectiveness of these expansions is evaluated using the same metrics as in the previous tasks.
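A sketch of the two prompting strategies, loosely following the prompting style described by Jagerman et al. (cited below). `llm_generate` is a placeholder for whichever LLM client the notebook actually uses, and the prompt wording is an illustrative assumption, not the project's exact prompt.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder: call whichever LLM backend is used (hosted API or local model)."""
    raise NotImplementedError


def expand_query(query: str) -> str:
    # Query expansion without PRF: ask the LLM for a passage that answers the
    # query, then concatenate the original query with the generated text.
    prompt = f"Write a short passage that answers the following question.\nQuestion: {query}\nPassage:"
    expansion = llm_generate(prompt)
    return f"{query} {expansion}"


def expand_query_prf(query: str, top_docs: list[str]) -> str:
    # Query expansion with PRF: include the top-3 documents retrieved for the
    # original query as context before asking for the expansion.
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(top_docs[:3]))
    prompt = (
        "Based on the retrieved passages below, write a short passage that answers the question.\n"
        f"{context}\nQuestion: {query}\nPassage:"
    )
    expansion = llm_generate(prompt)
    return f"{query} {expansion}"
```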
This project is implemented across several Python files and a Jupyter notebook:
- `fine_tuning_cross_encoders.py`: Fine-tunes the cross-encoder models on the MS MARCO dataset.
- `evaluate_model_1.py`: Evaluates the first fine-tuned model and generates ranking files.
- `evaluate_model_2.py`: Evaluates the second fine-tuned model and generates ranking files.
- `evaluate_model_3.py`: Evaluates the third fine-tuned model and generates ranking files.
- `ensemble_methods_and_evaluation.py`: Implements the ensemble methods and evaluates the performance of the different model combinations.
- `Fine_Tuning_and_Query_Expansion_for_IR.ipynb`: Implements query expansion using an LLM for the given queries.
| Model | NDCG@10 | Recall@100 | MAP@1000 | Training Steps |
|---|---|---|---|---|
| MiniLM | 0.45 | 0.60 | 0.30 | 11169 |
| TinyBERT | 0.40 | 0.55 | 0.28 | 31999 |
| DistilRoBERTa | 0.43 | 0.58 | 0.29 | 999 |

| Ensemble Method | NDCG@10 | Recall@100 | MAP@1000 |
|---|---|---|---|
| Sum | 0.65 | 0.51 | 0.44 |
| MNZ | 0.65 | 0.51 | 0.44 |
| RRF | 0.68 | 0.51 | 0.45 |
| Max | 0.62 | 0.50 | 0.41 |
| Min | 0.62 | 0.43 | 0.39 |
- Fine-Tuning Results: The MiniLM model performed best on NDCG@10, suggesting it is the strongest at placing the most relevant documents at the top of the ranking. DistilRoBERTa achieved higher Recall@100 than TinyBERT despite far fewer training steps, indicating its ability to retrieve a broader set of relevant documents.
- Ensemble Methods: RRF was the most effective fusion method, combining the strengths of the individual models: it achieved the highest NDCG@10 and MAP@1000 and matched the best Recall@100.
- Best Ensemble Combination: The combination of MiniLM + TinyBERT performed the best in terms of NDCG@10, Recall@100, and MAP@1000, making it the most effective model combination.
This project demonstrates the process of fine-tuning and evaluating cross-encoder models for MS MARCO retrieval tasks. The use of ensemble methods significantly improves retrieval performance, and MiniLM + TinyBERT emerges as the most effective combination for document re-ranking.
- Ranx Fuse: Bassani, E., et al. "ranx.fuse: A Python Library for Metasearch." Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.
- Query Expansion: Jagerman, R., et al. "Query Expansion by Prompting Large Language Models." ACM SIGIR 2023.
- MS MARCO: https://microsoft.github.io/msmarco/
- TREC DL'19: https://trec.nist.gov/data/deep/2019qrels-pass.txt