
Generalize Custom search() Method #1826

Open · sam-hey opened this issue Jan 17, 2025 · 2 comments

sam-hey (Contributor) commented Jan 17, 2025

Currently, only BM25 uses a custom implementation of the search() method, achieved by checking if the model name is bm25s. This approach is not scalable or practical for future implementations requiring custom search methods, such as ColBERT with an index. A more flexible and modular solution is needed to accommodate diverse search strategies.

        elif (
            hasattr(self.retriever.model.model, "mteb_model_meta")
            and self.retriever.model.model.mteb_model_meta.name == "bm25s"
        ):
            return self.retriever.model.model.search(
                corpus,
                queries,
                self.top_k,
                task_name=self.task_name,  # type: ignore
            )

https://github.com/embeddings-benchmark/mteb/blob/main/mteb/evaluation/evaluators/RetrievalEvaluator.py#L472:L475
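One way to drop the name check, as a rough sketch: duck-type on a model-provided `search` method instead of matching on `mteb_model_meta.name`. The function and helper names below (`run_search`, `_default_dense_search`) are invented for illustration and are not MTEB's actual API.

```python
def run_search(model, corpus, queries, top_k, task_name):
    """If the model implements its own search(), delegate to it;
    otherwise fall back to the evaluator's standard retrieval path."""
    if callable(getattr(model, "search", None)):
        return model.search(corpus, queries, top_k, task_name=task_name)
    return _default_dense_search(model, corpus, queries, top_k)


def _default_dense_search(model, corpus, queries, top_k):
    """Stand-in for the standard encode-and-score path (stubbed here)."""
    return {qid: {} for qid in queries}
```

With this, BM25, ColBERT-with-index, or any future model opts in simply by defining `search()`, and the evaluator never needs to know model names.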

KennethEnevoldsen (Contributor) commented:

Completely agree. Would be nice to just specify an interface for this, such that any model could implement it. (cc @orionw)
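Such an interface could be sketched with `typing.Protocol`, so models satisfy it structurally without inheriting from an MTEB base class. `SearchProtocol` and its exact signature are assumptions here, not existing MTEB code.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SearchProtocol(Protocol):
    """Structural interface a model can implement to take over retrieval."""

    def search(
        self,
        corpus: dict[str, Any],
        queries: dict[str, str],
        top_k: int,
        *,
        task_name: str,
    ) -> dict[str, dict[str, float]]: ...


class BM25Stub:
    """Toy model: satisfies the protocol without inheriting from it."""

    def search(self, corpus, queries, top_k, *, task_name):
        return {qid: {} for qid in queries}
```

Because of `@runtime_checkable`, `isinstance(BM25Stub(), SearchProtocol)` is `True`; note the check only verifies that a `search` attribute exists, not that its signature matches.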

orionw (Contributor) commented Jan 17, 2025

Thanks for raising @sam-hey!

I can definitely see the benefit! On the other hand, having it standardized makes it so each model class has the same function and is more reliable that way.

I can see both sides, but personally I think I would prefer to keep the core search functions in MTEB, so users can see them there and assume each model searches the same within their own “class” (eg that all dense retrievers use the same base functionality). I think it’d be great if we made BM25 a first class MTEB model so we didn’t have to rely on that (and could also add other sparse non-neural versions like Pyserini).

OTOH, there are probably 3-ish other model “classes” or types that would involve different search functionality: multi-vector (like ColBERT, as you say), and then perhaps neural sparse retrieval (like SPLADE) and generative retrieval.

So we should definitely make it so that each of these could be added, which as @KennethEnevoldsen says likely involves a change to the interface. But since there are less than 10 model “classes”, it seems like we could do that with an if statement. But perhaps it’s too early in the morning and I’m missing something!
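The if-statement-over-model-classes approach could look roughly like the following. The class names are hypothetical stand-ins, and each branch returns a label in place of the real search routine it would call.

```python
class DenseModel: ...
class MultiVectorModel(DenseModel): ...  # e.g. ColBERT-style late interaction
class SparseModel: ...                   # e.g. BM25, SPLADE


def dispatch_search(model, corpus, queries, top_k):
    """One branch per model 'class'; each branch would invoke that
    family's search implementation (stubbed with labels here)."""
    if isinstance(model, MultiVectorModel):
        return "multi-vector search"
    elif isinstance(model, SparseModel):
        return "sparse search"
    else:
        return "dense search"
```

The trade-off versus the protocol approach: the dispatch stays visible in one place inside MTEB, but adding a new model family means editing the evaluator rather than just implementing an interface.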
