[FEATURE] Add Support for Configurable n_list Limit in OpenSearch's FAISS KNN Implementation #2483

Open
StaVorosh opened this issue Feb 3, 2025 · 3 comments

Comments

@StaVorosh

StaVorosh commented Feb 3, 2025

Is your feature request related to a problem?
Yes. The current 20,000 cap on n_list in OpenSearch's FAISS KNN implementation restricts the ability to tune the balance between search accuracy and performance for larger datasets (more than 10 billion vectors). This is particularly frustrating for high-dimensional data or large-scale vector search use cases, where a higher n_list value could improve recall and precision.

What solution would you like?
I would like the ability to configure or increase the n_list parameter beyond the current limit of 20,000. This would allow users to better optimize FAISS's IVF for their specific datasets and use cases. Ideally, this could be implemented as a configurable parameter in the OpenSearch KNN plugin, with appropriate warnings or documentation about the potential performance trade-offs.
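
As a rough illustration of where the parameter is set today, here is a minimal Python sketch that builds the body of a k-NN model training request using an IVF method. The helper name `build_ivf_train_body`, the index and field names, the dimension, and the client-side `MAX_NLIST` check are all hypothetical. The check only mirrors the 20,000 cap described in this issue (the lower bound of 1 is an assumption) and is not the plugin's own validation code.

```python
import json

# Cap described in this issue; hard-coded here only for illustration.
MAX_NLIST = 20_000


def build_ivf_train_body(training_index, training_field, dimension, nlist, nprobes):
    """Build a request body for training a FAISS IVF model (hypothetical helper)."""
    # Client-side check that mimics the cap; the lower bound of 1 is an assumption.
    if not 1 <= nlist <= MAX_NLIST:
        raise ValueError(f"nlist={nlist} is outside the allowed range 1..{MAX_NLIST}")
    return {
        "training_index": training_index,
        "training_field": training_field,
        "dimension": dimension,
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "space_type": "l2",
            "parameters": {"nlist": nlist, "nprobes": nprobes},
        },
    }


# Placeholder index, field, and dimension values.
body = build_ivf_train_body("train-index", "train-field", 768, 20_000, 64)
print(json.dumps(body, indent=2))
```

As I understand the k-NN plugin's training API, a body like this is sent to `POST /_plugins/_knn/models/<model_id>/_train`. The request here is for the upper bound in that check to become configurable.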

What alternatives have you considered?
Using other indexing methods, such as HNSW, which does not use n_list but has its own trade-offs in memory usage and search performance.

Running standalone FAISS outside of OpenSearch, though this would sacrifice the distributed capabilities of OpenSearch.

Adjusting other FAISS parameters (e.g., n_probe) to compensate for the lower n_list, but this does not always provide the desired level of accuracy.

Do you have any additional context?
I noticed that OpenSearch imposes a limit of 20,000 for the n_list parameter when using FAISS for KNN search. Could you please explain the reasoning behind this limitation? Specifically:

  • Is this restriction related to performance considerations, such as indexing or query latency?
  • Are there technical constraints in the integration of FAISS with OpenSearch that necessitate this limit?
  • Are there plans to increase or make this limit configurable in future releases?

Additionally, if I need a higher n_list value for my use case, what alternatives or workarounds would you recommend?

Thank you for your insights!

@jmazanec15
Member

Initially we were conservative with respect to this parameter. So, we set max at 20K. What nlist value are you looking to support?

@StaVorosh
Author

Thanks for the response! For large-scale datasets (e.g., 1B+ vectors), we need n_list in the range of hundreds of thousands to millions of centroids (e.g., 100K–1M) to achieve optimal accuracy.
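
To show where a range like this can come from, here is a small Python sketch of a rule of thumb often quoted from the FAISS index-selection guidelines: pick nlist between 4·sqrt(N) and 16·sqrt(N), where N is the number of vectors in the index. The guideline is stated there for smaller datasets, so extrapolating it to 1B vectors is an assumption, and the function name `nlist_range` is hypothetical.

```python
import math


def nlist_range(num_vectors):
    """Return (low, high) nlist bounds from the 4*sqrt(N)..16*sqrt(N) rule of thumb."""
    root = math.sqrt(num_vectors)
    return round(4 * root), round(16 * root)


# N is the number of vectors in a single IVF index.
low, high = nlist_range(1_000_000_000)
print(f"suggested nlist for 1B vectors: {low:,} to {high:,}")
```

Under this heuristic, a single index holding 1B vectors lands in the low hundreds of thousands of centroids, well above the 20,000 cap.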

@jmazanec15
Member

jmazanec15 commented Feb 6, 2025

For OpenSearch, we have one IVF per segment per shard. So, typically, we don't recommend a very large number of clusters, because a significant portion of the data structures will be duplicated. For instance, if you have 10 shards on a node, each with 10 segments, the centroids will be duplicated 10x10 times. We've mitigated some of this overhead (#1507), but that still doesn't remove all of it. Thus, having too many centroids can cause memory concerns, and that's why we capped it at 20k.
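
The duplication described above can be put into rough numbers. The Python sketch below is a worst-case estimate that counts only raw float32 centroids and assumes no sharing between segments, so it ignores the mitigation in #1507 and every other data structure. The 768-dimension figure and the function name `centroid_memory_bytes` are assumptions for illustration.

```python
def centroid_memory_bytes(nlist, dimension, shards_per_node, segments_per_shard,
                          bytes_per_component=4):
    """Worst-case bytes spent on centroids when each segment holds its own copy."""
    per_index = nlist * dimension * bytes_per_component  # one IVF centroid table
    copies = shards_per_node * segments_per_shard        # one IVF per segment per shard
    return per_index * copies


# 10 shards x 10 segments as in the example above; 768 dimensions is an assumption.
for nlist in (20_000, 100_000, 1_000_000):
    gib = centroid_memory_bytes(nlist, 768, 10, 10) / 2**30
    print(f"nlist={nlist:>9,}: about {gib:.1f} GiB of centroids per node")
```

With these assumed numbers, the centroid tables alone grow from a few GiB per node at the current cap to hundreds of GiB at one million centroids.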
