Introduction
This issue outlines the problem of redundant allocations for dynamically generated tables in the IVF/PQ-based faiss indices and a proposed solution for 2.13. This issue does not address removing all of the redundant allocations for IVF/PQ-based indices; that will be addressed in a future issue.
Problem
For model-based k-NN indices, we load redundant information into memory per segment for search. Furthermore, we do not
account for all of this redundant overhead in our memory management system. This can lead to users under-provisioning
their clusters and OOM’ing (ref: #1301). Even when customers do not OOM, they are still unable to fully realize the performance benefits offered by model-based algorithms.
For IVF and PQ, there are several tables that get loaded redundantly into memory. The basic ones are the centroids used for quantization in PQ and for partitioning in IVF. These occupy 2^code_size*d*4 and 4*nlist*d bytes per faiss index, respectively. While these allocations are loaded redundantly, they are tracked accurately by the memory management system because they are serialized with each faiss index.
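For a sense of scale, with illustrative values d = 128, code_size = 8, and nlist = 2048 (example values, not prescribed by this issue), the per-index cost of these centroid tables is:

$$2^{8} \cdot 128 \cdot 4 = 131{,}072 \ \text{bytes} \approx 0.13\ \text{MB (PQ centroids)}, \qquad 4 \cdot 2048 \cdot 128 = 1{,}048{,}576 \ \text{bytes} \approx 1\ \text{MB (IVF centroids)}$$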
There are two other tables that are more problematic:
For HNSWPQ, a table is dynamically computed at index load time (it is not serialized) so that distances between centroids can be computed during graph build. It takes up bytes = 4 * pq_m * 2^pq_code_size * 2^pq_code_size of memory and is not used during search. See https://github.com/facebookresearch/faiss/blob/v1.7.4/faiss/impl/ProductQuantizer.cpp#L781
For IVFPQ with the l2 space type, a precomputed table is dynamically computed at index load time (it is not serialized) and is used during search. It takes up bytes = 4 * ivf_nlist * pq_m * 2^pq_code_size of memory. See https://github.com/facebookresearch/faiss/blob/v1.7.4/faiss/IndexIVFPQ.cpp#L388-L416
The dynamically computed tables for HNSWPQ and IVFPQ-l2 pose a more serious issue. First, they are large. In environments with a large number of segments per node, they quickly occupy a significant chunk of memory (see the graphs below for comparisons with and without duplication). Second, they are not accounted for in the memory management system because they are not serialized with the faiss index files, leading to unreliable memory throttling.
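To make the size concrete, with illustrative values pq_m = 64 and pq_code_size = 8, the HNSWPQ table alone occupies, per segment:

$$4 \cdot 64 \cdot 2^{8} \cdot 2^{8} = 16{,}777{,}216 \ \text{bytes} \approx 16\ \text{MB}$$

The IVFPQ-l2 table is larger still; a worked example for it appears in the IVFPQ-l2 section below.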
Out of Scope/Future work
Index/Query time redundant allocations — While there are duplicated allocations happening during indexing and querying, this issue only focuses on redundant allocations happening at index load (search path) time. We will take this up in the future.
Redundant storage — For some of the redundant allocations, we also redundantly store them on disk. This is inefficient, but out of scope for this document. We will take this up in the future.
Redundant allocations that are serialized - As mentioned above, the centroid tables for IVF and PQ are also duplicated. They are outside the scope of this issue because they do not consume as much memory, they are properly tracked by the memory management system, and sharing them would require a more extensive change. We will take this up in the future.
Solution
For 2.13, we will focus on removing redundancy for the dynamically computed tables for HNSWPQ and IVFPQ-l2. In the
future, we will remove more redundant allocations (another issue to come).
HNSWPQ
The HNSWPQ solution is fairly straightforward because the table is not used during search time, so we can simply disable it. To do this, we have added a flag to disable its computation in faiss (facebookresearch/faiss#3246). We will apply this flag to our index load logic in the faiss wrapper.
This change will remove the dynamically computed table and cause no regression in performance.
Pros:
Easy to implement
Removes redundant state
Cons: none I can think of
IVFPQ-l2
For IVFPQ-l2, the table being computed is used during search time and it does impact search latency (see APPENDIX B). The table is specific to the l2 space type and is not computed for inner-product indices. At a high level, faiss breaks the l2 distance computation down into portions that depend on the query and portions that depend only on the IVF and PQ centroids; the latter are pre-computed at load time, so fewer operations are needed per query. In order to build the table, faiss needs access to the pq and ivf centroids. For more details about the table, please refer to this faiss comment.
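For reference, the decomposition described in that comment can be summarized as follows, where x is the query, y_C the coarse (IVF) centroid, and y_R the PQ reconstruction of the residual:

$$\|x - y_C - y_R\|^2 = \underbrace{\|x - y_C\|^2}_{\text{term 1}} + \underbrace{\|y_R\|^2 + 2\langle y_C, y_R\rangle}_{\text{term 2}} - \underbrace{2\langle x, y_R\rangle}_{\text{term 3}}$$

Term 1 is already available from coarse quantization, term 3 is the usual per-query distance table, and term 2 depends only on the IVF and PQ centroids. Term 2 is what gets precomputed into the shared table, with ivf_nlist * pq_m * 2^pq_code_size entries.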
With typical values of ivf_nlist=2048, pq_m=64, and pq_code_size=8, and 100 segments on a node, these tables will have a memory footprint of about 12.5 GB, when only 0.125 GB is actually needed.
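This follows directly from the table size formula above:

$$4 \cdot 2048 \cdot 64 \cdot 2^{8} = 134{,}217{,}728 \ \text{bytes} \approx 0.125\ \text{GB per copy}, \qquad 100 \ \text{segments} \times 0.125\ \text{GB} \approx 12.5\ \text{GB}$$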
In order to share this table amongst faiss indices, we will need to complete the following 3 items:
Change faiss to be able to share this table across different faiss indices
Consume the faiss changes in our JNI layer
Manage shared allocations in the plugin
For (item 1), in faiss, we will need to change the IVFPQ index structure to switch from storing the table by value to storing it by reference. In doing this, we will need to ensure deallocation of the table happens in a safe and backwards-compatible way. There is a discussion in faiss (facebookresearch/faiss#3271) around the best way to do this. There is a possibility we can introduce a context object in order to switch on re-use of the table. However, this will impact serialization, so it will need to be taken up in the future. Until then, we will maintain a patch to introduce the functionality.
For (item 2), in the k-NN JNI layer, we will need to add methods to properly load, share, and free the table. To do this, we will modify/add several functions. The JNI layer will be responsible for correctly interpreting the SharedIndexStateAddress based on the indexType:
/**
 * [MODIFIED] We will need to modify this function to disable precomputing the
 * ivfpq-l2 table during index reading
 *
 * @param indexPath path to index
 * @return Address of loaded faiss index
 */
long loadIndex(String indexPath);

/**
 * [ADDED] We will add a method to initialize the shared index state
 *
 * @param indexAddr address of the index to initialize from
 * @param indexType integer representing the type of index we are initializing for
 * @return Address of shared index state address
 */
long initSharedIndexState(long indexAddr, int indexType);

/**
 * [ADDED] Set the shared index state for an index
 *
 * @param indexAddr address of index to set state for
 * @param shareIndexStateAddr address of shared state to be set
 * @param indexType integer representing the type of index we are setting state for
 */
void setSharedIndexState(long indexAddr, long shareIndexStateAddr, int indexType);

/**
 * [ADDED] Deallocate memory of the shared index state
 *
 * @param shareIndexStateAddr address of shared state
 * @param indexType integer representing the type of index we are freeing state for
 */
void freeSharedIndexState(long shareIndexStateAddr, int indexType);
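A minimal sketch, in Java, of how these methods fit together for a single segment. The class name, method name, and index-type constant are hypothetical; only the four JNI methods come from the interface above.

```java
// Sketch of the JNI surface above plus the intended call order for one IVFPQ-l2 segment.
// The class name and the index-type constant are illustrative assumptions, not plugin code.
public final class FaissSharedStateJni {

    static final int INDEX_TYPE_IVFPQ_L2 = 1; // hypothetical id for the ivfpq-l2 index type

    // Declarations mirroring the modified/added JNI methods described above.
    static native long loadIndex(String indexPath);
    static native long initSharedIndexState(long indexAddr, int indexType);
    static native void setSharedIndexState(long indexAddr, long shareIndexStateAddr, int indexType);
    static native void freeSharedIndexState(long shareIndexStateAddr, int indexType);

    /** Intended call order when loading the first IVFPQ-l2 segment for a model. */
    static long loadFirstSegment(String indexPath) {
        long indexAddr = loadIndex(indexPath);                                  // table pre-computation skipped
        long sharedAddr = initSharedIndexState(indexAddr, INDEX_TYPE_IVFPQ_L2); // build the table once
        setSharedIndexState(indexAddr, sharedAddr, INDEX_TYPE_IVFPQ_L2);        // attach it by reference
        return indexAddr;                                                       // index is now searchable
    }
}
```

Subsequent segments for the same model would skip initSharedIndexState and attach the already-built shared state instead.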
For (item 3), ideally we would like to guarantee the following via management:
1. We accurately estimate the memory the table will consume
2. We guarantee that use-after-free scenarios can never happen with the shared table
3. At most one copy of a given table is in memory at a time
4. Once no index is using the table, it is promptly deallocated from memory
5. An index with shared state is not available for search until its shared state has been set
For 2.13, we will focus on guarantees 2-5. We are forgoing guarantee 1 because, in the current architecture of the NativeMemoryCache, it is infeasible to introduce a new allocation type that IndexAllocations depend on. The current memory system’s limits are built completely around the NativeMemoryCache. So, the system estimates will not be completely accurate, but simply removing the redundant allocation will make them much more accurate than they are currently. In a later phase, we will address this by refactoring the memory management system.
In order to track SharedIndexState, we will create a new cache to manage the allocations. During index loading,
SharedIndexStateAllocations will be retrieved from this cache. The cache will be implemented with thread safety
mechanisms. SharedIndexStateAllocations will be reference counted. The following diagram shows the new IndexAllocation
load workflow.
Once an IndexAllocation is deallocated, the associated SharedIndexStateAllocation’s reference count will be decremented
by 1. Once the reference count hits 0, the SharedIndexStateAllocation will be freed.
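A minimal sketch of what such a reference-counted cache could look like. All names here (SharedIndexStateCache, Entry, acquire, release) are hypothetical; the production implementation in the plugin may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical reference-counted cache for SharedIndexStateAllocations; names are illustrative.
public final class SharedIndexStateCache {

    private static final class Entry {
        final long address; // native address of the shared table
        int refCount;       // number of IndexAllocations currently referencing it
        Entry(long address) { this.address = address; }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Get or create the shared state for a model and take a reference on it. */
    public synchronized long acquire(String modelId, LongSupplier allocator) {
        Entry entry = entries.computeIfAbsent(modelId, id -> new Entry(allocator.getAsLong()));
        entry.refCount++;
        return entry.address;
    }

    /** Drop a reference; free the native allocation once the last referencing index is evicted. */
    public synchronized void release(String modelId, LongConsumer deallocator) {
        Entry entry = entries.get(modelId);
        if (entry == null) {
            return; // nothing to release
        }
        if (--entry.refCount == 0) {
            entries.remove(modelId);
            deallocator.accept(entry.address); // e.g. freeSharedIndexState(address, indexType)
        }
    }
}
```

In this sketch, acquire would be called from the index load path (wrapping initSharedIndexState) and release from the IndexAllocation cleanup path (wrapping freeSharedIndexState).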
For (item 3), we will also need to add a UUID per model in the cluster. This is required because we will be caching shared model state and need to be able to differentiate different versions of a model (i.e. the situation where a user creates a model, builds an index, deletes the model, and then creates another model with the same id, etc.). For this work, please refer to (TO BE CREATED ISSUE). We are holding off on the UUID work initially, as this is already a bug in the current state.
Pros:
Properly removes redundant allocations
Easy to implement in a predictable manner
Potential extensibility in the future - we will need to refactor memory management to decouple management of different allocation types. Caching outside of the NativeMemoryCache will make this easier in the future.
Cons:
Does not track dynamically allocated table in the existing memory management system
Alternatives Considered
Disable the IVFPQ-l2 table completely
We could disable the IVFPQ-l2 table by passing a flag at index load time.
Pros:
Easiest to implement
Memory management system estimates will be accurate
Cons:
Query latency degradation - from APPENDIX B, we saw this resulted in a significant performance degradation
Initialize shared index state using model system index
Instead of using a loaded IndexAllocation to create the SharedIndexStateAllocation, we could read the model
from the model system index and compute the SharedIndexStateAllocation from here. Then, we pass this table to segments
after they are loaded.
Pros
Decoupling model components from data components will make future separation easier
Cons
More complex implementation
Will require a transport call during index load, which could potentially impact stability
Pre-compute the table at model creation and serialize with the model
Pros:
More efficient - table will only be computed once ever.
Cons:
Will cause the current memory management system to be inaccurate once we remove the redundant allocations; it will represent the state when there were redundant allocations, so we will over-estimate
Complex to selectively load the table during index reading
End State
For IVFPQ-l2 and HNSWPQ, the dynamically computed tables will no longer be replicated in the system. The memory
management estimate will be more accurate than before, but still not complete because the IVFPQ-l2 shared allocations
will not be tracked.
For HNSWPQ, the memory estimate for customers will be the following, where the only per-segment overhead will be the
centroid table:
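A sketch of the intended estimate, in bytes: the per-segment term is the PQ centroid table described above, the per-vector term covers the PQ codes, graph links, and overhead. The 1.1 factor and the 8 * hnsw_m graph term are assumptions consistent with the plugin's other estimation formulas rather than values stated in this section.

$$1.1 \times \left( \left( \frac{pq\_code\_size}{8} \cdot pq\_m + \text{overhead} + 8 \cdot hnsw\_m \right) \cdot num\_vectors + num\_segments \cdot 2^{pq\_code\_size} \cdot 4 \cdot d \right)$$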
where overhead is 24 bytes (determined empirically).
An example projection before and after the fix, with the ideal, will look like:
For IVFPQ-l2, the memory estimate for customers will be the following, where the per-segment overhead will be the PQ
centroid table and the ivf bucket table:
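A sketch of the corresponding estimate under the same caveats, with the per-segment term now covering both the PQ centroid table and the IVF bucket (centroid) table, and overhead assumed to be the same 24 bytes noted above:

$$1.1 \times \left( \left( \frac{pq\_code\_size}{8} \cdot pq\_m + \text{overhead} \right) \cdot num\_vectors + num\_segments \cdot \left( 2^{pq\_code\_size} \cdot 4 \cdot d + 4 \cdot ivf\_nlist \cdot d \right) \right)$$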
An example projection before and after the fix, with the ideal, will look like:
Backwards Compatibility
This solution is backwards compatible for the following reasons:
We are not changing any serialization/deserialization logic
We are not adding any additional cluster state
The change does not interact with the transport layer
Metrics
No metrics will be added with this change. The main reason is that, in the future, we will be refactoring the memory management system; if we add more metrics now, we will just have to deprecate them later, which does not make sense.
We will add relevant logging.
Appendix A: Experiments Showing Memory Gaps
This section details several experiments that were conducted to show the difference between what we estimate, what the
plugin estimates and what we observe.
Experimental Procedure
Create training index
Ingest training vectors (random data)
Train model
Create index from model
Ingest vectors (random data)
Force merge to 1 segment
Stop/Start docker to reset memory
Track memory usage
Warmup index
Track memory usage
Calculate sum of all faiss files
IVF
Conclusions
For IVF, the estimate when accounting for num_segments is consistent with file size estimate and formula estimate, for both inner product and l2 distances
HNSWPQ
Conclusions
The revised estimate (est 4), which includes the SDC allocation, closely approximates actual memory consumption for both l2 and inner product
IVFPQ
Conclusions
Appendix B: Experiments with and without the IVFPQ-l2 precomputed table
This experiment shows the latency implications for IVFPQ-l2 with and without the precomputed table enabled.
Experimental Procedure
Results w/ optimized query
Results w/o optimized query
Conclusions