Remove redundant dynamically allocated tables at load time #1507

Closed
jmazanec15 opened this issue Mar 5, 2024 · 1 comment
Labels: Bug Fixes, v2.13.0


jmazanec15 commented Mar 5, 2024

Introduction

This issue outlines the problem of redundant allocations for dynamically generated tables in the IVF/PQ-based faiss
indices and proposes a solution for 2.13. It does not address removing all of the redundant allocations for
IVF/PQ-based indices; that will be addressed in a future issue.

Problem

For model-based k-NN indices, we load redundant information into memory per segment for search. Furthermore, we do not
account for all of this redundant overhead in our memory management system. This can lead to users under-provisioning
their clusters and OOM’ing (ref: #1301). Even if customers do not OOM,
they are still unable to fully realize the performance benefits offered by model-based algorithms.

For IVF and PQ, there are several tables that get loaded redundantly in memory. The basic ones are the centroids used
for quantization in PQ and for partitioning in IVF. These occupy 2^code_size*d*4 and 4*nlist*d bytes per
faiss index, respectively. While these allocations are loaded redundantly, they are tracked accurately by the memory
management system because they are serialized with each faiss index.

There are two other tables that are more problematic:

  1. For HNSWPQ, a table is dynamically computed at index load time (not serialized) so that distances between centroids can be computed during graph build time. It occupies 4 * pq_m * 2^pq_code_size * 2^pq_code_size bytes of memory and is not used during search time. See https://github.com/facebookresearch/faiss/blob/v1.7.4/faiss/impl/ProductQuantizer.cpp#L781
  2. For IVFPQ-l2, there is a dynamically computed table that stores components of the l2 distance computation to make search faster. It occupies 4 * ivf_nlist * pq_m * 2^pq_code_size bytes of memory. See https://github.com/facebookresearch/faiss/blob/v1.7.4/faiss/IndexIVFPQ.cpp#L388-L416

The dynamically computed tables for HNSWPQ and IVFPQ-l2 pose a more serious issue. First, they are large. In
environments with a large number of segments per node, they quickly occupy a significant chunk of memory. See the graphs
below for comparisons of memory usage with and without duplication. Second, they are not accounted for in the memory
management system because they are not serialized with the faiss index files, leading to unreliable memory throttling.
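
To make the scale concrete, below is a small sketch (not part of the plugin) that plugs one set of parameters into the
formulas above. The dimension d = 128 is an assumed value; nlist, pq_m, pq_code_size and the 100-segment count mirror
the example used later in this issue.

public class RedundantTableSizes {
    public static void main(String[] args) {
        int d = 128;          // vector dimension (assumed, for illustration)
        int pqCodeSize = 8;   // bits per PQ code
        int pqM = 64;         // number of PQ sub-quantizers
        int ivfNlist = 2048;  // number of IVF partitions
        int numSegments = 100;

        long ksub = 1L << pqCodeSize;                   // 2^pq_code_size
        long pqCentroids = ksub * d * 4;                // serialized with each faiss index
        long ivfCentroids = 4L * ivfNlist * d;          // serialized with each faiss index
        long sdcTable = 4L * pqM * ksub * ksub;         // HNSWPQ, built dynamically at load time
        long ivfpqL2Table = 4L * ivfNlist * pqM * ksub; // IVFPQ-l2, built dynamically at load time

        System.out.printf("pq centroids:   %,d bytes per segment%n", pqCentroids);
        System.out.printf("ivf centroids:  %,d bytes per segment%n", ivfCentroids);
        System.out.printf("sdc table:      %,d bytes per segment%n", sdcTable);
        System.out.printf("ivfpq-l2 table: %,d bytes per segment (%,d bytes across %d segments)%n",
                ivfpqL2Table, ivfpqL2Table * numSegments, numSegments);
    }
}

With these values, the two dynamically computed tables dominate the per-segment footprint relative to the serialized
centroid tables.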

Out of Scope/Future work

  1. Index/Query time redundant allocations — While there are duplicated allocations happening during indexing and querying, this issue only focuses on redundant allocations happening at index load (search path) time. We will take this up in the future.
  2. Redundant storage — For some of the redundant allocations, we also redundantly store them on disk. This is inefficient, but out of scope for this document. We will take this up in the future.
  3. Redundant allocations that are serialized - as mentioned before, the centroid tables for IVF and PQ are also duplicated. They are outside of the scope of this issue because they do not consume as much memory, are properly tracked in the memory management system, and sharing them would require a more extensive change. We will take this up in the future.

Solution

For 2.13, we will focus on removing redundancy for the dynamically computed tables for HNSWPQ and IVFPQ-l2. In the
future, we will remove more redundant allocations (another issue to come).

HNSWPQ

The HNSWPQ solution is fairly straightforward because the table is not used during search time, so we can simply
disable its computation. To do this, we have added a flag to faiss that skips building the table
(facebookresearch/faiss#3246). We will apply this flag in our index load logic in the
faiss wrapper.

jlong knn_jni::faiss_wrapper::LoadIndex(knn_jni::JNIUtilInterface * jniUtil, JNIEnv * env, jstring indexPathJ) {
    if (indexPathJ == nullptr) {
        throw std::runtime_error("Index path cannot be null");
    }

    std::string indexPathCpp(jniUtil->ConvertJavaStringToCppString(env, indexPathJ));
    // IO_FLAG_PQ_SKIP_SDC_TABLE tells faiss not to build the HNSWPQ SDC table on load; the table is
    // only needed at graph-construction time, not for search.
    faiss::Index* indexReader = faiss::read_index(indexPathCpp.c_str(), faiss::IO_FLAG_READ_ONLY | faiss::IO_FLAG_PQ_SKIP_SDC_TABLE);
    return (jlong) indexReader;
}

This change will remove the dynamically computed table and cause no regression in performance.

Pros:

  1. Easy to implement
  2. Removes redundant state

Cons: none I can think of

IVFPQ-l2

For IVFPQ-l2, the table being computed is used during search time and does impact search latency
(see APPENDIX B). The table is specific to the l2 space type
and is not computed for inner-product indices. At a high level, faiss breaks the l2 distance computation into
query-dependent and query-independent portions and pre-computes the query-independent portions (which depend only on
the centroids) at load time, so it can perform fewer operations during a query. In order to build the table, faiss
needs access to the pq and ivf centroids. For more details about the table, please refer to this
faiss comment.
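
Paraphrasing that comment: for a query x, a coarse (IVF) centroid y_C, and a PQ-reconstructed residual y_R, faiss
expands the distance as

d = || x - y_C - y_R ||^2 = || x - y_C ||^2 + (|| y_R ||^2 + 2 * (y_C | y_R)) - 2 * (x | y_R)

The first term is already available from coarse quantization and the last term depends on the query, but the middle
term depends only on the centroids, so it can be precomputed once per (ivf list, pq sub-quantizer, pq code)
combination. Those are exactly the ivf_nlist * pq_m * 2^pq_code_size entries that make up the table.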

With typical values of ivf_nlist=2048, pq_m=64, and pq_code_size=8, and 100 segments on the node, this table yields a
memory footprint of about 12.5 GB when only 0.125 GB is needed.

In order to share this table amongst faiss indices, we will need to complete the following three items:

  1. Change faiss to be able to share this table across different faiss indices
  2. Consume faiss changes in our JNI layer
  3. Manage shared allocations in the plugin

For (item 1), in faiss, we will need to change the IVFPQ index structure to switch from storing the table by value to
storing the table by reference. In doing this, we will need to ensure deallocation of the table happens in a safe and
backwards-compatible way. There is a discussion in faiss (facebookresearch/faiss#3271) around
the best way to do this. There is a possibility we can introduce context in order to switch to re-using the table.
However, this would impact serialization, so it will need to be taken up in the future. Until then, we will maintain a
patch to introduce the functionality.

For (item 2), in the k-NN JNI layer, we will need to add methods to properly load, share, and free the table. To do this,
we will modify/add several functions. The JNI layer will be responsible for correctly interpreting the
SharedIndexStateAddress based on the indexType:

/**
 * [MODIFIED] We will need to modify this function to disable precomputing the 
 * ivfpq-l2 table during index reading
 * 
 * @param indexPath path to index
 * @return Address of loaded faiss index
 */
long loadIndex(String indexPath);

/**
 * [ADDED] We will add a method to initialize the shared index state
 * 
 * @param indexAddr address of the index to initialize from
 * @param indexType integer representing the type of index we are initializing for
 * @return Address of the shared index state
 */
long initSharedIndexState(long indexAddr, int indexType);

/**
 * [ADDED] Set the index state for an index
 *
 * @param indexAddr address of index to set state for
 * @param shareIndexStateAddr address of shared state to be set
 */
void setSharedIndexState(long indexAddr, long shareIndexStateAddr);

/**
 * [ADDED] Deallocate memory of the shared index state
 * 
 * @param shareIndexStateAddr address of shared state
 * @param indexType integer representing the type of index whose shared state is being freed
 */
void freeSharedIndexState(long shareIndexStateAddr, int indexType);
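
For illustration, below is a minimal sketch of the intended load-time call sequence. The KNNJni interface and the
loadWithSharedState helper are hypothetical stand-ins for however the plugin actually wires these calls together; only
the four method signatures come from the list above.

public class SharedStateLoadSketch {

    /** Stand-in for the JNI bridge declared above. */
    interface KNNJni {
        long loadIndex(String indexPath);
        long initSharedIndexState(long indexAddr, int indexType);
        void setSharedIndexState(long indexAddr, long shareIndexStateAddr);
        void freeSharedIndexState(long shareIndexStateAddr, int indexType);
    }

    /** Loads an index and attaches shared state before the index is exposed for search. */
    static long loadWithSharedState(KNNJni jni, String indexPath, int indexType) {
        // 1. Load the index with precomputation of the ivfpq-l2 table disabled.
        long indexAddr = jni.loadIndex(indexPath);
        // 2. Build the shared table from the loaded index (or reuse a cached copy; see the cache sketch below).
        long shareIndexStateAddr = jni.initSharedIndexState(indexAddr, indexType);
        // 3. Attach the shared table; the index must not serve queries before this call.
        jni.setSharedIndexState(indexAddr, shareIndexStateAddr);
        // 4. jni.freeSharedIndexState(shareIndexStateAddr, indexType) is invoked later, once no
        //    index references the shared state (see the reference-counting sketch below).
        return indexAddr;
    }
}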

For (item 3), ideally we would like the management layer to guarantee the following:

  1. We accurately estimate the memory the table will consume
  2. We guarantee that use-after-free scenarios can never happen with the shared table
  3. One copy of a table is in memory at a time
  4. Once no index is using the table, it is promptly deallocated from memory
  5. An index with shared state is not available for search until its shared state has been set

For 2.13, we will focus on guarantees 2-5. We are forgoing guarantee 1 because, in the current architecture of the
NativeMemoryCache, it is infeasible to introduce a new allocation type that IndexAllocations depend on. The current
memory system’s limits are built entirely around the NativeMemoryCache. So, the system estimates will not
be completely accurate, but simply removing the redundant allocation will make them much more accurate than they are
currently. In a later phase, we will address this by refactoring the memory management system.

In order to track SharedIndexState, we will create a new cache to manage the allocations. During index loading,
SharedIndexStateAllocations will be retrieved from this cache. The cache will be implemented with thread safety
mechanisms. SharedIndexStateAllocations will be reference counted. The following diagram shows the new IndexAllocation
load workflow.

(diagram: IndexAllocation load workflow with SharedIndexStateAllocation retrieval)

Once an IndexAllocation is deallocated, the associated SharedIndexStateAllocation’s reference count will be decremented
by 1. Once the reference count hits 0, the SharedIndexStateAllocation will be freed.

(diagram: IndexAllocation eviction decrementing the SharedIndexStateAllocation reference count)
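
Below is a minimal sketch (not the plugin's actual implementation) of such a reference-counted cache. The class,
method, and parameter names are hypothetical; the two callbacks stand in for the initSharedIndexState and
freeSharedIndexState JNI calls described above.

import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

public class SharedIndexStateCache {

    private static class Entry {
        final long address;  // native address of the SharedIndexStateAllocation
        int refCount;

        Entry(long address) {
            this.address = address;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Returns the shared state address for a model, computing it at most once, and increments its reference count. */
    public synchronized long acquire(String modelId, LongSupplier initSharedIndexState) {
        Entry entry = entries.computeIfAbsent(modelId, k -> new Entry(initSharedIndexState.getAsLong()));
        entry.refCount++;
        return entry.address;
    }

    /** Decrements the reference count and frees the native allocation once no IndexAllocation references it. */
    public synchronized void release(String modelId, LongConsumer freeSharedIndexState) {
        Entry entry = entries.get(modelId);
        if (entry == null) {
            return;
        }
        entry.refCount--;
        if (entry.refCount == 0) {
            entries.remove(modelId);
            freeSharedIndexState.accept(entry.address);
        }
    }
}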

Also for (item 3), we will need to add a UUID per model in the cluster. This is required because we will be caching shared
model state and will need to be able to differentiate between different versions of a model (i.e. the situation where a
user creates a model, builds an index, deletes the model and creates another model with the same id, etc.). For this work,
please refer to (TO BE CREATED ISSUE).
We are holding off on the UUID work initially, as this is a bug in the current state as well.

Pros:

  1. Properly removes redundant allocations
  2. Easy to implement in a predictable manner
  3. Potential extensibility in the future - we will need to refactor memory management to decouple management of different allocation types. Caching outside of the NativeMemoryCache will make this easier in the future.

Cons:

  1. Does not track dynamically allocated table in the existing memory management system

Alternatives Considered

Disable the IVFPQ-l2 table completely

We could disable the IVFPQ-l2 table by passing a flag at index load time.

Pros:

  1. Easiest to implement
  2. Memory management system estimates will be accurate

Cons:

  1. Query latency degradation - from APPENDIX B, we saw this resulted in a significant performance degradation

Initialize shared index state using model system index

Instead of using a loaded IndexAllocation to create the SharedIndexStateAllocation, we could read the model
from the model system index and compute the SharedIndexStateAllocation from there. Then, we would pass this table to
segments after they are loaded.

Pros

  1. Decoupling model components from data components will make future separation easier

Cons

  1. More complex implementation
  2. Will require transport call during index load, which could potentially impact stability

Pre-compute the table at model creation and serialize with the model

Pros:

  1. More efficient - table will only be computed once ever.

Cons:

  1. Will cause the current memory management system to be inaccurate once we remove redundant allocations. It will represent the state when there were redundant allocations, so we will over-estimate
  2. Complex to selectively load the table during index reading

End State

For IVFPQ-l2 and HNSWPQ, the dynamically computed tables will no longer be replicated in the system. The memory
management estimate will be more accurate than before, but still not complete because the IVFPQ-l2 shared allocations
will not be tracked.

For HNSWPQ, the memory estimate for customers will be the following, where the only per-segment overhead will be the
centroid table:

bytes = 1.1 * ((pq_code_size / 8 * pq_m + overhead + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d))

where overhead is 24 bytes (determined empirically).

An example projection before and after the fix, with the ideal, will look like:

(graph: projected HNSWPQ memory consumption before and after the fix, compared to the ideal)

For IVFPQ-l2, the memory estimate for customers will be the following, where the per-segment overhead will be the PQ
centroid table and the ivf bucket table:

bytes = 1.1 * ((pq_code_size / 8 * pq_m + overhead) * num_vectors + num_segments * (2^pq_code_size * 4 * d + 4 * ivf_nlist * d) + pq_m * 2^pq_code_size * ivf_nlist * 4)

An example projection before and after the fix, with the ideal, will look like:
(graph: projected IVFPQ-l2 memory consumption before and after the fix, compared to the ideal)
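
Since the two formulas above are easy to misread in plain text, below is a small calculator sketch (not part of the
plugin) that evaluates both estimates. All parameter values, including d = 128, hnsw_m = 16, ten million vectors,
100 segments, and reusing the 24-byte overhead for the IVFPQ-l2 estimate, are illustrative assumptions.

public class MemoryEstimateSketch {

    static double hnswPqBytes(int pqCodeSize, int pqM, int hnswM, long numVectors, int numSegments, int d) {
        double overhead = 24;  // bytes per vector, determined empirically (see above)
        return 1.1 * ((pqCodeSize / 8.0 * pqM + overhead + 8.0 * hnswM) * numVectors
                + numSegments * (Math.pow(2, pqCodeSize) * 4 * d));
    }

    static double ivfPqL2Bytes(int pqCodeSize, int pqM, int ivfNlist, long numVectors, int numSegments, int d) {
        double overhead = 24;  // bytes per vector (assumed to match the HNSWPQ overhead)
        return 1.1 * ((pqCodeSize / 8.0 * pqM + overhead) * numVectors
                + numSegments * (Math.pow(2, pqCodeSize) * 4 * d + 4.0 * ivfNlist * d)
                + pqM * Math.pow(2, pqCodeSize) * ivfNlist * 4);  // single shared ivfpq-l2 table
    }

    public static void main(String[] args) {
        long numVectors = 10_000_000L;
        System.out.printf("HNSWPQ estimate:   %.2f GB%n", hnswPqBytes(8, 64, 16, numVectors, 100, 128) / 1e9);
        System.out.printf("IVFPQ-l2 estimate: %.2f GB%n", ivfPqL2Bytes(8, 64, 2048, numVectors, 100, 128) / 1e9);
    }
}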

Backwards Compatibility

This solution is backwards compatible for the following reasons:

  1. We are not changing any serialization/deserialization logic
  2. We are not adding any additional cluster state
  3. Change does not interact with transport layer

Metrics

No metrics will be added with this change, mainly because we will be refactoring the memory management system in the
future. If we added more metrics now, we would just have to deprecate them later, which does not make sense.

We will add relevant logging.

Appendix A: Experiments Showing Memory Gaps

This section details several experiments that were conducted to show the differences between the formula estimate, the
plugin's memory management estimate, and the observed memory consumption.

Experimental Procedure

  1. Create training index
  2. Ingest training vectors (random data)
  3. Train model
  4. Create index from model
  5. Ingest vectors (random data)
  6. Force merge to 1 segment
  7. Stop/Start docker to reset memory
  8. Track memory usage
  9. Warmup index
  10. Track memory usage
  11. Calculate sum of all faiss files

IVF

(results table: IVF memory estimates vs. observed usage)

Conclusions

  1. For IVF, the estimate when accounting for num_segments is consistent with file size estimate and formula estimate, for both inner product and l2 distances

HNSWPQ

(results table: HNSWPQ memory estimates vs. observed usage)

Conclusions

  1. The revised estimate (est 4), which includes the sdc allocation, closely approximates actual memory consumption for both l2 and inner product
  2. Removing the sdc table reduces memory consumption significantly

IVFPQ

(results table: IVFPQ memory estimates vs. observed usage)

Conclusions

  1. The l2 allocation adds a very large overhead

Appendix B: Experiments with and without the IVFPQ-l2 table

This experiment shows the latency implications for IVFPQ-l2 with and without the precomputed table enabled.

Experimental Procedure

  1. Start docker container with OpenSearch
  2. Create training index
  3. Ingest training vectors (random data)
  4. Train model
  5. Create index from model
  6. Ingest vectors (random data)
  7. Force merge to 1 segment
  8. Stop/Start docker to reset memory
  9. Warmup index
  10. Run search workload

Results w/ optimized query

(results table: search latency with the optimized query)

Results w/o optimized query

(results table: search latency without the optimized query)

Conclusions

  1. The precomputed table provides significant latency improvements for optimized queries, and also improvements for non-optimized queries.

jmazanec15 (Member, Author) commented:

Closing as everything is merged. Will open up another issue for more generalized sharing.
