[RELEASE] raft v24.02 #2134
Merged
Forward-merge branch-23.12 to branch-24.02
Forward-merge branch-23.12 to branch-24.02
The version doesn't need to be hardcoded into pyproject.toml files anymore, but it looks like update-version.sh wasn't updated to account for that. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ray Douglass (https://github.com/raydouglass) - Corey J. Nolet (https://github.com/cjnolet)
Forward-merge branch-23.12 to branch-24.02
Forward-merge branch-23.12 to branch-24.02
Forward-merge branch-23.12 to branch-24.02
Forward-merge branch-23.12 to branch-24.02
At some point, `ci/checks/copyright.py` implementation diverged from other RAPIDS repos. This PR uses https://github.com/rapidsai/cudf/blob/branch-24.02/ci/checks/copyright.py as a reference to update the script. This new implementation uses git history to figure out the year in which a file was last modified and then adds that to the copyright year. The PR also: 1. Excludes thirdparty files/licences 2. Adds missing copyright headers Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Bradley Dice (https://github.com/bdice) - Jake Awe (https://github.com/AyodeAwe) URL: #2008
Forward-merge branch-23.12 to branch-24.02
This PR now exports 3 CSVs from search result JSON files, with the following suffixes:
1. `raw`: All results
2. `throughput`: Pareto frontier of throughput results
3. `latency`: Pareto frontier of latency results

The Pareto frontier is no longer created in `raft-ann-bench.plot`. Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2009
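As an illustration of what a Pareto frontier over benchmark results means here, the following is a minimal sketch and not the `raft-ann-bench` implementation. It assumes each result is a `(recall, value)` pair, where the value is throughput (higher is better) or latency (lower is better); the function name `pareto_frontier` and the tuple layout are hypothetical.

```python
def pareto_frontier(points, maximize_y=True):
    """Return the (x, y) points that no other point dominates.

    x is recall (higher is better). y is throughput when maximize_y is True
    (higher is better) or latency when it is False (lower is better).
    This is an illustrative sketch, not the raft-ann-bench code.
    """
    # Walk from the highest recall to the lowest; on recall ties, visit the
    # better y first so the worse duplicates are skipped.
    ordered = sorted(
        points,
        key=lambda p: (-p[0], -p[1] if maximize_y else p[1]),
    )
    frontier = []
    best_y = None
    for x, y in ordered:
        improves = best_y is None or (y > best_y if maximize_y else y < best_y)
        if improves:
            frontier.append((x, y))
            best_y = y
    return frontier


# Hypothetical (recall, queries-per-second) results.
throughput_results = [(0.90, 1000), (0.95, 800), (0.95, 700), (0.99, 300), (0.92, 700)]
throughput_frontier = pareto_frontier(throughput_results, maximize_y=True)

# Hypothetical (recall, latency) results.
latency_results = [(0.90, 1.0), (0.95, 2.0), (0.99, 5.0), (0.92, 3.0)]
latency_frontier = pareto_frontier(latency_results, maximize_y=False)
```

A point is kept only if no other point has recall at least as high together with a strictly better throughput or latency.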
Forward-merge branch-23.12 to branch-24.02
Forward-merge branch-23.12 to branch-24.02
Enable host input data for IVF-Flat build. This is done by processing the dataset batch-wise during extend, similar to how IVF-PQ does it. Authors: - Tamas Bela Feher (https://github.com/tfeher) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1635
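The batch-wise idea can be sketched in a few lines. This is an illustration only, not the RAFT code: `add_batch` is a hypothetical callback standing in for the step that copies one slice of the host dataset to the device and appends it to the index.

```python
def iter_batches(n_rows, batch_size):
    """Yield (start, end) row ranges that cover [0, n_rows) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def extend_in_batches(host_rows, batch_size, add_batch):
    """Feed a host-resident dataset to `add_batch` one slice at a time.

    `add_batch` is a hypothetical stand-in for "copy this slice to the
    device and add it to the index"; only one slice is in flight at once.
    """
    for start, end in iter_batches(len(host_rows), batch_size):
        add_batch(host_rows[start:end])


# Usage: collect the slices instead of copying them anywhere.
seen = []
extend_in_batches(list(range(10)), 4, seen.append)
```

Only one batch needs to fit in device memory at a time, which is what makes host-resident input practical for large datasets.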
The C++ side will automatically reset this constraint to a valid setting; however, this leads to additional param settings being trained and searched unnecessarily. Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) URL: #2016
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) URL: #2025
Add a usage example for building and searching with the brute_force index API. Also fix some minor compile-time errors in the vector search tutorial. Authors: - Ben Frederickson (https://github.com/benfred) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2029
This PR updates to fmt 10.1.1 and spdlog 1.12. Depends on rapidsai/rmm#1374. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Jake Awe (https://github.com/AyodeAwe) - Vyas Ramasubramani (https://github.com/vyasr)
Remove the selection_faiss instantiations. Since #1985, we haven't been using the faiss select_k code and these aren't necessary anymore. This should lead to a 70MB improvement in libraft.so binary size. This also removes the raft::spatial::select_k code in favour of matrix::select_k - the spatial version was marked deprecated, and didn't switch between the best selection algorithms for the input size. Authors: - Ben Frederickson (https://github.com/benfred) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: #2027
Authors: - Micka (https://github.com/lowener) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1996
There is an entry missing from the `update-version.sh` for ucx-py wheels, so ucx-py is pinned at 0.35 instead of 0.36. This was probably overlooked when adding devcontainers. This updates to the correct versions and fixes the version update script. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Ray Douglass (https://github.com/raydouglass) URL: #2035
Forward-merge branch-23.12 to branch-24.02
…afe (#2030) Update device_resources_manager to reuse only the memory manager, stream, and stream pools across threads. Create a unique resources object per device for each thread, since the resources object is not thread-safe. Authors: - William Hicks (https://github.com/wphicks) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2030
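The sharing pattern described above (shared pools, one resources object per thread and device) can be sketched as follows. This is a conceptual illustration in Python, not the C++ `device_resources_manager`; the class names `SharedPools`, `Resources`, and `ResourcesManager` are hypothetical.

```python
import threading


class SharedPools:
    """Hypothetical stand-in for state that is safe to share across threads
    (memory manager, streams, stream pools)."""

    def __init__(self, device_id):
        self.device_id = device_id


class Resources:
    """Hypothetical stand-in for a resources object that is NOT thread-safe."""

    def __init__(self, pools):
        self.pools = pools


class ResourcesManager:
    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}                 # device_id -> SharedPools (shared)
        self._local = threading.local()  # per-thread cache of Resources

    def _get_pools(self, device_id):
        # Creation of the shared state is guarded by a lock.
        with self._lock:
            if device_id not in self._pools:
                self._pools[device_id] = SharedPools(device_id)
            return self._pools[device_id]

    def get_resources(self, device_id):
        # Each thread sees its own cache, so Resources objects are never
        # handed to more than one thread.
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = {}
        if device_id not in cache:
            cache[device_id] = Resources(self._get_pools(device_id))
        return cache[device_id]


manager = ResourcesManager()
main_res = manager.get_resources(0)

from_worker = []
worker = threading.Thread(target=lambda: from_worker.append(manager.get_resources(0)))
worker.start()
worker.join()
```

In this sketch the worker thread receives a different `Resources` object than the main thread, while both point at the same `SharedPools` instance.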
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) URL: #2041
This PR fixes the typos in the ANN benchmark parameter tuning guide regarding `refine_ratio` and dataset location. Specifically, 1. Default refine ratio should be 1 instead of 0 - RAFT IVF-PQ. [link](https://github.com/rapidsai/raft/blob/branch-24.02/cpp/bench/ann/src/raft/raft_ivf_pq_wrapper.h#L99) - FAISS GPU. [link](https://github.com/rapidsai/raft/blob/branch-24.02/cpp/bench/ann/src/faiss/faiss_gpu_wrapper.h#L89) 2. Default dataset location - RAFT IVF-Flat. Should be mmap instead of device. [link](https://github.com/rapidsai/raft/blob/branch-24.02/cpp/bench/ann/src/raft/raft_ivf_flat_wrapper.h#L81) - RAFT IVF-PQ. Should be host instead of device. [link](https://github.com/rapidsai/raft/blob/branch-24.02/cpp/bench/ann/src/raft/raft_ivf_pq_wrapper.h#L84) - RAFT CAGRA. Should be mmap instead of device. [link](https://github.com/rapidsai/raft/blob/branch-24.02/cpp/bench/ann/src/raft/raft_cagra_wrapper.h#L113) And I think we can unify the dataset location to either mmap or host. Furthermore, to enable better copy performance and enable kernel/copy overlap, RAFT should also support `pinned_host` as one of the memory types. I can open a separate issue for it if people think it's reasonable. Authors: - Rui Lan (https://github.com/abc99lr) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2034
Forward-merge branch-23.12 to branch-24.02
Some minor simplification in advance of the scikit-build-core migration to better align wheel and non-wheel Python builds. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Ray Douglass (https://github.com/raydouglass) URL: #2040
This PR changes all references to pypi.nvidia.com to pypi.anaconda.org/rapidsai-wheels-nightly. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ray Douglass (https://github.com/raydouglass) URL: #2042
Found when using a newer version of gcc Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Ben Frederickson (https://github.com/benfred) URL: #2045
Add serialization and deserialization methods for brute_force index. Also add overloads to brute_force search and build functions taking index_param and search_param arguments for API compatibility with other index types. Authors: - William Hicks (https://github.com/wphicks) Approvers: - Ben Frederickson (https://github.com/benfred) - Corey J. Nolet (https://github.com/cjnolet) URL: #2036
This PR replaces the current cublas `gemm` backend of `raft::linalg::gemm` with cublasLt `matmul`. The latter is more flexible and allows decoupling the selection of the algorithm heuristics from its execution. Thanks to this change, this PR adds memoization of the matmul heuristics and the other arguments (matrix layouts and the matmul descriptor).

#### Performance on specific workloads
IVF-PQ performs two gemm operations during pre-processing on small work sizes. The preprocessing consists of a few kernel launches and rather heavy logic on the CPU side (which results in gaps between the kernel launches). This PR **roughly halves the gemm kernel launch latency** (approx 10us -> 5us, as measured by NVTX from entering the `matmul` wrapper on the host to the launch of the kernel). As a motivating example: this PR improves QPS of IVF-PQ by ~5-15% on small batches (tested on SIFT-128, n_queries = 1, n_probes = 20 and 200).

#### Synthetic benchmarks: no significant difference
Running all 4K+ benchmarks across RAFT does not show a significant difference in CPU/GPU exec time.
- Overall, the average exec time is reduced by ~0.5%
- 100+ benchmarks show 5-10% time reduction
- 9 benchmarks show 5-10% time increase (none of them use GEMM)

Only a small fraction of RAFT benchmarks actually use GEMM, so most of the stronger deviations are likely due to pure chance. Having no gain across all benchmarks is not surprising, because we've designed most of them for somewhat larger work sizes, which hides the gemm latency.

Authors: - Artem M. Chirkin (https://github.com/achirkin) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) URL: #1736
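The memoization idea can be illustrated with a small Python sketch. It does not call cuBLASLt and is not the RAFT code: `select_matmul_plan` and `MatmulPlan` are hypothetical stand-ins for an expensive heuristic query whose result depends only on the problem description, so it can be cached and reused by later calls with the same arguments.

```python
from collections import namedtuple
from functools import lru_cache

# Hypothetical description of a selected algorithm plus its layouts.
MatmulPlan = namedtuple("MatmulPlan", ["shape", "layout", "dtype"])

heuristic_queries = {"count": 0}  # counts how often the slow path runs


@lru_cache(maxsize=None)
def select_matmul_plan(m, n, k, trans_a, trans_b, dtype):
    """Stand-in for an expensive heuristic lookup.

    The cache key is the full problem description, so a repeated call with
    the same arguments skips the lookup entirely.
    """
    heuristic_queries["count"] += 1
    return MatmulPlan(shape=(m, n, k), layout=(trans_a, trans_b), dtype=dtype)


first = select_matmul_plan(1, 128, 128, False, True, "float32")
second = select_matmul_plan(1, 128, 128, False, True, "float32")  # cache hit
other = select_matmul_plan(2, 128, 128, False, True, "float32")   # new key
```

The benefit is largest when the same small problem is launched repeatedly, which matches the small-batch IVF-PQ case described above.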
This bug happened when trying to run a CAGRA index with `itopk=100` and `topk=100`. The `num_cta_per_query` variable was equal to 3 because integer division truncates 100 / 32 = 3.125 to 3, whereas ceildiv(100, 32) = 4 is needed. This resulted in the following error:
```
RuntimeError: RAFT failure at file=/opt/conda/conda-bld/work/cpp/include/raft/neighbors/detail/cagra/search_multi_cta.cuh line=183: `num_cta_per_query` (3) * 32 must be equal to or greater than `topk` (100) when 'search_mode' is "multi-cta". (`num_cta_per_query`=max(`search_width`, `itopk_size`/32))
```
Authors: - Micka (https://github.com/lowener) Approvers: - tsuki (https://github.com/enp1s0) - Corey J. Nolet (https://github.com/cjnolet) URL: #2107
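The arithmetic behind this failure is easy to reproduce. The sketch below uses plain Python integers rather than the CAGRA code; the constant 32 and the `topk` check are taken from the error message above, and the variable names are illustrative.

```python
def ceildiv(a, b):
    """Integer division rounded up (for positive b)."""
    return -(-a // b)


itopk_size = 100
topk = 100
results_per_cta = 32  # the factor 32 from the error message

truncated = itopk_size // results_per_cta          # floor division
rounded_up = ceildiv(itopk_size, results_per_cta)  # ceiling division

# The check from the error message: num_cta_per_query * 32 >= topk.
truncated_ok = truncated * results_per_cta >= topk
rounded_up_ok = rounded_up * results_per_cta >= topk
```

With truncating division the check fails because 3 * 32 = 96 is less than 100, while rounding up gives 4 * 32 = 128, which satisfies it.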
This change allows CAGRA search to have an arbitrarily large top-k, instead of being limited to 1024 like in the previous code. This works by using the multi-kernel search path, and replacing the _cuann_find_topk code with the matrix::select_k code - which can handle large K values. Authors: - Ben Frederickson (https://github.com/benfred) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: #2097
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) URL: #2117
This PR adds support for an eps-neighborhood search based on random ball cover. The algorithm re-uses the existing rbc index creation. Changes:
* add C++ API `raft::neighbors::ball_cover::epsUnexpL2NeighborhoodRbc`
* lifted the 2/3-D limitation for eps-neighborhood via RBC (limitation still in place for k-nn queries)
* pylibraft support for dense brute-force eps-neighborhood
* pylibraft support for sparse/dense rbc eps-neighborhood

Note: The PR also contains a fix for the vertex degree computation in the brute force algorithm `spatial::knn::detail::epsUnexpL2SqNeighborhood`. Related to #1984 and #517 Authors: - Malte Förster (https://github.com/mfoerste4) - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) URL: #2028
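For readers unfamiliar with the term, an eps-neighborhood query returns, for each query point, which dataset points lie within a radius, along with the number of such neighbors (the vertex degree). The following brute-force sketch is a reference illustration in pure Python, not the RAFT or pylibraft API. It assumes squared L2 distance and a strict `<` comparison; the function name is hypothetical.

```python
def eps_neighborhood_l2sq(queries, points, eps_sq):
    """Brute-force eps-neighborhood on squared L2 distance.

    Returns (adjacency, degrees): adjacency[i][j] is True when points[j]
    lies within the radius of queries[i]; degrees[i] counts those points.
    The strict `<` comparison is an assumption of this sketch.
    """
    adjacency = []
    degrees = []
    for q in queries:
        row = []
        for p in points:
            dist_sq = sum((a - b) ** 2 for a, b in zip(q, p))
            row.append(dist_sq < eps_sq)
        adjacency.append(row)
        degrees.append(sum(row))
    return adjacency, degrees


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
queries = [(0.0, 0.0), (3.0, 0.0)]
adjacency, degrees = eps_neighborhood_l2sq(queries, points, eps_sq=4.0)
```

A random-ball-cover index aims to produce the same answer while skipping distance computations for points that cannot fall inside the radius.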
Part of rapidsai/rmm#1388. This removes now-optional and soon-to-be deprecated functions from cuDF's custom device_memory_resource implementations: - `supports_get_mem_info()` - `do_get_mem_info()` Authors: - Mark Harris (https://github.com/harrism) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2108
This PR adds support for `__half` and `nv_bfloat16` to `abs` and `myinf`. Authors: - Nicolas Blin (https://github.com/Kh4ster) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1592
With the bug fix #2117 there can be an issue with `z_tmp` memory being uninitialized. The SPMM formula is `Z = alpha * X * Y + beta * Z`, so when `beta` is not zero, Z is read. The solution proposed in this PR removes the need for an extra allocation and a copy from/to an external buffer, by creating a strided view of the original Z. Authors: - Micka (https://github.com/lowener) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2124
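Why uninitialized `Z` only matters for non-zero `beta` can be shown with a dense, pure-Python sketch of the update formula. This is not the RAFT SPMM code: matrices are nested lists, `None` stands in for uninitialized memory, and skipping the read of `Z` when `beta` is zero follows the usual BLAS convention, which is an assumption here.

```python
def matmul_update(alpha, x, y, beta, z):
    """Compute alpha * (X @ Y) + beta * Z for dense nested lists.

    Z is only read when beta is non-zero (assumed BLAS-style convention).
    """
    n_rows, n_inner, n_cols = len(x), len(y), len(y[0])
    out = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            acc = sum(x[i][k] * y[k][j] for k in range(n_inner))
            value = alpha * acc
            if beta != 0:
                value += beta * z[i][j]  # reads the previous contents of Z
            row.append(value)
        out.append(row)
    return out


x = [[1, 2], [3, 4]]
identity = [[1, 0], [0, 1]]
z_init = [[10, 10], [10, 10]]
z_uninit = [[None, None], [None, None]]  # stand-in for uninitialized memory

no_read = matmul_update(2, x, identity, 0, z_uninit)    # Z is never touched
with_read = matmul_update(2, x, identity, 0.5, z_init)  # Z contributes

try:
    matmul_update(2, x, identity, 0.5, z_uninit)
    read_failed = False
except TypeError:
    read_failed = True  # the "uninitialized" values were read
```

On a GPU the read would not fail; it would silently fold garbage into the result, which is why the buffer backing `Z` must hold valid data whenever `beta` is non-zero.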
The CPU time stamp `start` is taken before the ANN algo is copied to all threads. This is fixed by initializing `start` a few lines later. Authors: - Tamas Bela Feher (https://github.com/tfeher) - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Artem M. Chirkin (https://github.com/achirkin) URL: #2084
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #2129
This PR addresses #1901 by subsampling the input dataset for PQ codebook training to reduce the runtime. Currently, a similar strategy is applied to the `per_cluster` method, but not to the default `per_subset` method. This PR closes this gap. Similar to the subsampling mechanism of the `per_cluster` method, we pick at least `256*max(pq_book_size, pq_dim)` input rows for training each codebook. https://github.com/rapidsai/raft/blob/cf4e03d0b952c1baac73f695f94d6482d8c391d8/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L408

The following performance numbers were generated using the Deep-100M dataset. After subsampling, the search time and accuracy are not impacted (within +-5%), except for one case where I saw a 9% performance drop on search (using a 10K batch for search). More extensive benchmarking across datasets seems to be needed for justification.

Dataset | n_iter | n_list | pq_bits | pq_dim | ratio | Original time (s) | Subsampling (s) | Speedup [subsampling]
-- | -- | -- | -- | -- | -- | -- | -- | --
Deep-100M | 25 | 50000 | 4 | 96 | 10 | 129 | 89.5 | 1.44
Deep-100M | 25 | 50000 | 5 | 96 | 10 | 128 | 89.4 | 1.43
Deep-100M | 25 | 50000 | 6 | 96 | 10 | 131 | 90 | 1.46
Deep-100M | 25 | 50000 | 7 | 96 | 10 | 129 | 91.1 | 1.42
Deep-100M | 25 | 50000 | 8 | 96 | 10 | 149 | 93.4 | 1.60

Note: after subsampling, the PQ codebook generation is no longer a bottleneck in IVF-PQ index building. More optimizations on PQ codebook generation seem unnecessary. Although we could in theory apply the custom kernel approach (#2050) with subsampling, my early tests show the current GEMM approach performs better than the custom kernel after subsampling. Using multiple streams could improve the performance further by overlapping kernels for different `pq_dim`, given that the kernels are small after subsampling and may not fully utilize the GPU. However, as mentioned above, since the entire PQ codebook generation is fast, this optimization may not be worthwhile.
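To make the lower bound concrete, the sketch below evaluates `256*max(pq_book_size, pq_dim)` for the `pq_bits` and `pq_dim` settings in the benchmark. It assumes `pq_book_size` equals `2**pq_bits` (the number of entries addressable with `pq_bits` bits); the function name is hypothetical and this is not the RAFT code.

```python
def min_training_rows(pq_bits, pq_dim):
    """Lower bound on rows sampled for training one codebook.

    Assumes pq_book_size == 2 ** pq_bits.
    """
    pq_book_size = 1 << pq_bits
    return 256 * max(pq_book_size, pq_dim)


pq_dim = 96
bounds = {bits: min_training_rows(bits, pq_dim) for bits in (4, 5, 6, 7, 8)}
```

For `pq_dim = 96`, the `pq_dim` term dominates up to `pq_bits = 6` (24576 rows); for `pq_bits` of 7 and 8 the codebook size dominates (32768 and 65536 rows). All of these are small compared with a 100M-row dataset.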
TODO - [x] Benchmark the performance/accuracy impacts on multiple datasets Authors: - Rui Lan (https://github.com/abc99lr) - Ray Douglass (https://github.com/raydouglass) - gpuCI (https://github.com/GPUtester) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) URL: #2052
Closes #1772 Authors: - Divye Gala (https://github.com/divyegala) - Corey J. Nolet (https://github.com/cjnolet) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Micka (https://github.com/lowener) - Corey J. Nolet (https://github.com/cjnolet) URL: #2022
This PR conditionally includes `hnsw` sources, to prevent build errors like those seen in cuGraph after #2022 was merged. See also: rapidsai/cugraph#4121, rapidsai/cugraph#4122 Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Bradley Dice (https://github.com/bdice) - Robert Maynard (https://github.com/robertmaynard)
There is an unusual test failure in raft-dask: ``` Exception: ModuleNotFoundError("No module named 'raft_dask.common.comms_utils'") ``` The tests started failing at the same time that pytest 8 was released. This PR tests with `pytest==7.*` to isolate the root cause (and fix it if it is the problem).
Closes #2141. Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) - Tamas Bela Feher (https://github.com/tfeher)
RAFT C++ tests were not running for a portion of the 24.02 development cycle, until the merger of rapidsai/rapids-cmake#533. This PR fixes some failing tests and reverts PRs that caused test failures that were silent until now, specifically #2097 and #2085. These features will be revisited in a subsequent release. Authors: - Malte Förster (https://github.com/mfoerste4) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Ben Frederickson (https://github.com/benfred) - Bradley Dice (https://github.com/bdice)
Add additional constraint to is_row/col_major check. Authors: - Malte Förster (https://github.com/mfoerste4) Approvers: - Tamas Bela Feher (https://github.com/tfeher)
❄️ Code freeze for `branch-24.02` and v24.02 release

What does this mean?
Only critical/hotfix level issues should be merged into `branch-24.02` until release (merging of this PR).

What is the purpose of this PR?
Merging `branch-24.02` into `main` for the release