Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge #162

Closed
wants to merge 21 commits into from
Closed

Merge #162

wants to merge 21 commits into from

Conversation

lums658
Copy link
Contributor

@lums658 lums658 commented Dec 11, 2023

This PR supersedes #149 -- it comprises main with a squash merge of #149. The test below is from the original PR.

This PR includes the implementation of two graph-based indexing algorithms: nn-descent and vamana. The cpp_api branch was also merged into this one, so there is also code in this PR from that branch.

CPP API

For C++ library API

  • api.h - Type-erased interface between Python and C++, including Vector, FeatureVector, and Index. The type-erased classes take a URI and read in the corresponding TileDB array. The underlying type of the Vector, FeatureVector, or Index class is determined by the data type of the TileDB array.
  • concepts.h - Initial introduction of C++20 concepts. A collection of concepts for defining the API to the vector search library. This allows us to abstract away the format of the data storage (e.g., we don't need to worry about row-major or column-major).
  • Use concepts for template parameters in function templates. Replace type-specific calls in the library (e.g., in query functions, matrix classes) with CPOs. There will be many small diffs due to this replacement. The two most important concepts are
    • A feature_vector is a contiguous random_access_range with the addition of operator[] on the container. Moreover, the expression dimension(v) must be valid for a type to meet the requirements of feature_vector (which can be implemented with a customization point object, or CPO).
    • A feature_vector_range is a container of feature_vector. The ith feature_vector is obtained via operator[]. The expressions dimension(r) and num_vectors(r) must be valid. feature_vector_range is not actually a range (in the sense of std::ranges, since we don't provide the required iterators) so we will need to change the name at some point. Or we could consider requiring feature_vector_range actually meet the requirements of std::range -- in which case it would need to have all the iterator plumbing, etc. Providing this capability is not too difficult, requiring only a fairly straightforward facade.
  • cpos.h - Customization point objects to work in conjunction with the concepts, including
    • num_vectors(), dimension() to provide abstract size data for set of vectors (rather than num_rows(), num_cols())
    • load() - abstracts away loading data into a matrix. Is a nop for a Matrix, will load data into a tdbMatrix

Further descriptions about the CPP_API PR can be found in PR 131.

Graph-Based Indexing and Search

  • Graph classes
    • adj_list.h - Two adjacency list classes, one that just stores indices and one that stores indices and scores
    • nn-graph.h - An adjacency list class nn_graph that keeps the neighbors in a fixed_min_heap in order to store only the neighbors with the top-n scores

Graph creation and search algorithms

  • nn-descent.h - Implements the nn-descent algorithm, using an nn_graph
  • vamana.h - Implements a vamana_index class that builds an index according to the vamana algorithm as described in the "filtered fresh DiskANN" paper (not either the original DiskANN paper nor the fresh DiskANN paper).
    • The index is built by iterating over the vectors in the data set and processing each vector as an insert into the index.
    • When a vector is added, a best-first path from the medoid (the vector closest to the middle of the vector data set) is computed.
    • The path is then pruned to keep the degree of the vertexes on that path below a given threshold.
    • The functions for performing the best-first search and the pruning are greedy_search and robust_prune (which are the names given in the original DiskANN paper).

Utility and helper functions

  • diskann.h - functions for reading DiskANN memory index files

Demo applications

  • The demo_* programs in the test subdirectory are essentially integration tests that output qualitative data for assessing proper functioning of key algorithms in building the graph-based indexes (such as dumping graph information to disk so that the graphs can be read and plotted by networkx). Two notebooks are provided for examining data dumped by unit_vamana and vamana: plot_unit_vamana.ipynb and plot_vamana.ipynb.

Unit tests

  • Unit tests for the new classes and functions are provided in include/test.

CLI drivers

  • Two CLI drivers are provided for vamana indexing: src/vamana/index.cc which builds a vamana-based index for a specified TileDB set of vector data, and src/vamana/query.cc that queries a vamana-based index. As with previous C++ CLI programs, these print extensive logging data.

Benchmark scripts

  • src/benchmarks/vamana - diskann.bash and vamana.bash respectively run indexing and querying programs from DiskANN and TileDB on some small-ish datasets (datasets that will fit in memory) and report timings (etc). DiskANN is only configured to compile and run on x86 Linux, so these scripts need to be run on an appropriate platform.

Cleanup of unused files

NB A number of the unit tests read and write arrays from the local file system. These are hard-coded to my personal development environment since we are not currently pulling those into a uniform location (this should be fixed at some point so that test arrays are pulled in manually by running a script or automatically as part of cmake).

Copy link

This pull request has been linked to Shortcut Story #32552: Implement NN-Descent to create nearest neighbor graph.

@jparismorgan
Copy link
Contributor

Is this PR meant to go before #161, which looks like a superset of this? Generally seems good, though would be good to apply my review on #161 here before landing.

Approving, though it would be good to get another review here as well.

@lums658
Copy link
Contributor Author

lums658 commented Dec 14, 2023

Is this PR meant to go before #161, which looks like a superset of this? Generally seems good, though would be good to apply my review on #161 here before landing.

Approving, though it would be good to get another review here as well.

The idea here was to do a merge with the latest main so that there were no conflicts with main and so that it could be quickly approved and merged back into main. From there, we can rebase #153, #154, #155, and #161 so that their diffs will not be so large and hopefully get them reviewed and approved relatively quickly so that we can make and keep main consistent with the latest goings on with C++.

Copy link
Collaborator

@NikolaosPapailiou NikolaosPapailiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has a lot of features packaged in a single review cycle.

In terms of release safety, we definitely need CI to be enabled and the test to pass. I would also prefer to split this in multiple PRs that contain the distinct feature functionalities (i.e. CPOs, concepts, Diskann, Vamana, C++ API)

@@ -34,6 +34,9 @@ def get_cmake_overrides():
if val:
conf.append("-DUSE_MKL_CBLAS={}".format(val))

conf.append("-DTileDB_DIR=/Users/lums/Contrib/dist")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

uint32_t npoints{0};
uint32_t ndim{0};

std::ifstream binary_file(path, std::ios::binary);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be TileDB Array, what is the plan to use TileDB storage for diskann?


for (size_t i = 0; i < nvertices; ++i) {
for (auto&& [_, j] : out_edges(g, i)) {
// print_types(i, j);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented lines here and in other parts of the file. If something need to remain add a comment for why it is needed.

return num_updates;
}

#if 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove dead code if not needed

}
(void)active_partitions;
// if constexpr
// (!has_num_col_parts<std::remove_cvref_t<decltype(partitioned_db)>>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove lines that are commented out. Same bellow

/**
* @file matrix.h
* @file index.h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit matrix

@@ -102,12 +169,15 @@ class Matrix : public stdx::mdspan<T, matrix_extents<I>, LayoutPolicy> {
Base::operator=(Base{storage_.get(), num_rows_, num_cols_});
}

#if 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid dead code

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove if not needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

@ihnorton ihnorton changed the title Merge [skip ci] Merge Dec 18, 2023
@ihnorton ihnorton closed this Dec 18, 2023
@ihnorton ihnorton reopened this Dec 18, 2023
@ihnorton ihnorton force-pushed the lums/sc-32552/nn-descent-too branch from 271d5ba to 6951081 Compare December 22, 2023 15:27
@ihnorton ihnorton force-pushed the lums/sc-32552/nn-descent-too branch from 1ab2f75 to d05aab2 Compare January 5, 2024 13:39
@ihnorton
Copy link
Member

ihnorton commented Feb 1, 2024

#173+

@ihnorton ihnorton closed this Feb 1, 2024
@ihnorton ihnorton mentioned this pull request Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants