-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge #162
Merge #162
Conversation
This pull request has been linked to Shortcut Story #32552: Implement NN-Descent to create nearest neighbor graph. |
The idea here was to do a merge with the latest main so that there were no conflicts with main and so that it could be quickly approved and merged back into main. From there, we can rebase #153, #154, #155, and #161 so that their diffs will not be so large and hopefully get them reviewed and approved relatively quickly so that we can make and keep main consistent with the latest goings on with C++. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has a lot of features packaged in a single review cycle.
In terms of release safety, we definitely need CI to be enabled and the test to pass. I would also prefer to split this in multiple PRs that contain the distinct feature functionalities (i.e. CPOs, concepts, Diskann, Vamana, C++ API)
@@ -34,6 +34,9 @@ def get_cmake_overrides(): | |||
if val: | |||
conf.append("-DUSE_MKL_CBLAS={}".format(val)) | |||
|
|||
conf.append("-DTileDB_DIR=/Users/lums/Contrib/dist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
src/cmake/HighFive.cmake
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
uint32_t npoints{0}; | ||
uint32_t ndim{0}; | ||
|
||
std::ifstream binary_file(path, std::ios::binary); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be TileDB Array, what is the plan to use TileDB storage for diskann?
|
||
for (size_t i = 0; i < nvertices; ++i) { | ||
for (auto&& [_, j] : out_edges(g, i)) { | ||
// print_types(i, j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented lines here and in other parts of the file. If something need to remain add a comment for why it is needed.
return num_updates; | ||
} | ||
|
||
#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove dead code if not needed
} | ||
(void)active_partitions; | ||
// if constexpr | ||
// (!has_num_col_parts<std::remove_cvref_t<decltype(partitioned_db)>>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove lines that are commented out. Same bellow
/** | ||
* @file matrix.h | ||
* @file index.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit matrix
@@ -102,12 +169,15 @@ class Matrix : public stdx::mdspan<T, matrix_extents<I>, LayoutPolicy> { | |||
Base::operator=(Base{storage_.get(), num_rows_, num_cols_}); | |||
} | |||
|
|||
#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove if not needed
src/include/test/unit_high_five.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
271d5ba
to
6951081
Compare
…TileDB-Vector-Search into lums/sc-32552/nn-descent-too
1ab2f75
to
d05aab2
Compare
…TileDB-Vector-Search into lums/sc-32552/nn-descent-too
#173+ |
This PR supersedes #149 -- it comprises
main
with a squash merge of #149. The test below is from the original PR.This PR includes the implementation of two graph-based indexing algorithms: nn-descent and vamana. The cpp_api branch was also merged into this one, so there is also code in this PR from that branch.
CPP API
For C++ library API
Vector
,FeatureVector
, andIndex
. The type-erased classes take a URI and read in the corresponding TileDB array. The underlying type of theVector
,FeatureVector
, orIndex
class is determined by the data type of the TileDB array.feature_vector
is a contiguousrandom_access_range
with the addition ofoperator[]
on the container. Moreover, the expressiondimension(v)
must be valid for a type to meet the requirements offeature_vector
(which can be implemented with a customization point object, or CPO).feature_vector_range
is a container offeature_vector
. The ithfeature_vector
is obtained viaoperator[]
. The expressionsdimension(r)
andnum_vectors(r)
must be valid.feature_vector_range
is not actually a range (in the sense ofstd::ranges
, since we don't provide the required iterators) so we will need to change the name at some point. Or we could consider requiringfeature_vector_range
actually meet the requirements ofstd::range
-- in which case it would need to have all the iterator plumbing, etc. Providing this capability is not too difficult, requiring only a fairly straightforward facade.num_vectors()
,dimension()
to provide abstract size data for set of vectors (rather thannum_rows()
,num_cols()
)load()
- abstracts away loading data into a matrix. Is a nop for aMatrix
, will load data into atdbMatrix
Further descriptions about the CPP_API PR can be found in PR 131.
Graph-Based Indexing and Search
nn_graph
that keeps the neighbors in afixed_min_heap
in order to store only the neighbors with the top-n scoresGraph creation and search algorithms
nn_graph
vamana_index
class that builds an index according to the vamana algorithm as described in the "filtered fresh DiskANN" paper (not either the original DiskANN paper nor the fresh DiskANN paper).greedy_search
androbust_prune
(which are the names given in the original DiskANN paper).Utility and helper functions
Demo applications
demo_*
programs in the test subdirectory are essentially integration tests that output qualitative data for assessing proper functioning of key algorithms in building the graph-based indexes (such as dumping graph information to disk so that the graphs can be read and plotted by networkx). Two notebooks are provided for examining data dumped byunit_vamana
andvamana
:plot_unit_vamana.ipynb
andplot_vamana.ipynb
.Unit tests
CLI drivers
Benchmark scripts
Cleanup of unused files
NB A number of the unit tests read and write arrays from the local file system. These are hard-coded to my personal development environment since we are not currently pulling those into a uniform location (this should be fixed at some point so that test arrays are pulled in manually by running a script or automatically as part of cmake).