Type-erased IVFFlatIndex class #154
Closed
This pull request has been linked to Shortcut Story #36874: Create type-erased |
Closed
lums658 added a commit that referenced this pull request on Jan 24, 2024
## Type Erased FeatureVector and FeatureVectorArray

This is a recap of the feature vector and feature vector array type-erasure component of #154.

* The primary contribution of this PR is the implementation of type-erased `FeatureVector` and `FeatureVectorArray`. Since one of the goals of having the type-erased classes is to seamlessly integrate with Python, the Python bindings for these classes were also added.
* Some updating of matrix and vector I/O was done since the type-erased classes were just copied over. To update matrix and vector, the files implementing them were copied over the existing files, with some minimal changes.
* The CPO `num_vectors()` replaced `size()` in numerous places.

**NOTE:** The type-erased Python binding files that were copied over included code for index classes. This code has been temporarily commented out (usually with `#if 0`), pending the next PR.

## Overview of Type Erasure

(See also the README.md in include/api.)

Type erasure is accomplished as a three-layer cake:

* A non-templated abstract base class with pure virtual functions for the member functions that we want to expose from the typed C++ classes.
* An implementation class that inherits from the abstract base class. It is templated on the concrete class to be wrapped and keeps a member variable that is an object of the class to be wrapped. Its member functions (all of which are overrides of the abstract base class member functions) forward to the underlying C++ implementation by invoking the appropriate member. At this point, the internal data that is stored by the typed member variable also needs to be converted to the appropriate type before invoking the member variable's function.
* A non-templated class that presents the user API. It internally defines the abstract base class and the implementation class. It has a `std::unique_ptr` to the abstract base class as a member variable.

During construction (either by passing in already constructed vectors or by reading the index from a TileDB group), the appropriate template types for the internal data to be stored by the internal implementation are inferred, and an object of the implementation class is constructed and stored in the `std::unique_ptr`.

To illustrate the basic idea, consider `FeatureVector`. In abbreviated form, where we just show a single function `data`, it looks like this:

```c++
class FeatureVector {
 public:
  FeatureVector(const tiledb::Context& ctx, const std::string& uri) {
    // get type of vector stored in uri array -- say, float
    feature_vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
  }

  auto data() {
    return feature_vector_->data();
  }

 private:
  // Layer 1: non-templated abstract interface
  class Base {
   public:
    virtual ~Base() = default;
    virtual void* data() = 0;
  };

  // Layer 2: templated wrapper around the concrete typed class
  template <class T>
  class vector_impl : public Base {
   public:
    explicit vector_impl(T&& t)
        : impl_vector_(std::forward<T>(t)) {
    }
    T impl_vector_;
  };

  std::unique_ptr<Base> feature_vector_;
};
```

The constructor to read the `FeatureVector` from a TileDB array has the following prototype:

```c++
FeatureVector(const tiledb::Context& ctx, const std::string& uri);
```

When that constructor is invoked, it first reads the schema associated with the `uri` and creates an implementation object based on that type. For example, if the type read from the schema (`feature_type`) is one of `float` or `uint8`, the constructor dispatches like this:

```c++
switch (feature_type_) {
  case TILEDB_FLOAT32:
    vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
    break;
  case TILEDB_UINT8:
    vector_ = std::make_unique<vector_impl<tdbVector<uint8_t>>>(ctx, uri);
    break;
}
```

At this point, we have created a `std::unique_ptr` of the abstract base class that points to an object of the derived class.

If we invoke the `data` member function of the outer (type-erased) `FeatureVector` class, we dispatch to the corresponding member of the object stored in the `std::unique_ptr`:

```c++
auto data() const {
  return feature_vector_->data();
}
```

Since `feature_vector_` actually points to the derived implementation class, its `data` member function is then invoked:

```c++
void* data() override {
  // impl_vector_ is held by value, so use '.' rather than '->'
  return impl_vector_.data();
}
```

We return a `void*` since `data()` is an override of the non-templated `Base` class.
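The three layers described above can also be sketched without any TileDB dependency. The following is a minimal illustration under stated assumptions, not the code from this PR: `ErasedVector` and the `Datatype` enum are hypothetical names, `std::vector` stands in for `tdbVector`, and the runtime tag is passed in directly instead of being read from an array schema.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the datatype that would be read from a schema.
enum class Datatype { float32, uint8 };

// Layer 3: the non-templated, user-facing class.
class ErasedVector {
 public:
  // Chooses the implementation type from a runtime tag.
  ErasedVector(Datatype type, std::size_t n) : type_(type) {
    switch (type) {
      case Datatype::float32:
        impl_ = std::make_unique<Impl<std::vector<float>>>(
            std::vector<float>(n));
        break;
      case Datatype::uint8:
        impl_ = std::make_unique<Impl<std::vector<std::uint8_t>>>(
            std::vector<std::uint8_t>(n));
        break;
      default:
        throw std::runtime_error("unsupported datatype");
    }
  }

  // Forward to whichever implementation was selected at construction.
  void* data() { return impl_->data(); }
  std::size_t size() const { return impl_->size(); }
  Datatype type() const { return type_; }

 private:
  // Layer 1: non-templated abstract base class.
  struct Base {
    virtual ~Base() = default;
    virtual void* data() = 0;
    virtual std::size_t size() const = 0;
  };

  // Layer 2: templated implementation that owns the typed object.
  template <class T>
  struct Impl : Base {
    explicit Impl(T&& t) : impl_vector_(std::move(t)) {}
    void* data() override { return impl_vector_.data(); }
    std::size_t size() const override { return impl_vector_.size(); }
    T impl_vector_;
  };

  Datatype type_;
  std::unique_ptr<Base> impl_;
};
```

A caller only ever names `ErasedVector`; the element type is fixed at construction and hidden behind the `Base` interface, which is why `data()` can only hand back a `void*`.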
Changes here incorporated in PR #210.
This PR implements a type-erased API for the C++ `ivf_index` class. It builds on earlier PRs, including the one that completed the initial `ivf_index` class with kmeans. See the comments for each of those PRs for information on their implementations.

(This PR was branched from main and then a squash merge was made from `lums/tmp/type_erased_ivf_index`.)

## Type Erasure
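Type erasure is accomplished as a three-layer cake:

* A non-templated abstract base class with pure virtual functions for the member functions that we want to expose from the typed C++ classes.
* An implementation class that inherits from the abstract base class. It is templated on the concrete class to be wrapped and keeps a member variable that is an object of the class to be wrapped. Its member functions (all of which are overrides of the abstract base class member functions) forward to the underlying C++ implementation by invoking the appropriate member.
* A non-templated class that presents the user API. It internally defines the abstract base class and the implementation class. It has a `std::unique_ptr` to the abstract base class as a member variable.

During construction (either by passing in already constructed vectors or by reading the index from a TileDB group), the appropriate template types for the internal data to be stored by the internal implementation are inferred, and an object of the implementation class is constructed and stored in the `std::unique_ptr`.

To illustrate the basic idea, consider `FeatureVector`, in abbreviated form with just a single function, `data`. The constructor that reads a `FeatureVector` from a TileDB array takes a `tiledb::Context` and a `uri`. When that constructor is invoked, it first reads the schema associated with the `uri` and creates an implementation object based on that type. For example, if the type read from the schema (`feature_type`) is one of `float` or `uint8`, the constructor dispatches on it and constructs the implementation class templated on `tdbVector<float>` or `tdbVector<uint8_t>`, respectively. At this point, we have created a `std::unique_ptr` of the abstract base class that points to an object of the derived class.

If we invoke the `data` member function of the outer (type-erased) `FeatureVector` class, we dispatch to the corresponding member of the object stored in the `std::unique_ptr`. Since `feature_vector_` actually points to the derived implementation class, its `data` member function is then invoked. We return a `void*` since `data()` is an override of the non-templated `Base` class.

(TODO: In a future PR maybe we can cast to an appropriate type extracted from the type of `vector_`?)

(TODO: Is there a way to condense the boilerplate that is currently contained in all of these?)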
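For the first TODO, one possible direction is to pair the `void*` with the runtime datatype and recover a typed pointer through a small dispatch helper. This is only a sketch of that idea, not something this PR implements: `Datatype`, `visit_typed`, and `sum` are hypothetical names introduced for illustration.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the datatype stored alongside the erased data.
enum class Datatype { float32, uint8 };

// Casts the erased pointer back to its element type and passes the typed
// pointer to f. f must accept both const float* and const std::uint8_t*.
template <class F>
void visit_typed(Datatype type, const void* data, F&& f) {
  switch (type) {
    case Datatype::float32:
      f(static_cast<const float*>(data));
      return;
    case Datatype::uint8:
      f(static_cast<const std::uint8_t*>(data));
      return;
  }
  throw std::invalid_argument("unsupported datatype");
}

// Example use: sum n elements regardless of the stored element type.
inline double sum(Datatype type, const void* data, std::size_t n) {
  double total = 0.0;
  visit_typed(type, data, [&](const auto* p) {
    for (std::size_t i = 0; i < n; ++i) {
      total += p[i];
    }
  });
  return total;
}
```

The cast is only safe when the tag truly matches the stored element type, so in practice the tag and the pointer would need to come from the same type-erased object.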
## New Features

The major features in this PR:

* Folders `index` and `api`. There are files in each of the folders for flat_l2, flat_pq, ivf_flat, ivf_pq, vamana, and vamana_pq. All of these indexes except ivf_pq and vamana_pq have been implemented in C++; flat_l2 and ivf_flat also have type-erased implementations.
* `FeatureVector` and `FeatureVectorArray` were taken from `api.h` and put into their own files, `api/feature_vector.h` and `api/feature_vector_array.h`. `FeatureVector` is the type-erased API for `tdbVector` and `FeatureVectorArray` is the type-erased interface to `tdbMatrix`. Unlike `tdbMatrix`, `FeatureVectorArray` always loads all of its data. `FeatureVector` is based on `tdbVector`, which always loaded all of its data.
* `partitioned_matrix`, which represents a completely in-memory partitioned matrix (similar to CSR, but over vectors rather than single matrix entries).
* `tdb_partitioned_matrix`, which loads a `partitioned_matrix` with data from a TileDB group. It does not load any data upon creation, but rather loads data on invocation of the `load` member function. The queries that support out-of-core operation now take a `feature_vector_array` (concept) as the argument for their partitioned data and assume no data has been loaded. They now loop over the out-of-core partitions with a common pattern.
* `MatrixView`, a class that is a lightweight non-owning wrapper, similar to `std::mdspan`. It is used when converting data in a type-erased class to a format usable in a typed C++ class.
* The `ivf_flat_index` C++ class now interfaces more cleanly to C++ queries. This class has query member functions for all of the `ivf/qv` queries, for the purposes of benchmarking.
* The `ivf_flat_index` and `flat_l2_index` classes. The CLI programs are now divided into an "index" program that ingests raw data and writes an index, and a "query" program that reads an index and applies a query. CLI programs are available for `flat_l2`, `ivf_flat`, `pq_flat`, and `vamana`. These have been tested but not seriously benchmarked.
* `qv_query_infinite_ram` and `qv_query_finite_ram`, respectively. With future benchmarking and profiling, we will ensure that those are the best performing queries and that they have the same or better performance than the CLI and Python benchmarks used for our first blog post.
* The `tiledb::Array` member of `tdbMatrix` is now held in a `std::unique_ptr` due to an issue with moving a `tiledb::Array`. Also changed the corresponding `tdb_helper` to return a `std::unique_ptr<tiledb::Array>` rather than a plain `tiledb::Array`.
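The CSR-like layout and the load-driven loop described in the list above can be illustrated with a small in-memory stand-in. This is a sketch under assumptions rather than the PR's code: `PartitionedVectors`, `count_vectors`, and the convention that `load()` returns `false` when no partitions remain are all hypothetical, and the data here already lives in memory, so `load()` only advances a window instead of reading from TileDB.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in: fixed-dimension vectors stored contiguously and
// grouped into partitions by an offsets array (CSR-like, but over vectors).
class PartitionedVectors {
 public:
  PartitionedVectors(std::size_t dim, std::vector<float> data,
                     std::vector<std::size_t> offsets,
                     std::size_t max_resident)
      : dim_(dim),
        data_(std::move(data)),
        offsets_(std::move(offsets)),
        max_resident_(max_resident) {
    assert(!offsets_.empty() && max_resident_ > 0);
  }

  // Advances the resident window to the next batch of partitions.
  // Returns false once every partition has been visited.
  bool load() {
    first_ = last_;
    if (first_ >= num_partitions()) {
      return false;
    }
    last_ = std::min(first_ + max_resident_, num_partitions());
    return true;
  }

  std::size_t num_partitions() const { return offsets_.size() - 1; }
  std::size_t first_resident() const { return first_; }
  std::size_t last_resident() const { return last_; }

  // Number of vectors in partition p.
  std::size_t partition_size(std::size_t p) const {
    return offsets_[p + 1] - offsets_[p];
  }

  // Pointer to the first element of the j-th vector of partition p.
  const float* vector_at(std::size_t p, std::size_t j) const {
    return data_.data() + (offsets_[p] + j) * dim_;
  }

 private:
  std::size_t dim_;
  std::vector<float> data_;
  std::vector<std::size_t> offsets_;
  std::size_t max_resident_;
  std::size_t first_ = 0;
  std::size_t last_ = 0;
};

// Visits every partition one resident batch at a time and counts vectors.
inline std::size_t count_vectors(PartitionedVectors& pv) {
  std::size_t n = 0;
  while (pv.load()) {
    for (std::size_t p = pv.first_resident(); p < pv.last_resident(); ++p) {
      n += pv.partition_size(p);
    }
  }
  return n;
}
```

A query written against this shape never needs to know whether the whole index fits in memory: with a large enough `max_resident` the loop body runs once, otherwise it runs once per batch.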