Type-erased IVFFlatIndex class #154

Closed

@lums658 lums658 commented Nov 10, 2023

This PR implements a type-erased API for the C++ ivf_index class. It builds on

Type Erasure

Type erasure is accomplished as a three-layer cake:

  • A non-templated abstract base class with pure virtual functions for the member functions that we want to expose from the typed C++ classes.
  • An implementation class that inherits from the abstract base class. It is templated on the concrete class to be wrapped and keeps an object of that class as a member variable. Its member functions (all of which override the abstract base class member functions) forward to the underlying C++ implementation by invoking the appropriate member of the wrapped object. At this point, the internal data stored by the typed member variable also needs to be converted to the appropriate type before the member function is invoked.
  • A non-templated class that presents the user API. It internally defines the abstract base class and the implementation class, and it holds a std::unique_ptr to the abstract base class as a member variable. During construction (either by passing in already constructed vectors or by reading the index from a TileDB group), the appropriate template types for the data held by the internal implementation are inferred, and an object of the implementation class is constructed and stored in the std::unique_ptr.

To illustrate the basic idea, consider FeatureVector. In abbreviated form, showing only a single member function, data, it looks like this:

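class FeatureVector {
 public:
  FeatureVector(const tiledb::Context& ctx, const std::string& uri) {
    // Get the type of the vector stored in the array at uri -- say, float
    feature_vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
  }

  auto data() {
    return feature_vector_->data();
  }

 private:
  // Non-templated abstract interface
  class Base {
   public:
    virtual ~Base() = default;
    virtual void* data() = 0;
  };

  // Templated implementation that owns the typed vector and forwards to it
  template <class T>
  class vector_impl : public Base {
   public:
    explicit vector_impl(T&& t)
        : impl_vector_(std::forward<T>(t)) {
    }
    vector_impl(const tiledb::Context& ctx, const std::string& uri)
        : impl_vector_(ctx, uri) {
    }
    void* data() override {
      return impl_vector_.data();
    }

   private:
    T impl_vector_;
  };

  std::unique_ptr<Base> feature_vector_;
};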

The constructor to read the FeatureVector from a TileDB array has the following prototype:

FeatureVector(const tiledb::Context& ctx, const std::string& uri);

When that constructor is invoked, it first reads the schema associated with the uri and creates an implementation object based on the stored type. For example, if the type read from the schema (feature_type_) is float or uint8, the constructor dispatches like this:

switch (feature_type_) {
  case TILEDB_FLOAT32:
    feature_vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
    break;
  case TILEDB_UINT8:
    feature_vector_ = std::make_unique<vector_impl<tdbVector<uint8_t>>>(ctx, uri);
    break;
}

At this point, we have created a std::unique_ptr to the abstract base class that points to an object of the derived class.
If we invoke the data member function of the outer (type-erased) FeatureVector class, we dispatch to the corresponding member of the object stored in the std::unique_ptr:

auto data() const {
  return feature_vector_->data();
}

Since feature_vector_ actually points to the derived implementation class, its data member function is then invoked:

void* data() override {
  // impl_vector_ is an object of type T, not a pointer
  return impl_vector_.data();
}

We return a void* since data() is an override of the non-templated Base class.
(TODO: In a future PR, maybe we can cast to an appropriate type extracted from the type of feature_vector_?)
(TODO: Is there a way to condense the boilerplate that is currently contained in all of these?)

New Features

The major features in this PR:

  • Rearranged the typed and type-erased classes into separate folders, index and api, respectively. There are files in each of the folders for flat_l2, flat_pq, ivf_flat, ivf_pq, vamana, and vamana_pq. All of these indexes except ivf_pq and vamana_pq have been implemented in C++; flat_l2 and ivf_flat also have type-erased implementations.
  • Separated FeatureVector and FeatureVectorArray from api.h and put them into their own files, api/feature_vector.h and api/feature_vector_array.h. FeatureVector is the type-erased API for tdbVector and FeatureVectorArray is the type-erased interface to tdbMatrix. Unlike tdbMatrix, FeatureVectorArray always loads all of its data. FeatureVector is based on tdbVector, which always loads all of its data.
  • Created a new class partitioned_matrix that represents a completely in-memory partitioned matrix (similar to CSR, but over vectors rather than single matrix entries).
  • Created a new class tdb_partitioned_matrix that loads a partitioned_matrix with data from a TileDB group. It does not load any data upon creation, but rather loads data when the load member function is invoked. The queries that support out-of-core operation now take a feature_vector_array (concept) as the argument for their partitioned data and assume that no data has been loaded. They now loop over the out-of-core partitions with the following pattern:
while (partitioned_vectors.load()) {
  // query in-memory partitions
}
  • Created a MatrixView class that is a lightweight non-owning wrapper, similar to std::mdspan. It is used when converting data in a type-erased class to a format usable in a typed C++ class.
  • Completed the ivf_flat_index C++ class to interface more cleanly to C++ queries. This class has query member functions for all of the ivf/qv queries, for the purposes of benchmarking.
  • Created unit tests for all of the new C++ classes and all of the type-erased classes.
  • Created CLI programs for the new ivf_flat_index and flat_l2_index classes. The CLI programs are now divided into an "index" program that ingests raw data and writes an index, and a "query" program that reads an index and applies a query. CLI programs are available for flat_l2, ivf_flat, pq_flat, and vamana. These have been tested but not seriously benchmarked.
  • The IVFFlat class has a single query member function for infinite ram, and a single query member function for finite ram. They point to qv_query_infinite_ram and qv_query_finite_ram, respectively. With future benchmarking and profiling, we will ensure that those are the best performing queries and that they have the same or better performance than the CLI and Python benchmarks used for our first blog post.
  • Wrapped the tiledb::Array member of tdbMatrix in a std::unique_ptr due to an issue with moving a tiledb::Array. Also changed the corresponding tdb_helper to return a std::unique_ptr<tiledb::Array> rather than a plain tiledb::Array.
  • Applied clang-format-14.
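
As a rough illustration of the partitioned layout and the load loop described in the list above, here is a minimal sketch. It is not the PR's partitioned_matrix or tdb_partitioned_matrix: the class name BatchedPartitions, its members, and the in-memory "backing" vector that stands in for a TileDB group are all assumptions made for the example. It only shows the CSR-style offsets over whole vectors and a load() that returns false once every partition has been visited.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch; the class and member names are illustrative only.
class BatchedPartitions {
 public:
  // offsets follows the CSR convention: partition p holds vectors
  // [offsets[p], offsets[p + 1]), so it has num_partitions + 1 entries.
  // backing simulates the on-disk data: num_vectors * dim floats.
  BatchedPartitions(std::vector<float> backing,
                    std::vector<std::size_t> offsets,
                    std::size_t dim,
                    std::size_t max_vectors)
      : backing_(std::move(backing))
      , offsets_(std::move(offsets))
      , dim_(dim)
      , max_vectors_(max_vectors) {
  }

  // Bring the next batch of whole partitions into memory.
  // Returns false once every partition has been visited.
  bool load() {
    if (next_ + 1 >= offsets_.size()) {
      return false;
    }
    const std::size_t num_parts = offsets_.size() - 1;
    first_ = next_;
    std::size_t count = 0;
    // Always accept one partition so the loop makes progress, then keep
    // adding partitions while the batch stays within the budget.
    do {
      count += offsets_[next_ + 1] - offsets_[next_];
      ++next_;
    } while (next_ < num_parts &&
             count + (offsets_[next_ + 1] - offsets_[next_]) <= max_vectors_);
    const auto begin =
        backing_.begin() + static_cast<std::ptrdiff_t>(offsets_[first_] * dim_);
    resident_.assign(begin, begin + static_cast<std::ptrdiff_t>(count * dim_));
    return true;
  }

  std::size_t first_partition() const { return first_; }
  std::size_t num_resident_partitions() const { return next_ - first_; }
  std::size_t num_resident_vectors() const { return resident_.size() / dim_; }
  const float* vector(std::size_t i) const { return resident_.data() + i * dim_; }

 private:
  std::vector<float> backing_;
  std::vector<std::size_t> offsets_;
  std::size_t dim_;
  std::size_t max_vectors_;
  std::size_t next_ = 0;
  std::size_t first_ = 0;
  std::vector<float> resident_;
};

// The out-of-core pattern from the text: one resident batch at a time.
std::size_t count_vectors_seen(BatchedPartitions& parts) {
  std::size_t seen = 0;
  while (parts.load()) {
    seen += parts.num_resident_vectors();  // query in-memory partitions here
  }
  return seen;
}
```

The query code never needs to know how many batches there are; it only sees whichever partitions are resident after each successful load().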

This pull request has been linked to Shortcut Story #36874: Create type-erased IVFFlatIndex.

@lums658 lums658 mentioned this pull request Dec 14, 2023
lums658 added a commit that referenced this pull request Jan 24, 2024
## Type Erased FeatureVector and FeatureVectorArray
This is a recap of the feature vector and feature vector array type-erasure component of #154.  

* The primary contribution of this PR is the implementation of type-erased `FeatureVector` and `FeatureVectorArray`.  Since one of the goals of having the type-erased classes is to seamlessly integrate with Python, the Python bindings for these classes were also added.
* Some updating of matrix and vector I/O was done since the type-erased classes were just copied over.  To update matrix and vector, the files implementing them were copied over the existing files, with some minimal changes.
* The CPO `num_vectors()` replaced `size()` in numerous places.

**NOTE:**  The type-erased Python binding files that were copied over included code for index classes.  This code has been temporarily disabled (usually with `#if 0`), pending the next PR.

## Overview of Type Erasure

(See also the README.md in include/api).

Type erasure is accomplished as a three-layer cake:
* A non-templated abstract base class with pure virtual functions for the member functions that we want to expose from the typed C++ classes.
* An implementation class that inherits from the abstract base class.  It is templated on the concrete class to be wrapped and keeps an object of that class as a member variable.  Its member functions (all of which override the abstract base class member functions) forward to the underlying C++ implementation by invoking the appropriate member of the wrapped object.  At this point, the internal data stored by the typed member variable also needs to be converted to the appropriate type before the member function is invoked.
* A non-templated class that presents the user API.  It internally defines the abstract base class and the implementation class, and it holds a `std::unique_ptr` to the abstract base class as a member variable.  During construction (either by passing in already constructed vectors or by reading the index from a TileDB group), the appropriate template types for the data held by the internal implementation are inferred, and an object of the implementation class is constructed and stored in the `std::unique_ptr`.

To illustrate the basic idea, consider `FeatureVector`.  In abbreviated form, showing only a single member function, `data`, it looks like this:
```c++
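class FeatureVector {
 public:
  FeatureVector(const tiledb::Context& ctx, const std::string& uri) {
    // Get the type of the vector stored in the array at uri -- say, float
    feature_vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
  }

  auto data() {
    return feature_vector_->data();
  }

 private:
  // Non-templated abstract interface
  class Base {
   public:
    virtual ~Base() = default;
    virtual void* data() = 0;
  };

  // Templated implementation that owns the typed vector and forwards to it
  template <class T>
  class vector_impl : public Base {
   public:
    explicit vector_impl(T&& t)
        : impl_vector_(std::forward<T>(t)) {
    }
    vector_impl(const tiledb::Context& ctx, const std::string& uri)
        : impl_vector_(ctx, uri) {
    }
    void* data() override {
      return impl_vector_.data();
    }

   private:
    T impl_vector_;
  };

  std::unique_ptr<Base> feature_vector_;
};
```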

The constructor to read the `FeatureVector` from a TileDB array has the following prototype:
```c++
FeatureVector(const tiledb::Context& ctx, const std::string& uri);
```
When that constructor is invoked, it first reads the schema associated with the `uri` and creates an implementation object based on the stored type.  For example, if the type read from the schema (`feature_type_`) is `float` or `uint8`, the constructor dispatches like this:
```c++
switch (feature_type_) {
  case TILEDB_FLOAT32:
    feature_vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
    break;
  case TILEDB_UINT8:
    feature_vector_ = std::make_unique<vector_impl<tdbVector<uint8_t>>>(ctx, uri);
    break;
}
```
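
The dispatch step can be tried in isolation with a small stand-alone sketch. Everything named here (`Datatype`, `VectorBase`, `VectorImpl`, `make_vector`) is a hypothetical stand-in rather than the PR's code: the enum replaces the TileDB datatype read from the schema, and `std::vector<T>` replaces `tdbVector<T>`. The `default` case that throws is also an addition of this sketch; the abbreviated switch above does not show one.

```c++
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical runtime tag standing in for the datatype stored in the schema.
enum class Datatype { Float32, UInt8, Int64 };

// Non-templated interface that the outer class would hold by pointer.
struct VectorBase {
  virtual ~VectorBase() = default;
  virtual std::size_t element_size() const = 0;
  virtual std::size_t size() const = 0;
};

// Templated implementation; std::vector<T> stands in for the typed vector.
template <class T>
struct VectorImpl : VectorBase {
  explicit VectorImpl(std::size_t n)
      : vec_(n) {
  }
  std::size_t element_size() const override { return sizeof(T); }
  std::size_t size() const override { return vec_.size(); }
  std::vector<T> vec_;
};

// Pick the template instantiation from a value known only at run time.
std::unique_ptr<VectorBase> make_vector(Datatype type, std::size_t n) {
  switch (type) {
    case Datatype::Float32:
      return std::make_unique<VectorImpl<float>>(n);
    case Datatype::UInt8:
      return std::make_unique<VectorImpl<std::uint8_t>>(n);
    default:
      // Assumption of this sketch: unsupported types are reported, not ignored.
      throw std::runtime_error("unsupported feature type");
  }
}
```

After `make_vector` returns, the element type is no longer visible in the static type of the result; only the virtual interface is.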
At this point, we have created a `std::unique_ptr` to the abstract base class that points to an object of the derived class.
If we invoke the `data` member function of the outer (type-erased) `FeatureVector` class, we dispatch to the corresponding member of the object stored in the `std::unique_ptr`:
```c++
auto data() const {
  return feature_vector_->data();
}
```
Since `feature_vector_` actually points to the derived implementation class, its `data` member function is then invoked:
```c++
void* data() override {
  // impl_vector_ is an object of type T, not a pointer
  return impl_vector_.data();
}
```
We return a `void*` since `data()` is an override of the non-templated `Base` class.
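
To make the consequence of returning `void*` concrete, the following is a minimal, self-contained sketch of the three layers together with a caller that recovers a typed pointer. It is an illustration under stated assumptions, not the PR's implementation: `ErasedVector`, `FeatureType`, `tag_of`, `sum_as`, and `sum` are hypothetical names, `std::vector<T>` stands in for `tdbVector<T>`, and construction uses the "already constructed vector" path rather than reading from a TileDB array.

```c++
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for the datatype tag stored in the array schema.
enum class FeatureType { Float32, UInt8 };

// Map a C++ element type to its runtime tag at compile time.
template <class T>
constexpr FeatureType tag_of() {
  static_assert(std::is_same_v<T, float> || std::is_same_v<T, std::uint8_t>,
                "unsupported element type");
  return std::is_same_v<T, float> ? FeatureType::Float32 : FeatureType::UInt8;
}

// Layer 3: the non-templated class that users see.
class ErasedVector {
  // Layer 1: non-templated abstract interface.
  struct Base {
    virtual ~Base() = default;
    virtual void* data() = 0;
    virtual std::size_t size() const = 0;
  };

  // Layer 2: templated wrapper that owns the typed object and forwards to it.
  template <class V>
  struct Impl : Base {
    explicit Impl(V v)
        : vec_(std::move(v)) {
    }
    void* data() override { return vec_.data(); }
    std::size_t size() const override { return vec_.size(); }
    V vec_;
  };

 public:
  // Construct from an already-built typed container; V is erased afterwards.
  template <class V>
  explicit ErasedVector(V v)
      : type_(tag_of<typename V::value_type>())
      , impl_(std::make_unique<Impl<V>>(std::move(v))) {
  }

  FeatureType feature_type() const { return type_; }
  std::size_t size() const { return impl_->size(); }
  void* data() { return impl_->data(); }

 private:
  FeatureType type_;
  std::unique_ptr<Base> impl_;
};

// Caller side: reinterpret the void* as T once the tag says that is safe.
template <class T>
double sum_as(ErasedVector& v) {
  const T* p = static_cast<const T*>(v.data());
  double s = 0.0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    s += p[i];
  }
  return s;
}

// Dispatch on the stored tag to choose the element type for the cast.
double sum(ErasedVector& v) {
  switch (v.feature_type()) {
    case FeatureType::Float32:
      return sum_as<float>(v);
    case FeatureType::UInt8:
      return sum_as<std::uint8_t>(v);
  }
  return 0.0;  // not reached for the tags above
}
```

For example, `ErasedVector v(std::vector<float>{1.5f, 2.5f});` followed by `sum(v)` goes through the virtual `data()` and then casts according to `feature_type()`. The cast is only correct because the tag was recorded when the typed object was wrapped.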

lums658 commented Feb 12, 2024

Changes here incorporated in PR #210.

@lums658 lums658 closed this Feb 12, 2024