Type erased feature vector and feature vector array classes #210

Merged
merged 29 commits into from
Jan 24, 2024

@lums658 lums658 commented Jan 23, 2024

Type Erased FeatureVector and FeatureVectorArray

This is a recap of the feature vector and feature vector array type-erasure component of #154.

  • The primary contribution of this PR is the implementation of type-erased FeatureVector and FeatureVectorArray. Since one of the goals of having the type-erased classes is to seamlessly integrate with Python, the Python bindings for these classes were also added.
  • Some updating of matrix and vector I/O was done since the type-erased classes were just copied over. To update matrix and vector, the files implementing them were copied over the existing files, with some minimal changes.
  • The customization point object (CPO) num_vectors() replaced size() in numerous places.
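
The PR description does not show how num_vectors() is defined, so the following is only a minimal sketch of what a customization point object of that name could look like. Everything in the `sketch` namespace (`has_num_vectors`, `num_vectors_fn`, `ToyMatrix`) is hypothetical and is not taken from the repository. It illustrates why a dedicated CPO is less ambiguous than size(): for a matrix of feature vectors, size() may reasonably mean the total number of elements, while num_vectors() always means the number of vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {

// Detects whether T has a member function num_vectors().
template <class T, class = void>
struct has_num_vectors : std::false_type {};

template <class T>
struct has_num_vectors<
    T, std::void_t<decltype(std::declval<const T&>().num_vectors())>>
    : std::true_type {};

// The function object: prefers a member num_vectors(), and otherwise
// falls back to size() (assumed here to suit a container of vectors).
struct num_vectors_fn {
  template <class T>
  constexpr std::size_t operator()(const T& t) const {
    if constexpr (has_num_vectors<T>::value) {
      return t.num_vectors();
    } else {
      return t.size();
    }
  }
};

inline constexpr num_vectors_fn num_vectors{};

// Hypothetical matrix whose columns are feature vectors.
struct ToyMatrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> storage;

  ToyMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), storage(r * c) {}

  std::size_t size() const { return storage.size(); }  // total elements
  std::size_t num_vectors() const { return cols; }     // number of columns
};

// Usage: the same call works for both kinds of container.
inline void example_usage() {
  ToyMatrix m(128, 10);
  assert(num_vectors(m) == 10);  // not m.size(), which is 1280

  std::vector<std::vector<float>> vs(3);
  assert(num_vectors(vs) == 3);  // falls back to size()
}

}  // namespace sketch
```

Generic algorithms can then call `num_vectors(x)` without knowing whether `x` is a matrix, a partitioned matrix, or a plain container of vectors.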

Files Added (Copied from #173)

  • include/api/README.md
  • include/api/api_defs.h
  • include/api/feature_vector.h
  • include/api/feature_vector_array.h
  • include/tdb_defs.h
  • include/test/test_utils.h
  • include/test/unit_api_feature_vector.cc
  • include/test/unit_api_feature_vector_array.cc
  • external/test_data/arrays/bigann10k

Files Copied Over Previous

  • include/detail/ivf/vq.h
  • include/detail/linalg/linalg_defs.h
  • include/detail/linalg/matrix.h
  • include/detail/linalg/tdb_io.h
  • include/detail/linalg/vector.h

Files Modified

In addition, the following had small modifications (mostly to change size() to num_vectors()):

  • include/algorithm.h
  • include/detail/flat/qv.h
  • include/detail/flat/vq.h
  • include/detail/ivf/dist_qv.h
  • include/detail/ivf/index.h
  • include/detail/ivf/partition.h
  • include/detail/ivf/qv.h
  • include/detail/linalg/tdb_matrix.h
  • include/detail/linalg/tdb_partitioned_matrix.h
  • include/scoring.h
  • include/test/CMakeLists.txt
  • include/test/unit_ivf_qv.cc
  • include/test/unit_ivf_vq.cc
  • src/ivf_flat.cc

Python Bindings

Added (Copied from #173)

  • python/src/tiledb/vector_search/module2.cc
  • python/test/test_module2.py

Modified

  • python/CMakeLists.txt
  • python/src/tiledb/vector_search/module.cc
  • python/src/tiledb/vector_search/__init__.py

NOTE: The type-erased Python binding files that were copied over included code for index classes. This code has been temporarily disabled (usually with #if 0), pending the next PR.

Arrays

  • bigann10k (this is resulting in a large number of reported changed files)

Overview of Type Erasure

(See also the README.md in include/api).

Type erasure is accomplished as a three-layer cake:

  • A non-templated abstract base class with pure virtual functions for the member functions that we want to expose from the typed C++ classes.
  • An implementation class that inherits from the abstract base class. It is templated on the concrete class to be wrapped and keeps a member variable that is an object of that class. Its member functions (all of which override the abstract base class member functions) forward to the underlying C++ implementation by invoking the corresponding member of the wrapped object. At this point, the internal data stored by the typed member variable also needs to be converted to the appropriate type before the wrapped object's member function is invoked.
  • A non-templated class that presents the user API. It internally defines the abstract base class and the implementation class, and it has a std::unique_ptr to the abstract base class as a member variable. During construction (either by passing in already constructed vectors or by reading the index from a TileDB group), the appropriate template types for the data to be stored by the internal implementation are inferred, and an object of the implementation class is constructed and stored in the std::unique_ptr.

To illustrate the basic idea, consider FeatureVector. In abbreviated form, showing only a single member function, data(), it looks like this:

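class FeatureVector {
 public:
  FeatureVector(const tiledb::Context& ctx, const std::string& uri) {
    // Get the type of the vector stored in the uri array -- say, float
    feature_vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
  }

  auto data() {
    return feature_vector_->data();
  }

 private:
  // Layer 1: non-templated abstract interface
  class Base {
   public:
    virtual ~Base() = default;
    virtual void* data() = 0;
  };

  // Layer 2: templated wrapper around the concrete, typed vector
  template <class T>
  class vector_impl : public Base {
   public:
    explicit vector_impl(T&& t)
        : impl_vector_(std::forward<T>(t)) {
    }
    // Construct the wrapped vector directly from the array at uri
    vector_impl(const tiledb::Context& ctx, const std::string& uri)
        : impl_vector_(ctx, uri) {
    }
    void* data() override {
      return impl_vector_.data();
    }

   private:
    T impl_vector_;
  };

  // Layer 3 (this class) owns the implementation through the base interface
  std::unique_ptr<Base> feature_vector_;
};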
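
The abbreviated class above depends on TileDB types, so it cannot be compiled on its own. As a self-contained illustration of the same three layers, here is a minimal sketch in which `std::vector<T>` stands in for the typed vector class. The names `ErasedVector` and `dimension()` are assumptions made for this example only and are not the names used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Layer 3: non-templated user-facing class.
class ErasedVector {
 public:
  // Wrap an already constructed typed vector; T is inferred here and
  // then "forgotten" behind the Base pointer.
  template <class T>
  explicit ErasedVector(std::vector<T> v)
      : impl_(std::make_unique<Impl<std::vector<T>>>(std::move(v))) {}

  void* data() { return impl_->data(); }
  std::size_t dimension() const { return impl_->dimension(); }

 private:
  // Layer 1: non-templated abstract interface.
  struct Base {
    virtual ~Base() = default;
    virtual void* data() = 0;
    virtual std::size_t dimension() const = 0;
  };

  // Layer 2: templated implementation that forwards to the wrapped object.
  template <class V>
  struct Impl : Base {
    explicit Impl(V v) : vec_(std::move(v)) {}
    void* data() override { return vec_.data(); }
    std::size_t dimension() const override { return vec_.size(); }
    V vec_;
  };

  std::unique_ptr<Base> impl_;
};

// Usage: two different element types behind one non-templated class.
inline void example_usage() {
  ErasedVector f(std::vector<float>{1.0f, 2.0f, 3.0f});
  ErasedVector b(std::vector<std::uint8_t>{7, 9});

  assert(f.dimension() == 3);
  assert(b.dimension() == 2);
  // The caller must know the element type to interpret the pointer.
  assert(static_cast<float*>(f.data())[1] == 2.0f);
  assert(static_cast<std::uint8_t*>(b.data())[0] == 7);
}
```

Because `ErasedVector` itself is not a template, it can be exposed to Python as a single class, regardless of how many element types are supported underneath.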

The constructor to read the `FeatureVector` from a TileDB array has the following prototype:
```c++
FeatureVector(const tiledb::Context& ctx, const std::string& uri);
```

When that constructor is invoked, it first reads the schema associated with the uri and creates an implementation object based on the type found there. For example, if the type read from the schema (feature_type_) is either float or uint8, the constructor dispatches like this:

switch (feature_type_) {
  case TILEDB_FLOAT32:
    vector_ = std::make_unique<vector_impl<tdbVector<float>>>(ctx, uri);
    break;
  case TILEDB_UINT8:
    vector_ = std::make_unique<vector_impl<tdbVector<uint8_t>>>(ctx, uri);
    break;
}
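
The switch above is the point where a runtime type tag is turned into a compile-time template argument. A self-contained sketch of that step follows, under the assumption that a small enum can stand in for the TileDB datatype read from the schema. `FeatureType`, `VectorBase`, `VectorImpl`, `element_size()`, and `make_vector` are hypothetical names for this example. The sketch also adds a default branch that rejects unsupported types, which is a design choice of the sketch rather than something stated in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the datatype tag read from the array schema.
enum class FeatureType { Float32, UInt8, Int32 };

struct VectorBase {
  virtual ~VectorBase() = default;
  virtual void* data() = 0;
  virtual std::size_t element_size() const = 0;
};

template <class T>
struct VectorImpl : VectorBase {
  explicit VectorImpl(std::size_t n) : values_(n) {}
  void* data() override { return values_.data(); }
  std::size_t element_size() const override { return sizeof(T); }
  std::vector<T> values_;
};

// Map the runtime tag to a concrete instantiation of VectorImpl.
inline std::unique_ptr<VectorBase> make_vector(FeatureType type,
                                               std::size_t n) {
  switch (type) {
    case FeatureType::Float32:
      return std::make_unique<VectorImpl<float>>(n);
    case FeatureType::UInt8:
      return std::make_unique<VectorImpl<std::uint8_t>>(n);
    default:
      // Int32 is deliberately unhandled to show the failure path.
      throw std::runtime_error("unsupported feature type");
  }
}

// Usage: callers only ever see the base interface.
inline void example_usage() {
  auto f = make_vector(FeatureType::Float32, 4);
  auto b = make_vector(FeatureType::UInt8, 4);
  assert(f->element_size() == sizeof(float));
  assert(b->element_size() == 1);
}
```

Each additional supported element type costs one more case in the switch, which is part of the boilerplate the TODO below asks about.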

At this point, we have created a std::unique_ptr of the abstract base class that points to an object of the derived class.
If we invoke the data member function of the outer (type-erased) FeatureVector class, we dispatch to the corresponding member of the object stored in the std::unique_ptr:

auto data() const {
  return feature_vector_->data();
}

Since feature_vector_ actually points to the derived implementation class, its data member function is then invoked:

void* data() override {
  // impl_vector_ is held by value, so use '.' rather than '->'
  return impl_vector_.data();
}

We return a void* since data() is an override of the non-templated Base class.
(TODO: In a future PR maybe we can cast to an appropriate type extracted from the type of vector_?)
(TODO: Is there a way to condense the boilerplate that is currently contained in all of these?)
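
Regarding the first TODO, one possible direction is a checked, caller-side cast that compares the requested element type against the stored type tag before converting the void*. The following is only a sketch of that idea and is not part of this PR. `FeatureType`, `type_tag`, `ErasedView`, and `data_as` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical runtime tag, standing in for the stored feature type.
enum class FeatureType { Float32, UInt8 };

// Maps a C++ element type to its runtime tag.
template <class T>
struct type_tag;

template <>
struct type_tag<float> {
  static constexpr FeatureType value = FeatureType::Float32;
};

template <>
struct type_tag<std::uint8_t> {
  static constexpr FeatureType value = FeatureType::UInt8;
};

// Stand-in for what a type-erased vector exposes: a tag and a void*.
struct ErasedView {
  FeatureType feature_type;
  void* data;
};

// Checked cast: succeeds only if T matches the stored tag.
template <class T>
T* data_as(const ErasedView& v) {
  if (v.feature_type != type_tag<T>::value) {
    throw std::runtime_error("feature type mismatch");
  }
  return static_cast<T*>(v.data);
}

// Usage: the matching type yields a typed pointer.
inline void example_usage() {
  std::vector<float> values{1.5f, 2.5f};
  ErasedView view{FeatureType::Float32, values.data()};
  assert(data_as<float>(view)[1] == 2.5f);
}
```

This moves the type check to a single place but still requires the caller to name the type, so it reduces the risk of a wrong cast without removing the void* from the interface.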

import os
# TODO Use python Pathlib
# m1_root = "/Users/lums/TileDB/TileDB-Vector-Search/external/data/gp3/"

Remove.
Same below for vector_search_root.

@lums658 lums658 merged commit 5f4797d into main Jan 24, 2024
5 checks passed
@lums658 lums658 deleted the lums/tmp/type-erased branch January 24, 2024 19:06