Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for empty and matrices and vectors #254

Closed
wants to merge 14 commits into from

Conversation

jparismorgan
Copy link
Contributor

What

When we add type-erased APIs we will have a flow where we need to create an empty index:

def create(
    uri: str,
    dimensions: int,
    vector_type: np.dtype,
    id_type: np.dtype = np.uint32,
    adjacency_row_index_type: np.dtype = np.uint32,
    group_exists: bool = False,
    config: Optional[Mapping[str, Any]] = None,
    storage_version: str = STORAGE_VERSION,
    **kwargs,
) -> VamanaIndex:
      ctx = vspy.Ctx(config)
      index = vspy.IndexVamana(
          feature_type=np.dtype(vector_type).name, 
          id_type=np.dtype(id_type).name, 
          adjacency_row_index_type=np.dtype(adjacency_row_index_type).name, 
          dimension=dimensions,
        )
      index.write_index(ctx, uri)
      return VamanaIndex(uri=uri, config=config, memory_budget=1000000) # this is a Python VamanaIndex object.

When we do that we need to be able to write empty arrays and vectors. This writing of empty arrays and vectors works, but when we try to read them back we use code like this for both vectors and matrices to fill in the returned data:

if (start_pos == 0) {
    start_pos = array_rows_.template domain<domain_type>().first;
  }
  if (end_pos == 0) {
    end_pos = array_rows_.template domain<domain_type>().second + 1;
  }

Because of this we then try to run the query:

tiledb::Query query(ctx, *array_);
  query.set_subarray(subarray).set_data_buffer(
      attr_name, data_.data(), vec_rows_);
  tiledb_helpers::submit_query(tdb_func__, uri, query);

But we crash because data_.data() is actually empty and so there is no where to write the data to.

The fix that this PR makes is change the default behavior to not update the start and end positions and to return early if there is nothing to read. Ideally we would get to a point where we never need the default behavior and the caller is responsible for correctly asking for how much data to read. This is based off of what Python does where it has the caller manage how much to read, versus the reader automatically reading all it can. Though please do let me know if you have a different suggestion on how to fix this.

Testing

  • Added new unit tests.
  • Existing unit tests pass.

@jparismorgan jparismorgan marked this pull request as ready for review February 29, 2024 15:01
Copy link
Contributor

@lums658 lums658 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main comment -- Instead of having a flag to indicate whether to read the entire array (i.e., use extents as dimensions to read into), use an overload that simply omits any specified limits.

Using special values of parameters to indicate different cases is probably not a good practice in general, unless the special value is absolutely, positively not a valid value (such as a nullptr). Mea culpa.

@@ -36,15 +36,15 @@ def load_as_matrix(
a = tiledb.ArraySchema.load(path, ctx=tiledb.Ctx(config))
dtype = a.attr(0).dtype
if dtype == np.float32:
m = tdbColMajorMatrix_f32(ctx, path, 0, 0, 0, size, 0, timestamp)
m = tdbColMajorMatrix_f32(ctx, path, 0, 0, 0, size, 0, timestamp, True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this interface to tdbMatrix. I think what I would rather see (and I think have done elsewhere for something else) is to use an overload so that if we want to read the whole thing, we don't pass in a limit at all.

(Using special values of variables can be bad if the special value is actually a valid value.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thanks, updated the constructors so that callers can use the appropriate one without this flag.

src/include/detail/linalg/tdb_io.h Show resolved Hide resolved
@@ -130,7 +130,7 @@ class tdbBlockedMatrix : public MatrixBase {
*/
tdbBlockedMatrix(const tiledb::Context& ctx, const std::string& uri) noexcept
requires(std::is_same_v<LayoutPolicy, stdx::layout_left>)
: tdbBlockedMatrix(ctx, uri, 0, 0, 0, 0, 0, 0) {
: tdbBlockedMatrix(ctx, uri, 0, 0, 0, 0, 0, 0, true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, we should not use explicit flag, but just omit all of the view parameters.

And, if there are issues related to ambiguity because of the suggested overloading (I don't think there will be), we should use a class that is a start-stop pair, rather than separate integers. This is something I have had in mind to do anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks, done. Will keep that in mind as well.

@jparismorgan
Copy link
Contributor Author

Note that the Windows build is failing but I will fix before landing. Just opening up this PR again for review so that I can catch @lums658 today while he's still online.

Copy link
Contributor

@lums658 lums658 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a fundamental flaw in the way we have proposed not to have sentinel values for start and stop (et al). Basically, we need to be able to specify whether to ignore or use any of start / stop quantities when reading vectors or matrices -- and that isn't possible to do with overloading.

There is also the issue of having to open and close an array to get its dimensions, before opening the array (again getting its schema) and getting its data.

The only solution I can see to this is to use std::optional to indicate whether or not to use the start / stop value, or whether to use the begin / end of the array. (Presence or absence of a value is what std::optional is for, after all.). This approach also expresses intent of the caller -- using {} instead of a value says "no value" -- i.e., in this case, use the appropriate limit.

Another important change, in conjunction with using std::optional is to use the same type as the domain, since we are representing values in that domain. tdbMatrix has some type aliases that can be used for this, or we could define something globally.

These changes should simplify quite a few things.

PS -- Should we distinguish "empty" from "uninitialized"? Meaning empty is start == stop, whereas in the use case that started all of this, we do have start == stop, but it seems like there may be more to it being a newly-created array than that.

if size == 0:
# Read all rows and all cols. Set no upper_bound.
if dtype == np.float32:
m = tdbColMajorMatrix_f32(ctx, path, 0, timestamp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point (not now), we might want to have a constructor that doesn't take and upper bound when we want to "read the whole thing". Having the special value seems clumsy now. OTOH, I don't think there would be the case where zero would get passed in as a valid value and good enough is probably good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's leave it then and change if we see the need to.

src/include/detail/linalg/tdb_io.h Show resolved Hide resolved
src/include/detail/linalg/tdb_matrix.h Show resolved Hide resolved
auto array_rows_{domain_.dimension(0)};

if (read_full_vector) {
if (start_pos == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thinking about this, just like stop_pos == 0 should have no special meaning as a flag, stop_pos == 0 should have no special meaning (0 should not be a sentinel). That is, we might want to read a "half-full" vector -- starting at a given place and reading to the end or starting at the beginning and reading to a given place -- and the "given place" in either case could be zero, which would be the wrong thing since right now it is either both or neither. We can do that currently (though there is a bug -- if you want to actually start at zero but the array doesn't start at zero). So we shouldn't be using a sentinel value for either start or stop

Unfortunately, we can't determine which half-full case we want with an overload that has one of start or stop omitted.

There are four cases:

  • array begin to array end -- read_full_vector == true in the helper -- at which point start and stop are sentinel values and "end" is indicating by the overloaded function passing the sentinel values to the helper function.
  • array begin to stop -- not supported in current PR (previously was)
  • start to array end -- not supported in current PR (previously was, though a bug for start == 0)
  • start to stop -- in the current PR, 0 is a valid value for either start or stop
    where zero is a valid value for start or stop and array begin or array end (or both) may be zero or may not be zero. Something like Python's range notation would be really helpful....

While we are waiting for C++ to introduce that notation, we need a way to specify "array begin" or "array end" for either start or stop -- without using a sentinel. Or at least not using a sentinel that is a value likely to actually be used.

  • One approach would be to use a sentinel that is unlikely to ever be used, like -MAXINT and MAXINT. Actually, in that case, collisions would be benign -- meaning "read from the beginning" and "read from this value" are the same.
  • Or, we could use std::optional -- an empty value would mean from the beginning/end for start/stop. The notation from the caller side is kind of indicative of what is desired.
template <class T> std::vector<T> read_vector(const tiledb::Context& ctx, const std::string& uri, 
std::optional<int32_t> start_pos, std::optional<int32_t> end_pos, std::optional<size_t> timestamp);

auto foo = read_vector(ctx, "a.tdb", 0, {}, {});
auto bar = read_vector(ctx, "b.tdb", {}, {}, {}); // Could still have overload: read_vector(ctx, "b.tdb", {});

(Note that I am also suggesting we use optional for timestamp, though maybe default value of zero really is what is intended.)

Note also, and very important, start and stop should be int32_t since that is what the vectors and matrix types assume (we should use an appropriate type alias so that they are all consistent).

The constructors for tdbMatrix (et al) have the same issues (but twice as many variables for factorially more combinations).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree there is certainly more we can improve here, but I would propose to go with the current approach because it is working and is simple. If we find scenarios where it doesn't support our needs, then we can update it then.

@@ -114,6 +114,26 @@ class tdbBlockedMatrix : public MatrixBase {
// size_t pending_row_offset{0};
// size_t pending_col_offset{0};

static size_t get_last_row(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I don't think we want to open and close the array to get this info, only to open the array again.

However, I think this is a moot point, per the extensive comment above, as I think we should move everything into one constructor and use std::optional to indicate whether or not we want begin/end of the array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated to new approach.

*/
tdbBlockedMatrix(const tiledb::Context& ctx, const std::string& uri) noexcept
tdbBlockedMatrix(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the extensive comment above, I think the correct interface is to have just a single constructor with std::optional for start_row, et al. This will simplify things a lot (code-wise and conceptually) and also allow the array to just be opened once in order to access its schema (it is opened in the initializer of the general constructor) so that begin_row, last_row, begin_col, and last_col are cheaply obtained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated to new approach.

@@ -147,10 +185,39 @@ class tdbBlockedMatrix : public MatrixBase {
tdbBlockedMatrix(
const tiledb::Context& ctx,
const std::string& uri,
size_t first_col,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the extensive comment above, first_col et al should be the same type as the domain -- we should probably have a global type alias for this. In tdbMatrix we use row_domain_type and col_domain_type (both of which are int32_t -- but we should not hard code that everywhere). We could just use those:

tdbBlockedMatrix(
       const tiledb::Context& ctx,
       const std::string& uri,
       std::optional<col_domain_type> first_row, // etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, created a new story for this as: https://app.shortcut.com/tiledb-inc/story/42696

@jparismorgan
Copy link
Contributor Author

jparismorgan commented Mar 7, 2024

There is a fundamental flaw in the way we have proposed not to have sentinel values for start and stop (et al). Basically, we need to be able to specify whether to ignore or use any of start / stop quantities when reading vectors or matrices -- and that isn't possible to do with overloading.

Yes. That is why I went with my original approach of using a flag before you requested: Instead of having a flag to indicate whether to read the entire array (i.e., use extents as dimensions to read into), use an overload that simply omits any specified limits. I will change back but use std::optional this time, agree that it's nicer than a single flag.

@jparismorgan
Copy link
Contributor Author

jparismorgan commented Mar 7, 2024

To make the next review easier I'm going to close this PR and re-open as a two PRs: one for vector and one for tdb_matrix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants