-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for empty and matrices and vectors #254
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main comment -- Instead of having a flag to indicate whether to read the entire array (i.e., use extents as dimensions to read into), use an overload that simply omits any specified limits.
Using special values of parameters to indicate different cases is probably not a good practice in general, unless the special value is absolutely, positively not a valid value (such as a nullptr
). Mea culpa.
@@ -36,15 +36,15 @@ def load_as_matrix( | |||
a = tiledb.ArraySchema.load(path, ctx=tiledb.Ctx(config)) | |||
dtype = a.attr(0).dtype | |||
if dtype == np.float32: | |||
m = tdbColMajorMatrix_f32(ctx, path, 0, 0, 0, size, 0, timestamp) | |||
m = tdbColMajorMatrix_f32(ctx, path, 0, 0, 0, size, 0, timestamp, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this interface to tdbMatrix
. I think what I would rather see (and I think have done elsewhere for something else) is to use an overload so that if we want to read the whole thing, we don't pass in a limit at all.
(Using special values of variables can be bad if the special value is actually a valid value.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thanks, updated the constructors so that callers can use the appropriate one without this flag.
@@ -130,7 +130,7 @@ class tdbBlockedMatrix : public MatrixBase { | |||
*/ | |||
tdbBlockedMatrix(const tiledb::Context& ctx, const std::string& uri) noexcept | |||
requires(std::is_same_v<LayoutPolicy, stdx::layout_left>) | |||
: tdbBlockedMatrix(ctx, uri, 0, 0, 0, 0, 0, 0) { | |||
: tdbBlockedMatrix(ctx, uri, 0, 0, 0, 0, 0, 0, true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, we should not use explicit flag, but just omit all of the view parameters.
And, if there are issues related to ambiguity because of the suggested overloading (I don't think there will be), we should use a class that is a start-stop pair, rather than separate integers. This is something I have had in mind to do anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks, done. Will keep that in mind as well.
Note that the Windows build is failing but I will fix before landing. Just opening up this PR again for review so that I can catch @lums658 today while he's still online. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a fundamental flaw in the way we have proposed not to have sentinel values for start
and stop
(et al). Basically, we need to be able to specify whether to ignore or use any of start
/ stop
quantities when reading vectors or matrices -- and that isn't possible to do with overloading.
There is also the issue of having to open and close an array to get its dimensions, before opening the array (again getting its schema) and getting its data.
The only solution I can see to this is to use std::optional
to indicate whether or not to use the start
/ stop
value, or whether to use the begin / end of the array. (Presence or absence of a value is what std::optional
is for, after all.). This approach also expresses intent of the caller -- using {}
instead of a value says "no value" -- i.e., in this case, use the appropriate limit.
Another important change, in conjunction with using std::optional
is to use the same type as the domain, since we are representing values in that domain. tdbMatrix
has some type aliases that can be used for this, or we could define something globally.
These changes should simplify quite a few things.
PS -- Should we distinguish "empty" from "uninitialized"? Meaning empty is start
== stop
, whereas in the use case that started all of this, we do have start
== stop
, but it seems like there may be more to it being a newly-created array than that.
if size == 0: | ||
# Read all rows and all cols. Set no upper_bound. | ||
if dtype == np.float32: | ||
m = tdbColMajorMatrix_f32(ctx, path, 0, timestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point (not now), we might want to have a constructor that doesn't take and upper bound when we want to "read the whole thing". Having the special value seems clumsy now. OTOH, I don't think there would be the case where zero would get passed in as a valid value and good enough is probably good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let's leave it then and change if we see the need to.
auto array_rows_{domain_.dimension(0)}; | ||
|
||
if (read_full_vector) { | ||
if (start_pos == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thinking about this, just like stop_pos == 0
should have no special meaning as a flag, stop_pos == 0
should have no special meaning (0 should not be a sentinel). That is, we might want to read a "half-full" vector -- starting at a given place and reading to the end or starting at the beginning and reading to a given place -- and the "given place" in either case could be zero, which would be the wrong thing since right now it is either both or neither. We can do that currently (though there is a bug -- if you want to actually start at zero but the array doesn't start at zero). So we shouldn't be using a sentinel value for either start
or stop
Unfortunately, we can't determine which half-full case we want with an overload that has one of start
or stop
omitted.
There are four cases:
- array begin to array end --
read_full_vector == true
in the helper -- at which pointstart
andstop
are sentinel values and "end" is indicating by the overloaded function passing the sentinel values to the helper function. - array begin to
stop
-- not supported in current PR (previously was) start
to array end -- not supported in current PR (previously was, though a bug forstart == 0
)start
tostop
-- in the current PR, 0 is a valid value for eitherstart
orstop
where zero is a valid value forstart
orstop
and array begin or array end (or both) may be zero or may not be zero. Something like Python's range notation would be really helpful....
While we are waiting for C++ to introduce that notation, we need a way to specify "array begin" or "array end" for either start
or stop
-- without using a sentinel. Or at least not using a sentinel that is a value likely to actually be used.
- One approach would be to use a sentinel that is unlikely to ever be used, like -MAXINT and MAXINT. Actually, in that case, collisions would be benign -- meaning "read from the beginning" and "read from this value" are the same.
- Or, we could use
std::optional
-- an empty value would mean from the beginning/end forstart
/stop
. The notation from the caller side is kind of indicative of what is desired.
template <class T> std::vector<T> read_vector(const tiledb::Context& ctx, const std::string& uri,
std::optional<int32_t> start_pos, std::optional<int32_t> end_pos, std::optional<size_t> timestamp);
auto foo = read_vector(ctx, "a.tdb", 0, {}, {});
auto bar = read_vector(ctx, "b.tdb", {}, {}, {}); // Could still have overload: read_vector(ctx, "b.tdb", {});
(Note that I am also suggesting we use optional for timestamp, though maybe default value of zero really is what is intended.)
Note also, and very important, start
and stop
should be int32_t
since that is what the vectors and matrix types assume (we should use an appropriate type alias so that they are all consistent).
The constructors for tdbMatrix
(et al) have the same issues (but twice as many variables for factorially more combinations).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree there is certainly more we can improve here, but I would propose to go with the current approach because it is working and is simple. If we find scenarios where it doesn't support our needs, then we can update it then.
@@ -114,6 +114,26 @@ class tdbBlockedMatrix : public MatrixBase { | |||
// size_t pending_row_offset{0}; | |||
// size_t pending_col_offset{0}; | |||
|
|||
static size_t get_last_row( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I don't think we want to open and close the array to get this info, only to open the array again.
However, I think this is a moot point, per the extensive comment above, as I think we should move everything into one constructor and use std::optional
to indicate whether or not we want begin/end of the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, updated to new approach.
*/ | ||
tdbBlockedMatrix(const tiledb::Context& ctx, const std::string& uri) noexcept | ||
tdbBlockedMatrix( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the extensive comment above, I think the correct interface is to have just a single constructor with std::optional
for start_row
, et al. This will simplify things a lot (code-wise and conceptually) and also allow the array to just be opened once in order to access its schema (it is opened in the initializer of the general constructor) so that begin_row
, last_row
, begin_col
, and last_col
are cheaply obtained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, updated to new approach.
@@ -147,10 +185,39 @@ class tdbBlockedMatrix : public MatrixBase { | |||
tdbBlockedMatrix( | |||
const tiledb::Context& ctx, | |||
const std::string& uri, | |||
size_t first_col, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the extensive comment above, first_col
et al should be the same type as the domain -- we should probably have a global type alias for this. In tdbMatrix
we use row_domain_type
and col_domain_type
(both of which are int32_t
-- but we should not hard code that everywhere). We could just use those:
tdbBlockedMatrix(
const tiledb::Context& ctx,
const std::string& uri,
std::optional<col_domain_type> first_row, // etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, created a new story for this as: https://app.shortcut.com/tiledb-inc/story/42696
Yes. That is why I went with my original approach of using a flag before you requested: |
To make the next review easier I'm going to close this PR and re-open as a two PRs: one for vector and one for |
What
When we add type-erased APIs we will have a flow where we need to create an empty index:
When we do that we need to be able to write empty arrays and vectors. This writing of empty arrays and vectors works, but when we try to read them back we use code like this for both vectors and matrices to fill in the returned data:
Because of this we then try to run the query:
But we crash because
data_.data()
is actually empty and so there is no where to write the data to.The fix that this PR makes is change the default behavior to not update the start and end positions and to return early if there is nothing to read. Ideally we would get to a point where we never need the default behavior and the caller is responsible for correctly asking for how much data to read. This is based off of what Python does where it has the caller manage how much to read, versus the reader automatically reading all it can. Though please do let me know if you have a different suggestion on how to fix this.
Testing