Treat zarr metadata as a blob (mostly) #749
Conversation
We were parsing too much of the Zarr metadata. Icechunk is currently only interested in the array size and the chunk sizes; it may become interested in the dimension names at some point. Still, we were parsing the whole metadata, storing it internally as a parsed object, and then formatting it back to JSON. We did this when the project started, imagining we might need more from the metadata; for example, we thought we might need to incorporate the codec pipeline into Icechunk.

With this patch, we only extract the parts of the Zarr metadata we care about, and we preserve the original metadata blob as is, in a new `user_data` byte array. We return this blob in metadata requests (see the sketch after this description). If, in the future, we need more from the metadata, we can parse it and add it to the storage.

This is simpler and less code. It works with Zarr extensions, and it's more resilient to Zarr spec changes.

There is a price: we no longer separate the user attributes from the rest of the metadata. The only impact is that we can no longer treat conflicts in user attributes separately from conflicts in the rest of the Zarr metadata. If we consider this important in the short term, we can add it back by parsing more of the metadata blobs.

Also in this change:

- No more AttributeFile. We'll implement it when we need it.
- Better snapshot serialization [on-disk breaking change].
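A rough sketch of the approach, for illustration only (the names `ArrayShape`, `ArrayNode`, and `extract_array_shape` are assumptions, not the actual Icechunk API): parse just the shape and chunk grid out of a Zarr v3 `zarr.json`, and keep the whole document as an opaque blob.

```rust
use bytes::Bytes;
use serde_json::Value;

/// The few pieces Icechunk actually needs from the array metadata.
/// (Hypothetical types; the real ones in `snapshot.rs` differ.)
pub struct ArrayShape {
    pub dim_lengths: Vec<u64>,
    pub chunk_lengths: Vec<u64>,
}

pub struct ArrayNode {
    pub shape: ArrayShape,
    /// The original `zarr.json`, preserved byte for byte.
    pub user_data: Bytes,
}

/// Parse only what we care about; keep the rest opaque.
pub fn extract_array_shape(zarr_json: &Bytes) -> Option<ArrayNode> {
    let doc: Value = serde_json::from_slice(zarr_json).ok()?;
    let dims = doc
        .get("shape")?
        .as_array()?
        .iter()
        .map(Value::as_u64)
        .collect::<Option<Vec<_>>>()?;
    let chunks = doc
        .get("chunk_grid")?
        .pointer("/configuration/chunk_shape")?
        .as_array()?
        .iter()
        .map(Value::as_u64)
        .collect::<Option<Vec<_>>>()?;
    Some(ArrayNode {
        shape: ArrayShape { dim_lengths: dims, chunk_lengths: chunks },
        // Metadata requests return this blob verbatim.
        user_data: zarr_json.clone(),
    })
}
```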
(force-pushed from c89e0dc to 06a66df)
icechunk/src/format/snapshot.rs (outdated):

    Group,
    Array,
    pub struct DimensionShape {
        array_length: u64,
Suggested change:

    -    array_length: u64,
    +    dim_length: u64,
I think this would be less confusing in the long run. `array.size` means "number of elements in the array" in numpy, so I was confused at first.
        session::SessionResult,
    };

    -#[derive(Clone, Debug, PartialEq, Default, Serialize, Deserialize)]
    +#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
     pub struct ArrayData {
Wondering why not share this with `NodeData::Array` like before. But I guess this struct is so small that flatter is nicer.
I thought it wasn't 100% the same? I think `user_data` is in different places?
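If that's the distinction, the mismatch might look roughly like this; a purely illustrative guess, not the actual definitions:

```rust
use bytes::Bytes;

pub struct ArrayShape(pub Vec<u64>); // stand-in for the real shape type

// Guessed illustration: in the enum, the metadata blob would live on the
// surrounding node snapshot, while the flat struct carries it inline.
pub enum NodeData {
    Group,
    Array(ArrayShape), // user_data stored on the enclosing node
}

pub struct ArrayData {
    pub shape: ArrayShape,
    pub user_data: Bytes, // user_data stored inside the struct itself
}
```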
    @@ -44,8 +45,12 @@ impl ChangeSet {
         self.deleted_groups.iter()
     }

    -    pub fn user_attributes_updated_nodes(&self) -> impl Iterator<Item = &NodeId> {
    -        self.updated_attributes.keys()
    +    pub fn updated_arrays(&self) -> impl Iterator<Item = &NodeId> {
Is this really `nodes_with_updated_attributes`?
No, it could be any part of the Zarr metadata too.
    self.updated_arrays.keys().chain(self.updated_groups.keys())

Why are we chaining `updated_groups` here?
🤦 great catch ...
It shows the quality of our conflict testing
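For the record, the fix discussed here would be to drop the chain; a sketch assuming `updated_arrays` is a map keyed by `NodeId`:

```rust
// Arrays only; updated groups get their own accessor.
pub fn updated_arrays(&self) -> impl Iterator<Item = &NodeId> {
    self.updated_arrays.keys()
}
```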
icechunk/src/change_set.rs (outdated):

    }

    pub fn updated_groups(&self) -> impl Iterator<Item = &NodeId> {
        self.updated_groups.keys().chain(self.updated_groups.keys())
Suggested change:

    -    self.updated_groups.keys().chain(self.updated_groups.keys())
    +    self.updated_groups.keys()
icechunk/src/session.rs (outdated):

    @@ -3840,6 +3665,7 @@ mod tests {

             let err = ds2.rebase(&solver).await.unwrap_err();

    +        dbg!(&err);
Suggested change (delete the line):

    -        dbg!(&err);
        A user attribute update conflicts with an existing user attribute update
    UserAttributesUpdateOfDeletedNode: int
        A user attribute update is attempted on a deleted node
    ZarrMetadataUpdateOfDeletedGroup: int
These should all be `tuple[int]`... but why that choice?
I have no idea actually ... @mpiannucci maybe?
I looked at `change_set` and `snapshot` most closely. I really like this simplification; just some minor comments. A couple of helper methods in `ChangeSet` look a bit wonky (I left a comment).
(force-pushed from 55b125c to 4259068)
Closes #391
Closes #690