Treat zarr metadata as a blob (mostly) #749

Merged: 4 commits into main, Feb 19, 2025
Conversation

paraseba (Collaborator) commented on Feb 18, 2025:

We were parsing too much of the Zarr metadata. Icechunk is currently only interested in the array size and the chunk sizes; it may become interested in the dimension names at some point. But we were parsing the whole metadata, storing it internally as a parsed object, and then formatting it back to JSON.

We did this when the project started, imagining we might need more from the metadata. For example, we thought we might need to incorporate the codec pipeline into Icechunk.

With this patch, we extract only the parts of the Zarr metadata we care about, and we preserve the original metadata blob as-is in a new user_data byte array. We return this blob in metadata requests.

If, in the future, we need more from the metadata, we can parse it and add it to the storage.

The result is simpler and less code: it works with Zarr extensions, and it is more resilient to Zarr spec changes.

There is a price to this: we are no longer separating the user attributes from the rest of the metadata. The only impact is that we can no longer treat conflicts in user attributes separately from conflicts in the rest of the Zarr metadata.

If we consider this important in the short term, we can add it back by parsing more of the metadata blobs.
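
A minimal sketch of the idea; the names (ArrayShape, ArrayNode) and the dims layout are illustrative, not the actual Icechunk definitions. Only the fields Icechunk needs are parsed out of the JSON, and the full blob is kept verbatim so metadata requests can return it unchanged:

```rust
// Illustrative only: not the actual Icechunk types.

/// The only parts of the Zarr array metadata Icechunk parses out:
/// per-dimension array length and chunk length.
pub struct ArrayShape {
    /// (array length, chunk length) for each dimension.
    pub dims: Vec<(u64, u64)>,
}

/// An array node keeps the parsed shape plus the original metadata
/// bytes, untouched.
pub struct ArrayNode {
    pub shape: ArrayShape,
    /// The unparsed metadata blob, stored verbatim and returned as-is
    /// on metadata requests, so Zarr extensions and future spec fields
    /// survive the round trip without Icechunk understanding them.
    pub user_data: Vec<u8>,
}
```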

Also in this change:

  • No more AttributeFile. We'll implement it when we need it
  • Better snapshot serialization

Closes #391
Closes #690

[on-disk breaking change]

On the new DimensionShape struct:

    pub struct DimensionShape {
        array_length: u64,

Contributor:

Suggested change:

    -    array_length: u64,
    +    dim_length: u64,

I think this would be less confusing in the long run. array.size means "number of elements in the array" in numpy, so I was confused at first.
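
For orientation, a guess at the struct with the rename applied. Only the renamed field is visible in this diff; the chunk field is assumed from the PR description, which says Icechunk tracks array sizes and chunk sizes:

```rust
// Hypothetical: only dim_length (formerly array_length) appears in the
// visible diff; chunk_length is assumed from the PR description.
pub struct DimensionShape {
    /// Number of elements along this dimension (not the total number of
    /// elements in the array, which is what numpy's array.size means).
    dim_length: u64,
    /// Chunk extent along this dimension (assumed field).
    chunk_length: u64,
}
```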

On the ArrayData struct:

    -#[derive(Clone, Debug, PartialEq, Default, Serialize, Deserialize)]
    +#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
     pub struct ArrayData {

Contributor:

Wondering why not share this with NodeData::Array like before. But I guess this struct is so small that flatter is nicer.

paraseba (Collaborator, Author):

I thought it wasn't 100% the same? I think user_data is in different places?

@@ -44,8 +45,12 @@ impl ChangeSet {

             self.deleted_groups.iter()
         }

    -    pub fn user_attributes_updated_nodes(&self) -> impl Iterator<Item = &NodeId> {
    -        self.updated_attributes.keys()
    +    pub fn updated_arrays(&self) -> impl Iterator<Item = &NodeId> {

Contributor:

Is this really "nodes_with_updated_attributes"?

paraseba (Collaborator, Author):

no, it could be any part of the zarr metadata too

Contributor, on the method body:

    self.updated_arrays.keys().chain(self.updated_groups.keys())

Why are we chaining updated_groups here?

paraseba (Collaborator, Author):

🤦 great catch ...

It shows the quality of our conflict testing.

On updated_groups:

        pub fn updated_groups(&self) -> impl Iterator<Item = &NodeId> {
            self.updated_groups.keys().chain(self.updated_groups.keys())

Contributor:

Suggested change:

    -        self.updated_groups.keys().chain(self.updated_groups.keys())
    +        self.updated_groups.keys()
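
Taking both review fixes together, a self-contained sketch of how the two helpers presumably end up; the NodeId, ArrayData, and GroupData types below are stand-ins, not the real icechunk definitions:

```rust
use std::collections::HashMap;

// Stand-in types for illustration; the real ones live in icechunk.
type NodeId = u64;
struct ArrayData;
struct GroupData;

struct ChangeSet {
    updated_arrays: HashMap<NodeId, ArrayData>,
    updated_groups: HashMap<NodeId, GroupData>,
}

impl ChangeSet {
    /// Arrays whose Zarr metadata changed, any part of it, not only
    /// user attributes. No longer chained with the group keys.
    fn updated_arrays(&self) -> impl Iterator<Item = &NodeId> {
        self.updated_arrays.keys()
    }

    /// Groups whose metadata changed. The pre-review version chained
    /// this iterator with itself, yielding every key twice.
    fn updated_groups(&self) -> impl Iterator<Item = &NodeId> {
        self.updated_groups.keys()
    }
}
```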

@@ -3840,6 +3665,7 @@ mod tests {

        let err = ds2.rebase(&solver).await.unwrap_err();

        dbg!(&err);

Contributor:

Suggested change:

    -    dbg!(&err);

On the Python conflict-type stubs:

    """A user attribute update conflicts with an existing user attribute update"""
    UserAttributesUpdateOfDeletedNode: int
    """A user attribute update is attempted on a deleted node"""
    ZarrMetadataUpdateOfDeletedGroup: int

Contributor:

These should all be tuple[int], but why that choice?

paraseba (Collaborator, Author):

I have no idea actually ... @mpiannucci maybe?

dcherian (Contributor) left a review:

I looked at change_set and snapshot most closely. I really like this simplification. Just some minor comments. A couple of helper methods in ChangeSet look a bit wonky (I left a comment).

paraseba merged commit 0458b9f into main on Feb 19, 2025. 5 checks passed.
paraseba deleted the push-rwuxxtyswwyr branch on February 19, 2025, 23:47.
Successfully merging this pull request may close these issues:

  • Parsing of zarr array metadata limits icechunk usability
  • Support complex fill_value with NaNs and Infs