Treat zarr metadata as a blob (mostly) #749

Merged: 4 commits into main, Feb 19, 2025
Conversation

paraseba (Collaborator) commented on Feb 18, 2025:

We were parsing too much of the Zarr metadata. Icechunk is currently only interested in the array size and the chunk sizes; it may become interested in the dimension names at some point. But we were parsing the whole metadata, storing it internally as a parsed object, and then formatting it back to JSON.

We did this when the project started, imagining we might need more from the metadata. For example, we thought we might need to incorporate the codec pipeline into Icechunk.

With this patch, we extract only the parts of the Zarr metadata we care about, and we preserve the original metadata blob as-is in a new user_data byte array. We return this blob in metadata requests.

If, in the future, we need more from the metadata, we can parse it and add it to the storage.

The result is simpler and less code: it works with Zarr extensions, and it is more resilient to Zarr spec changes.

There is a price to this: we are no longer separating the user attributes from the rest of the metadata. The only impact is that we can no longer treat conflicts in user attributes separately from conflicts in the rest of the Zarr metadata.

If we consider this important in the short term, we can add it back by parsing more of the metadata blobs.
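
A minimal sketch of the idea; the names (ArrayShape, ArrayNode) and the dims layout are illustrative, not the actual Icechunk definitions. Only the fields Icechunk needs are parsed out of the JSON, and the full blob is kept verbatim so metadata requests can return it unchanged:

```rust
// Illustrative only: not the actual Icechunk types.

/// The only parts of the Zarr array metadata Icechunk parses out:
/// per-dimension array length and chunk length.
pub struct ArrayShape {
    /// (array length, chunk length) for each dimension.
    pub dims: Vec<(u64, u64)>,
}

/// An array node keeps the parsed shape plus the original metadata
/// bytes, untouched.
pub struct ArrayNode {
    pub shape: ArrayShape,
    /// The unparsed metadata blob, stored verbatim and returned as-is
    /// on metadata requests, so Zarr extensions and future spec fields
    /// survive the round trip without Icechunk understanding them.
    pub user_data: Vec<u8>,
}
```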

Also in this change:

  • No more AttributeFile. We'll implement it when we need it
  • Better snapshot serialization

Closes #391
Closes #690

[on-disk breaking change]

On the new DimensionShape struct:

    pub struct DimensionShape {
        array_length: u64,

Contributor:

Suggested change:

    -    array_length: u64,
    +    dim_length: u64,

I think this would be less confusing in the long run. array.size means "number of elements in the array" in numpy, so I was confused at first.
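
For orientation, a guess at the struct with the rename applied. Only the renamed field is visible in this diff; the chunk field is assumed from the PR description, which says Icechunk tracks array sizes and chunk sizes:

```rust
// Hypothetical: only dim_length (formerly array_length) appears in the
// visible diff; chunk_length is assumed from the PR description.
pub struct DimensionShape {
    /// Number of elements along this dimension (not the total number of
    /// elements in the array, which is what numpy's array.size means).
    dim_length: u64,
    /// Chunk extent along this dimension (assumed field).
    chunk_length: u64,
}
```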

On the ArrayData struct:

    -#[derive(Clone, Debug, PartialEq, Default, Serialize, Deserialize)]
    +#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
     pub struct ArrayData {

Contributor:

Wondering why not share this with NodeData::Array like before. But I guess this struct is so small that flatter is nicer.

paraseba (Collaborator, Author):

I thought it wasn't 100% the same? I think user_data is in different places?

@@ -44,8 +45,12 @@ impl ChangeSet {

             self.deleted_groups.iter()
         }

    -    pub fn user_attributes_updated_nodes(&self) -> impl Iterator<Item = &NodeId> {
    -        self.updated_attributes.keys()
    +    pub fn updated_arrays(&self) -> impl Iterator<Item = &NodeId> {

Contributor:

Is this really "nodes_with_updated_attributes"?

paraseba (Collaborator, Author):

no, it could be any part of the zarr metadata too

Contributor, on the method body:

    self.updated_arrays.keys().chain(self.updated_groups.keys())

Why are we chaining updated_groups here?

paraseba (Collaborator, Author):

🤦 great catch ...

It shows the quality of our conflict testing.

On updated_groups:

        pub fn updated_groups(&self) -> impl Iterator<Item = &NodeId> {
            self.updated_groups.keys().chain(self.updated_groups.keys())

Contributor:

Suggested change:

    -        self.updated_groups.keys().chain(self.updated_groups.keys())
    +        self.updated_groups.keys()
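
Taking both review fixes together, a self-contained sketch of how the two helpers presumably end up; the NodeId, ArrayData, and GroupData types below are stand-ins, not the real icechunk definitions:

```rust
use std::collections::HashMap;

// Stand-in types for illustration; the real ones live in icechunk.
type NodeId = u64;
struct ArrayData;
struct GroupData;

struct ChangeSet {
    updated_arrays: HashMap<NodeId, ArrayData>,
    updated_groups: HashMap<NodeId, GroupData>,
}

impl ChangeSet {
    /// Arrays whose Zarr metadata changed, any part of it, not only
    /// user attributes. No longer chained with the group keys.
    fn updated_arrays(&self) -> impl Iterator<Item = &NodeId> {
        self.updated_arrays.keys()
    }

    /// Groups whose metadata changed. The pre-review version chained
    /// this iterator with itself, yielding every key twice.
    fn updated_groups(&self) -> impl Iterator<Item = &NodeId> {
        self.updated_groups.keys()
    }
}
```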

@@ -3840,6 +3665,7 @@ mod tests {

        let err = ds2.rebase(&solver).await.unwrap_err();

        dbg!(&err);

Contributor:

Suggested change:

    -    dbg!(&err);

On the Python conflict-type stubs:

    """A user attribute update conflicts with an existing user attribute update"""
    UserAttributesUpdateOfDeletedNode: int
    """A user attribute update is attempted on a deleted node"""
    ZarrMetadataUpdateOfDeletedGroup: int

Contributor:

These should all be tuple[int], but why that choice?

paraseba (Collaborator, Author):

I have no idea actually ... @mpiannucci maybe?

dcherian (Contributor) left a review:

I looked at change_set and snapshot most closely. I really like this simplification. Just some minor comments. A couple of helper methods in ChangeSet look a bit wonky (I left a comment).

paraseba merged commit 0458b9f into main on Feb 19, 2025. 5 checks passed.
paraseba deleted the push-rwuxxtyswwyr branch on February 19, 2025, 23:47.
Successfully merging this pull request may close these issues:

  • Parsing of zarr array metadata limits icechunk usability
  • Support complex fill_value with NaNs and Infs