Skip to content

Commit

Permalink
Treat zarr metadata as a blob (mostly) (#749)
Browse files Browse the repository at this point in the history
* Treat zarr metadata as a blob (mostly)

We were parsing too much of Zarr metadata. Icechunk currently is only
interested in the array size and chunk sizes. It may become interested
in the dimension names at some point. But still, we were parsing the
whole metadata, storing internally as parsed object and then formatting
it back to json.

We did this when the project started, imagining we may need more from
the metadata. For example, we thought we could need to incorporate the
codec pipeline in Icechunk.

With this patch, we now only extract the parts of the zarr metadata we
care about. And we preserve the original blob of metadata as is, in a
new user_data byte array. We return this blob in metadata requests.

If, in the future, we need more from the metadata, we can parse it and add
it to the storage.

Simpler and less code. It works with zarr extensions, it's more
resilient to zarr spec changes.

There is a price to this: we are no longer separating the user
attributes from the rest of the metadata. The only impact of this, is we
no longe can treat conflicts in user attributes separate from the rest
of the zarr metadata.

If we consider this important in the short term, we can add it back by
parsing more of the metadata blobs.

Also in this change:

- No more AttributeFile. We'll implement it when we need it
- Better snapshot serialization

[on-disk breaking change]

* Enable complex arrays in tests

* fix xarray test

---------

Co-authored-by: Deepak Cherian <deepak@earthmover.io>
  • Loading branch information
paraseba and dcherian authored Feb 19, 2025
1 parent cd5dac2 commit 0458b9f
Show file tree
Hide file tree
Showing 64 changed files with 1,728 additions and 3,226 deletions.
1 change: 1 addition & 0 deletions .github/workflows/python-check.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -163,5 +163,6 @@ jobs:
python3 -m venv .venv
source .venv/bin/activate
pip install icechunk['test'] --find-links dist --force-reinstall
pip install pytest-mypy-plugins
# pass xarray's pyproject.toml so that pytest can find the `flaky` fixture
pytest -c=../../xarray/pyproject.toml -W ignore tests/run_xarray_backends_tests.py
8 changes: 4 additions & 4 deletions docs/docs/icechunk-python/version-control.md
Original file line number Diff line number Diff line change
Expand Up @@ -327,21 +327,21 @@ session2.rebase(icechunk.ConflictDetector())
# RebaseFailedError: Rebase failed on snapshot AE9XS2ZWXT861KD2JGHG: 1 conflicts found
```

This however fails because both sessions modified the `foo` attribute on the root group. We can use the `ConflictError` to get more information about the conflict.
This however fails because both sessions modified metadata. We can use the `ConflictError` to get more information about the conflict.

```python
try:
session2.rebase(icechunk.ConflictDetector())
except icechunk.RebaseFailedError as e:
print(e.conflicts)

# [Conflict(UserAttributesDoubleUpdate, path=/)]
# [Conflict(ZarrMetadataDoubleUpdate, path=/)]
```

This tells us that the conflict is caused by the two sessions modifying the user attributes of the root group (`/`). In this casewe have decided that second session set the `foo` attribute to the correct value, so we can now try to rebase by instructing the `rebase` method to use the second session's changes with the [`BasicConflictSolver`](../reference/#icechunk.BasicConflictSolver).
This tells us that the conflict is caused by the two sessions modifying the metadata attributes of the root group (`/`). In this case we have decided that second session set the `foo` attribute to the correct value, so we can now try to rebase by instructing the `rebase` method to use the second session's changes with the [`BasicConflictSolver`](../reference/#icechunk.BasicConflictSolver).

```python
session2.rebase(icechunk.BasicConflictSolver(on_user_attributes_conflict=icechunk.VersionSelection.UseOurs))
session2.rebase(icechunk.BasicConflictSolver())
```

Success! We can now try and commit the changes again.
Expand Down
47 changes: 20 additions & 27 deletions icechunk-python/python/icechunk/_icechunk_python.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -1393,7 +1393,6 @@ class BasicConflictSolver(ConflictSolver):
This conflict solver allows for simple configuration of resolution behavior for conflicts that may occur during a rebase operation.
It will attempt to resolve a limited set of conflicts based on the configuration options provided.
- When a user attribute conflict is encountered, the behavior is determined by the `on_user_attributes_conflict` option
- When a chunk conflict is encountered, the behavior is determined by the `on_chunk_conflict` option
- When an array is deleted that has been updated, `fail_on_delete_of_updated_array` will determine whether to fail the rebase operation
- When a group is deleted that has been updated, `fail_on_delete_of_updated_group` will determine whether to fail the rebase operation
Expand All @@ -1402,7 +1401,6 @@ class BasicConflictSolver(ConflictSolver):
def __init__(
self,
*,
on_user_attributes_conflict: VersionSelection = VersionSelection.UseOurs,
on_chunk_conflict: VersionSelection = VersionSelection.UseOurs,
fail_on_delete_of_updated_array: bool = False,
fail_on_delete_of_updated_group: bool = False,
Expand All @@ -1411,8 +1409,6 @@ class BasicConflictSolver(ConflictSolver):
Parameters
----------
on_user_attributes_conflict: VersionSelection
The behavior to use when a user attribute conflict is encountered, by default VersionSelection.use_ours()
on_chunk_conflict: VersionSelection
The behavior to use when a chunk conflict is encountered, by default VersionSelection.use_theirs()
fail_on_delete_of_updated_array: bool
Expand Down Expand Up @@ -1476,39 +1472,36 @@ class ConflictType(Enum):
Attributes:
NewNodeConflictsWithExistingNode: int
A new node conflicts with an existing node
NewNodeInInvalidGroup: int
NewNodeInInvalidGroup: tuple[int]
A new node is in an invalid group
ZarrMetadataDoubleUpdate: int
ZarrMetadataDoubleUpdate: tuple[int]
A zarr metadata update conflicts with an existing zarr metadata update
ZarrMetadataUpdateOfDeletedArray: int
ZarrMetadataUpdateOfDeletedArray: tuple[int]
A zarr metadata update is attempted on a deleted array
UserAttributesDoubleUpdate: int
A user attribute update conflicts with an existing user attribute update
UserAttributesUpdateOfDeletedNode: int
A user attribute update is attempted on a deleted node
ChunkDoubleUpdate: int
ZarrMetadataUpdateOfDeletedGroup: tuple[int]
A zarr metadata update is attempted on a deleted group
ChunkDoubleUpdate: tuple[int]
A chunk update conflicts with an existing chunk update
ChunksUpdatedInDeletedArray: int
ChunksUpdatedInDeletedArray: tuple[int]
Chunks are updated in a deleted array
ChunksUpdatedInUpdatedArray: int
ChunksUpdatedInUpdatedArray: tuple[int]
Chunks are updated in an updated array
DeleteOfUpdatedArray: int
DeleteOfUpdatedArray: tuple[int]
A delete is attempted on an updated array
DeleteOfUpdatedGroup: int
DeleteOfUpdatedGroup: tuple[int]
A delete is attempted on an updated group
"""

NewNodeConflictsWithExistingNode = 1
NewNodeInInvalidGroup = 2
ZarrMetadataDoubleUpdate = 3
ZarrMetadataUpdateOfDeletedArray = 4
UserAttributesDoubleUpdate = 5
UserAttributesUpdateOfDeletedNode = 6
ChunkDoubleUpdate = 7
ChunksUpdatedInDeletedArray = 8
ChunksUpdatedInUpdatedArray = 9
DeleteOfUpdatedArray = 10
DeleteOfUpdatedGroup = 11
NewNodeConflictsWithExistingNode = (1,)
NewNodeInInvalidGroup = (2,)
ZarrMetadataDoubleUpdate = (3,)
ZarrMetadataUpdateOfDeletedArray = (4,)
ZarrMetadataUpdateOfDeletedGroup = (5,)
ChunkDoubleUpdate = (6,)
ChunksUpdatedInDeletedArray = (7,)
ChunksUpdatedInUpdatedArray = (8,)
DeleteOfUpdatedArray = (9,)
DeleteOfUpdatedGroup = (10,)

class Conflict:
"""A conflict detected between snapshots"""
Expand Down
33 changes: 12 additions & 21 deletions icechunk-python/src/conflicts.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,13 +14,12 @@ pub enum PyConflictType {
NewNodeInInvalidGroup = 2,
ZarrMetadataDoubleUpdate = 3,
ZarrMetadataUpdateOfDeletedArray = 4,
UserAttributesDoubleUpdate = 5,
UserAttributesUpdateOfDeletedNode = 6,
ChunkDoubleUpdate = 7,
ChunksUpdatedInDeletedArray = 8,
ChunksUpdatedInUpdatedArray = 9,
DeleteOfUpdatedArray = 10,
DeleteOfUpdatedGroup = 11,
ZarrMetadataUpdateOfDeletedGroup = 5,
ChunkDoubleUpdate = 6,
ChunksUpdatedInDeletedArray = 7,
ChunksUpdatedInUpdatedArray = 8,
DeleteOfUpdatedArray = 9,
DeleteOfUpdatedGroup = 10,
}

impl Display for PyConflictType {
Expand All @@ -31,13 +30,12 @@ impl Display for PyConflictType {
}
PyConflictType::NewNodeInInvalidGroup => "New node in invalid group",
PyConflictType::ZarrMetadataDoubleUpdate => "Zarr metadata double update",
PyConflictType::ZarrMetadataUpdateOfDeletedGroup => {
"Zarr metadata update of deleted group"
}
PyConflictType::ZarrMetadataUpdateOfDeletedArray => {
"Zarr metadata update of deleted array"
}
PyConflictType::UserAttributesDoubleUpdate => "User attributes double update",
PyConflictType::UserAttributesUpdateOfDeletedNode => {
"User attributes update of deleted node"
}
PyConflictType::ChunkDoubleUpdate => "Chunk double update",
PyConflictType::ChunksUpdatedInDeletedArray => {
"Chunks updated in deleted array"
Expand Down Expand Up @@ -108,13 +106,8 @@ impl From<&Conflict> for PyConflict {
path: path.to_string(),
conflicted_chunks: None,
},
Conflict::UserAttributesDoubleUpdate { path, node_id: _ } => PyConflict {
conflict_type: PyConflictType::UserAttributesDoubleUpdate,
path: path.to_string(),
conflicted_chunks: None,
},
Conflict::UserAttributesUpdateOfDeletedNode(path) => PyConflict {
conflict_type: PyConflictType::UserAttributesUpdateOfDeletedNode,
Conflict::ZarrMetadataUpdateOfDeletedGroup(path) => PyConflict {
conflict_type: PyConflictType::ZarrMetadataUpdateOfDeletedGroup,
path: path.to_string(),
conflicted_chunks: None,
},
Expand Down Expand Up @@ -188,17 +181,15 @@ pub struct PyBasicConflictSolver;
#[pymethods]
impl PyBasicConflictSolver {
#[new]
#[pyo3(signature = (*, on_user_attributes_conflict=PyVersionSelection::UseOurs, on_chunk_conflict=PyVersionSelection::UseOurs, fail_on_delete_of_updated_array = false, fail_on_delete_of_updated_group = false))]
#[pyo3(signature = (*, on_chunk_conflict=PyVersionSelection::UseOurs, fail_on_delete_of_updated_array = false, fail_on_delete_of_updated_group = false))]
fn new(
on_user_attributes_conflict: PyVersionSelection,
on_chunk_conflict: PyVersionSelection,
fail_on_delete_of_updated_array: bool,
fail_on_delete_of_updated_group: bool,
) -> (Self, PyConflictSolver) {
(
Self,
PyConflictSolver(Arc::new(BasicConflictSolver {
on_user_attributes_conflict: on_user_attributes_conflict.into(),
on_chunk_conflict: on_chunk_conflict.into(),
fail_on_delete_of_updated_array,
fail_on_delete_of_updated_group,
Expand Down
34 changes: 14 additions & 20 deletions icechunk-python/src/repository.rs
Original file line number Diff line number Diff line change
Expand Up @@ -195,9 +195,9 @@ pub struct PyDiff {
#[pyo3(get)]
pub deleted_arrays: BTreeSet<String>,
#[pyo3(get)]
pub updated_user_attributes: BTreeSet<String>,
pub updated_groups: BTreeSet<String>,
#[pyo3(get)]
pub updated_zarr_metadata: BTreeSet<String>,
pub updated_arrays: BTreeSet<String>,
#[pyo3(get)]
// A Vec instead of a set to avoid issues with list not being hashable in python
pub updated_chunks: BTreeMap<String, Vec<Vec<u32>>>,
Expand All @@ -213,16 +213,10 @@ impl From<Diff> for PyDiff {
value.deleted_groups.into_iter().map(|path| path.to_string()).collect();
let deleted_arrays =
value.deleted_arrays.into_iter().map(|path| path.to_string()).collect();
let updated_user_attributes = value
.updated_user_attributes
.into_iter()
.map(|path| path.to_string())
.collect();
let updated_zarr_metadata = value
.updated_zarr_metadata
.into_iter()
.map(|path| path.to_string())
.collect();
let updated_groups =
value.updated_groups.into_iter().map(|path| path.to_string()).collect();
let updated_arrays =
value.updated_arrays.into_iter().map(|path| path.to_string()).collect();
let updated_chunks = value
.updated_chunks
.into_iter()
Expand All @@ -238,8 +232,8 @@ impl From<Diff> for PyDiff {
new_arrays,
deleted_groups,
deleted_arrays,
updated_user_attributes,
updated_zarr_metadata,
updated_groups,
updated_arrays,
updated_chunks,
}
}
Expand All @@ -266,17 +260,17 @@ impl PyDiff {
res.push('\n');
}

if !self.updated_zarr_metadata.is_empty() {
res.push_str("Zarr metadata updated:\n");
for g in self.updated_zarr_metadata.iter() {
if !self.updated_groups.is_empty() {
res.push_str("Group definitions updated:\n");
for g in self.updated_groups.iter() {
writeln!(res, " {}", g).unwrap();
}
res.push('\n');
}

if !self.updated_user_attributes.is_empty() {
res.push_str("User attributes updated:\n");
for g in self.updated_user_attributes.iter() {
if !self.updated_arrays.is_empty() {
res.push_str("Array definitions updated:\n");
for g in self.updated_arrays.iter() {
writeln!(res, " {}", g).unwrap();
}
res.push('\n');
Expand Down
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
24 changes: 12 additions & 12 deletions icechunk-python/tests/data/test-repo/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,6 @@ compression: null
caching: null
storage: null
virtual_chunk_containers:
gcs:
name: gcs
url_prefix: gcs
store: !gcs {}
az:
name: az
url_prefix: az
store: !azure {}
tigris:
name: tigris
url_prefix: tigris
Expand All @@ -21,6 +13,18 @@ virtual_chunk_containers:
endpoint_url: https://fly.storage.tigris.dev
anonymous: false
allow_http: false
file:
name: file
url_prefix: file
store: !local_file_system ''
gcs:
name: gcs
url_prefix: gcs
store: !gcs {}
az:
name: az
url_prefix: az
store: !azure {}
s3:
name: s3
url_prefix: s3://
Expand All @@ -29,8 +33,4 @@ virtual_chunk_containers:
endpoint_url: http://localhost:9000
anonymous: false
allow_http: true
file:
name: file
url_prefix: file
store: !local_file_system ''
manifest: null
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"A2RD2Y65PR6D3B6BR1K0"}
{"snapshot":"HNG82GMS51ECXFXFCYJG"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"K1BMYVG1HNVTNV1FSBH0"}
{"snapshot":"GNFK0SSWD5B8CVA53XEG"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"RPA0WQCNM2N9HBBRHJQ0"}
{"snapshot":"3EKE17N8YF5ZK5NRMZJ0"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"6Q9GDTXKF17BGQVSQZFG"}
{"snapshot":"R7F1RJHPZ428N4AK19K0"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"949AXZ49X764TMDC6D4G"}
{"snapshot":"TNE0TX645A2G7VTXFA1G"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"SNF98D1SK7NWD5KQJM20"}
{"snapshot":"394QWZDXAY74HP6Q8P3G"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"A2RD2Y65PR6D3B6BR1K0"}
{"snapshot":"HNG82GMS51ECXFXFCYJG"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"SNF98D1SK7NWD5KQJM20"}
{"snapshot":"394QWZDXAY74HP6Q8P3G"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"949AXZ49X764TMDC6D4G"}
{"snapshot":"TNE0TX645A2G7VTXFA1G"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"SNF98D1SK7NWD5KQJM20"}
{"snapshot":"394QWZDXAY74HP6Q8P3G"}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
4 changes: 2 additions & 2 deletions icechunk-python/tests/test_can_read_old.py
Original file line number Diff line number Diff line change
Expand Up @@ -271,8 +271,8 @@ async def test_icechunk_can_read_old_repo() -> None:
)
assert diff.deleted_groups == set()
assert diff.deleted_arrays == set()
assert diff.updated_user_attributes == set()
assert diff.updated_zarr_metadata == set()
assert diff.updated_groups == set()
assert diff.updated_arrays == set()


if __name__ == "__main__":
Expand Down
Loading

0 comments on commit 0458b9f

Please sign in to comment.