Merge branch 'main' into ian/issue-template

earth-mover · Feb 24, 2025 · 45080de
2 parents f1a8e49 + 3bc89ad
Showing 118 changed files with 2,973 additions and 4,229 deletions.
diff --git a/.github/workflows/python-check.yaml b/.github/workflows/python-check.yaml
@@ -163,5 +163,6 @@ jobs:
           python3 -m venv .venv
           source .venv/bin/activate
           pip install icechunk['test'] --find-links dist --force-reinstall
+          pip install pytest-mypy-plugins
           # pass xarray's pyproject.toml so that pytest can find the `flaky` fixture
           pytest -c=../../xarray/pyproject.toml -W ignore tests/run_xarray_backends_tests.py
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -9,6 +9,7 @@ unwrap_used = "warn"
 panic = "warn"
 todo = "warn"
 unimplemented = "warn"
+dbg_macro = "warn"
 
 [workspace.metadata.release]
 allow-branch = ["main"]

diff --git a/Changelog.python.md b/Changelog.python.md
@@ -1,5 +1,80 @@
 # Changelog
 
+## Python Icechunk Library 0.2.2
+
+### Features
+
+- Added the ability to check out a session `as_of` a specific time. This is useful for replaying the state of the repo at a specific point in time.
+- Support for refreshable Google Cloud Storage credentials.
+
+### Fixes
+
+- Fix a bug where the clean prefix detection was hiding other errors when creating repositories.
+- API now correctly uses `snapshot_id` instead of `snapshot` consistently.
+- Only write `content-type` to metadata files if the target object store supports it.
+
+## Python Icechunk Library 0.2.1
+
+### Features
+
+- Users can now override consistency defaults. With this, Icechunk is usable in a larger set of object stores,
+including those without support for conditional updates. In this setting, Icechunk loses some of its consistency guarantees.
+These configuration variables are for advanced users only, and should only be changed if necessary for compatibility.
+
+  ```python
+  class StorageSettings:
+    ...
+
+    @property
+    def unsafe_use_conditional_update(self) -> bool | None:
+        ...
+    @property
+    def unsafe_use_conditional_create(self) -> bool | None:
+        ...
+    @property
+    def unsafe_use_metadata(self) -> bool | None:
+        ...
+  ```
+
+## Python Icechunk Library 0.2.0
+
+This release is focused on stabilizing Icechunk's on-disk serialization format. It's a non-backwards
+compatible change, hopefully the last one. Data written with previous versions must be reingested to be read with
+Icechunk 0.2.0.
+
+### Features
+
+- `Repository.ancestry` now returns an iterator, allowing interrupting the traversal of the version tree at any point.
+- New on-disk format using [flatbuffers](https://flatbuffers.dev/) makes it easier to document and implement
+(de-)serialization. This enables the creation of alternative readers and writers for the Icechunk format.
+- `Repository.readonly_session` interprets its first positional argument as a branch name:
+
+```python
+# before:
+repo.readonly_session(branch="dev")
+
+# after:
+repo.readonly_session("dev")
+
+# still possible:
+repo.readonly_session(tag="v0.1")
+repo.readonly_session(branch="foo")
+repo.readonly_session(snapshot_id="NXH3M0HJ7EEJ0699DPP0")
+```
+
+- Icechunk is now more resilient to changes in Zarr metadata spec, and can handle Zarr extensions.
+- More documentation.
+
+### Performance
+
+- We have improved our benchmarks, making them more flexible and effective at finding possible regressions.
+- New `Store.set_virtual_refs` method allows setting multiple virtual chunks for the same array. This
+significantly speeds up the creation of virtual datasets.
+
+### Fixes
+
+- Fix a bug in clean prefix detection
+
 ## Python Icechunk Library 0.1.3
 
 ### Features
@@ -22,10 +97,13 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
 - Icechunk generates logs now. Set the environment variable `ICECHUNK_LOG=icechunk=debug` to print debug logs to stdout. Available "levels" in order of increasing verbosity are `error`, `warn`, `info`, `debug`, `trace`. The default level is `error`. Example log:
   ![image](https://private-user-images.githubusercontent.com/20792/411051729-7e6de243-73f4-4863-ba79-2dde204fe6e5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTY3NTQsIm5iZiI6MTczODk1NjQ1NCwicGF0aCI6Ii8yMDc5Mi80MTEwNTE3MjktN2U2ZGUyNDMtNzNmNC00ODYzLWJhNzktMmRkZTIwNGZlNmU1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDE5MjczNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ1MzdmMDY2MDA2YjdiNzUzM2RhMGE5ZDAxZDA2NWI4ZWU3MjcyZTE0YjRkY2U0ZTZkMTcxMzQzMDVjOGQ0NGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.LnILQIXxOjkR1y6P5w6k9UREm0zOH1tIzt2vrjVcRKM)
 - Icechunk can now be installed using `conda`:
+
   ```shell
   conda install -c conda-forge icechunk
   ```
+
 - Optionally delete branches and tags that point to expired snapshots:
+
   ```python
     def expire_snapshots(
         self,
@@ -35,36 +113,35 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
         delete_expired_tags: bool = False,
     ) -> set[str]: ...
   ```
-- More documentation. See [the Icechunk website](https://icechunk.io/)
 
+- More documentation. See [the Icechunk website](https://icechunk.io/)
 
 ### Performance
 
 - Faster `exists` zarr `Store` method.
 - Implement `Store.getsize_prefix` method. This significantly speeds up `info_complete`.
 
-
 ### Fixes
 
 - Default regular expression to preload manifests.
 
-
 ## Python Icechunk Library 0.1.1
 
 ### Fixes
 
 - Session deserialization error when using distributed writes
 
-
 ## Python Icechunk Library 0.1.0
 
 ### Features
 
 - Expiration and garbage collection. It's now possible to maintain only recent versions of the repository, reclaiming the storage used exclusively by expired versions.
 - Allow an arbitrary map of properties to commits. Example:
+
   ```
   session.commit("some message", metadata={"author": "icechunk-team"})
   ```
+
   These properties can be retrieved via `ancestry`.
 - New `chunk_coordinates` function to list all initialized chunks in an array.
 - It's now possible to delete tags. New tags with the same name won't be allowed to preserve the immutability of snapshots pointed by a tag.
@@ -89,7 +166,6 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
 - Bad manifest split in unmodified arrays
 - Documentation was updated to the latest API.
 
-
 ## Python Icechunk Library 0.1.0a15
 
 ### Fixes
@@ -104,6 +180,7 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
 - The snapshot now keeps track of the chunk space bounding box for each manifest
 - Configuration settings can now be overridden on a field-by-field basis
   Example:
+
   ```python
    config = icechunk.RepositoryConfig(inline_chunk_threshold_byte=0)
    storage = ...
@@ -113,6 +190,7 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
        config=config,
    )
   ```
+
   will use 0 for `inline_chunk_threshold_byte` but all other configuration fields will come from
   the repository persistent config. If persistent config is not set, configuration defaults will
   take its place.
@@ -147,6 +225,7 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
        config=config,
    )
 - `ancestry` function can now receive a branch/tag name or a snapshot id
+
 - `set_virtual_ref` can now validate the virtual chunk container exists
 
   ```

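Reviewer note on the `Repository.ancestry` entry in the 0.2.0 changelog above: returning an iterator lets callers stop walking the history early instead of materializing every snapshot. The sketch below is a minimal, standalone illustration of that pattern. It does not use Icechunk; `SNAPSHOTS`, `walk_ancestry`, `visited`, and the snapshot ids are hypothetical names invented for this example.

```python
from itertools import islice
from typing import Dict, Iterator, List, Optional

# Hypothetical in-memory history: snapshot id -> parent id (None for the root).
SNAPSHOTS: Dict[str, Optional[str]] = {
    "s3": "s2",
    "s2": "s1",
    "s1": "s0",
    "s0": None,
}

# Records which snapshots were actually visited during traversal.
visited: List[str] = []


def walk_ancestry(tip: str) -> Iterator[str]:
    """Lazily yield snapshot ids from `tip` back to the root."""
    current: Optional[str] = tip
    while current is not None:
        visited.append(current)
        yield current
        current = SNAPSHOTS[current]


# Take only the two most recent snapshots; older ones are never visited.
latest_two = list(islice(walk_ancestry("s3"), 2))
print(latest_two)
```

Wrapping the iterator in `list(...)` recovers the old list-like behavior when the full history is needed, at the cost of traversing every ancestor.
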
diff --git a/docs/docs/assets/storage/tigris-region-set.png b/docs/docs/assets/storage/tigris-region-set.png
diff --git a/docs/docs/contributing.md b/docs/docs/contributing.md
@@ -16,6 +16,11 @@ Icechunk is an open source (Apache 2.0) project and welcomes contributions in th
 ## Development
 
 ### Python Development Workflow
+The Python code is developed in the `icechunk-python` subdirectory. To make changes, first enter that directory:
+
+```bash
+cd icechunk-python
+```
 
 Create / activate a virtual environment:
 
@@ -43,6 +48,9 @@ Build the project in dev mode:
 
 ```bash
 maturin develop
+
+# or with the optional dependencies
+maturin develop --extras=test,benchmark
 ```
 
 or build the project in editable mode:

diff --git a/docs/docs/icechunk-python/cheatsheets/git-users.md b/docs/docs/icechunk-python/cheatsheets/git-users.md
@@ -81,7 +81,7 @@ At this point, the tip of the branch is now the snapshot `198273178639187` and a
 In Icechunk, you can view the history of a branch by using the [`repo.ancestry()`](../reference/#icechunk.Repository.ancestry) command, similar to the `git log` command.
 
 ```python
-repo.ancestry(branch="my-new-branch")
+[ancestor for ancestor in repo.ancestry(branch="my-new-branch")]
 
 #[Snapshot(id='198273178639187', ...), ...]
 ```
@@ -156,7 +156,7 @@ We can also view the history of a tag by using the [`repo.ancestry()`](../refere
 repo.ancestry(tag="my-new-tag")
 ```
 
-This will return a list of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.
+This will return an iterator of snapshots that are ancestors of the tag. Similar to branches, we can look up the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.
 
 ```python
 repo.lookup_tag("my-new-tag")

diff --git a/docs/docs/icechunk-python/configuration.md b/docs/docs/icechunk-python/configuration.md
@@ -22,10 +22,6 @@ It allows you to configure the following parameters:
 
 The threshold for when to inline a chunk into a manifest instead of storing it as a separate object in the storage backend.
 
-### [`unsafe_overwrite_refs`](./reference.md#icechunk.RepositoryConfig.unsafe_overwrite_refs)
-
-Whether to allow overwriting references in the repository.
-
 ### [`get_partial_values_concurrency`](./reference.md#icechunk.RepositoryConfig.get_partial_values_concurrency)
 
 The number of concurrent requests to make when getting partial values from storage.

diff --git a/docs/docs/icechunk-python/quickstart.md b/docs/docs/icechunk-python/quickstart.md
@@ -135,7 +135,7 @@ snapshot_id_2 = session_2.commit("overwrite some values")
 We can see the full version history of our repo:
 
 ```python
-hist = repo.ancestry(snapshot=snapshot_id_2)
+hist = list(repo.ancestry(snapshot_id=snapshot_id_2))  # materialize the iterator so hist[1] below works
 for ancestor in hist:
     print(ancestor.id, ancestor.message, ancestor.written_at)
 
@@ -151,7 +151,7 @@ for ancestor in hist:
 # latest version
 assert array[0] == 2
 # check out earlier snapshot
-earlier_session = repo.readonly_session(snapshot=hist[1].id)
+earlier_session = repo.readonly_session(snapshot_id=hist[1].id)
 store = earlier_session.store
 
 # get the array

diff --git a/docs/docs/icechunk-python/version-control.md b/docs/docs/icechunk-python/version-control.md
@@ -27,7 +27,7 @@ repo = icechunk.Repository.create(icechunk.in_memory_storage())
 On creating a new [`Repository`](../reference/#icechunk.Repository), it will automatically create a `main` branch with an initial snapshot. We can take a look at the ancestry of the `main` branch to confirm this.
 
 ```python
-repo.ancestry(branch="main")
+[ancestor for ancestor in repo.ancestry(branch="main")]
 
 # [SnapshotInfo(id="A840RMN5CF807CM66RY0", parent_id=None, written_at=datetime.datetime(2025,1,30,19,52,41,592998, tzinfo=datetime.timezone.utc), message="Repository...")]
 ```
@@ -36,7 +36,7 @@ repo.ancestry(branch="main")
 
     The [`ancestry`](./reference/#icechunk.Repository.ancestry) method can be used to inspect the ancestry of any branch, snapshot, or tag.
 
-We get back a list of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
+We get back an iterator of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
 
 ## Creating a snapshot
 
@@ -270,20 +270,16 @@ session2 = repo.writable_session("main")
 
 root1 = zarr.group(session1.store)
 root2 = zarr.group(session2.store)
-```
-
-First, we'll modify the attributes of the root group from both sessions.
 
-```python
-root1.attrs["foo"] = "bar"
-root2.attrs["foo"] = "baz"
+root1["data"][0,0] = 1
+root2["data"][0,:] = 2
 ```
 
 and then try to commit the changes.
 
 ```python
-session1.commit(message="Update foo attribute on root group")
-session2.commit(message="Update foo attribute on root group")
+session1.commit(message="Update first element of data array")
+session2.commit(message="Update first row of data array")
 
 # AE9XS2ZWXT861KD2JGHG
 # ---------------------------------------------------------------------------
@@ -327,65 +323,7 @@ session2.rebase(icechunk.ConflictDetector())
 # RebaseFailedError: Rebase failed on snapshot AE9XS2ZWXT861KD2JGHG: 1 conflicts found
 ```
 
-This however fails because both sessions modified the `foo` attribute on the root group. We can use the `ConflictError` to get more information about the conflict.
-
-```python
-try:
-    session2.rebase(icechunk.ConflictDetector())
-except icechunk.RebaseFailedError as e:
-    print(e.conflicts)
-
-# [Conflict(UserAttributesDoubleUpdate, path=/)]
-```
-
-This tells us that the conflict is caused by the two sessions modifying the user attributes of the root group (`/`). In this casewe have decided that second session set the `foo` attribute to the correct value, so we can now try to rebase by instructing the `rebase` method to use the second session's changes with the [`BasicConflictSolver`](../reference/#icechunk.BasicConflictSolver).
-
-```python
-session2.rebase(icechunk.BasicConflictSolver(on_user_attributes_conflict=icechunk.VersionSelection.UseOurs))
-```
-
-Success! We can now try and commit the changes again.
-
-```python
-session2.commit(message="Update foo attribute on root group")
-
-# 'SY4WRE8A9TVYMTJPEAHG'
-```
-
-This same process can be used to resolve conflicts with arrays. Let's try to modify the `data` array from both sessions.
-
-```python
-session1 = repo.writable_session("main")
-session2 = repo.writable_session("main")
-
-root1 = zarr.group(session1.store)
-root2 = zarr.group(session2.store)
-
-root1["data"][0,0] = 1
-root2["data"][0,:] = 2
-```
-
-We have now created a conflict, because the first session modified the first element of the `data` array, and the second session modified the first row of the `data` array. Let's commit the changes from the second session first, then see what conflicts are reported when we try to commit the changes from the first session.
-
-```python
-print(session2.commit(message="Update first row of data array"))
-print(session1.commit(message="Update first element of data array"))
-
-# ---------------------------------------------------------------------------
-# ConflictError                             Traceback (most recent call last)
-# Cell In[15], line 2
-#      1 print(session2.commit(message="Update first row of data array"))
-# ----> 2 print(session1.commit(message="Update first element of data array"))
-
-# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:224, in Session.commit(self, message, metadata)
-#     222     return self._session.commit(message, metadata)
-#     223 except PyConflictError as e:
-# --> 224     raise ConflictError(e) from None
-
-# ConflictError: Failed to commit, expected parent: Some("SY4WRE8A9TVYMTJPEAHG"), actual parent: Some("5XRDGZPSG747AMMRTWT0")
-```
-
-Okay! We have a conflict. Lets see what conflicts are reported.
+This, however, fails because both sessions modified metadata. We can catch the `RebaseFailedError` to get more information about the conflict.
 
 ```python
 try:
@@ -470,4 +408,4 @@ root["data"][:,:]
 
 #### Limitations
 
-At the moment, the rebase functionality is limited to resolving conflicts with attributes on arrays and groups, and conflicts with chunks in arrays. Other types of conflicts are not able to be resolved by icechunk yet and must be resolved manually.
+At the moment, the rebase functionality is limited to resolving conflicts with chunks in arrays. Other types of conflicts cannot yet be resolved by Icechunk and must be resolved manually.
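
Reviewer note: the `ConflictError` discussed in this file ("expected parent ..., actual parent ...") is what an optimistic-concurrency check on the branch tip looks like from the caller's side. The sketch below is a toy model of that check in plain Python, not Icechunk's implementation; `ToyRepo`, `ToySession`, and `ToyConflictError` are hypothetical names, and the toy `rebase` skips the conflict detection a real rebase has to perform.

```python
import itertools


class ToyConflictError(Exception):
    """Raised when the branch tip moved after the session was opened."""


class ToyRepo:
    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.tip = "snap-0"  # current snapshot of the single branch

    def session(self) -> "ToySession":
        return ToySession(self, parent=self.tip)

    def next_id(self) -> str:
        return f"snap-{next(self._ids)}"


class ToySession:
    def __init__(self, repo: ToyRepo, parent: str) -> None:
        self.repo = repo
        self.parent = parent  # tip observed when the session was opened

    def commit(self) -> str:
        # Compare-and-swap: only advance the tip if nobody else has.
        if self.repo.tip != self.parent:
            raise ToyConflictError(
                f"expected parent: {self.parent}, actual parent: {self.repo.tip}"
            )
        self.repo.tip = self.repo.next_id()
        return self.repo.tip

    def rebase(self) -> None:
        # Toy rebase: adopt the new tip without checking for conflicts.
        self.parent = self.repo.tip


repo = ToyRepo()
session1, session2 = repo.session(), repo.session()

first = session1.commit()  # succeeds, tip moves to snap-1
try:
    session2.commit()  # fails: session2 still expects snap-0
    conflicted = False
except ToyConflictError:
    conflicted = True

session2.rebase()
second = session2.commit()  # succeeds after rebasing onto snap-1
```

The second commit only goes through after the session's expected parent is brought up to date, which mirrors the commit, fail, rebase, commit sequence described in the document.
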
diff --git a/docs/docs/icechunk-python/xarray.md b/docs/docs/icechunk-python/xarray.md
@@ -154,7 +154,7 @@ xr.open_zarr(session.store, consolidated=False)
 We can also read data from previous snapshots by checking out prior versions:
 
 ```python
-session = repo.readonly_session(snapshot=first_snapshot)
+session = repo.readonly_session(snapshot_id=first_snapshot)
 
 xr.open_zarr(session.store, consolidated=False)
 # <xarray.Dataset> Size: 9MB