Merge branch 'main' into ian/issue-template
mpiannucci authored Feb 24, 2025
2 parents f1a8e49 + 3bc89ad commit 45080de
Showing 118 changed files with 2,973 additions and 4,229 deletions.
1 change: 1 addition & 0 deletions .github/workflows/python-check.yaml
Original file line number Diff line number Diff line change
@@ -163,5 +163,6 @@ jobs:
python3 -m venv .venv
source .venv/bin/activate
pip install icechunk['test'] --find-links dist --force-reinstall
pip install pytest-mypy-plugins
# pass xarray's pyproject.toml so that pytest can find the `flaky` fixture
pytest -c=../../xarray/pyproject.toml -W ignore tests/run_xarray_backends_tests.py
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -9,6 +9,7 @@ unwrap_used = "warn"
panic = "warn"
todo = "warn"
unimplemented = "warn"
dbg_macro = "warn"

[workspace.metadata.release]
allow-branch = ["main"]
89 changes: 84 additions & 5 deletions Changelog.python.md
@@ -1,5 +1,80 @@
# Changelog

## Python Icechunk Library 0.2.2

### Features

- Added the ability to check out a session `as_of` a specific time. This is useful for replaying what the repo looked like at a specific point in time.
- Support for refreshable Google Cloud Storage credentials.
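
To illustrate the idea behind `as_of`, here is a minimal sketch in plain Python rather than Icechunk itself: given a history ordered newest first, pick the most recent snapshot written at or before the requested time. The `Snap` class and the `snapshot_as_of` helper are hypothetical names used only for this illustration; they are not part of the Icechunk API, and the real implementation may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snap:
    # Hypothetical stand-in for a snapshot record; not an Icechunk type.
    id: str
    written_at: datetime


def snapshot_as_of(history: Iterable[Snap], as_of: datetime) -> Optional[Snap]:
    """Return the newest snapshot written at or before `as_of`.

    `history` is assumed to be ordered newest first, like a walk from the
    branch tip back through its parents.
    """
    for snap in history:
        if snap.written_at <= as_of:
            return snap
    return None


history = [
    Snap("C", datetime(2025, 2, 20, tzinfo=timezone.utc)),
    Snap("B", datetime(2025, 2, 10, tzinfo=timezone.utc)),
    Snap("A", datetime(2025, 2, 1, tzinfo=timezone.utc)),
]
picked = snapshot_as_of(history, datetime(2025, 2, 15, tzinfo=timezone.utc))
print(picked.id)  # B
```

A request that predates the oldest snapshot returns `None` in this sketch; how Icechunk reports that case is not stated in this changelog.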

### Fixes

- Fix a bug where the clean prefix detection was hiding other errors when creating repositories.
- API now correctly uses `snapshot_id` instead of `snapshot` consistently.
- Only write `content-type` to metadata files if the target object store supports it.

## Python Icechunk Library 0.2.1

### Features

- Users can now override consistency defaults. This makes Icechunk usable with a larger set of object stores,
including those without support for conditional updates. In this setting, Icechunk loses some of its consistency guarantees.
These configuration variables are for advanced users only, and should only be changed if necessary for compatibility.

```python
class StorageSettings:
...

@property
def unsafe_use_conditional_update(self) -> bool | None:
...
@property
def unsafe_use_conditional_create(self) -> bool | None:
...
@property
def unsafe_use_metadata(self) -> bool | None:
...
```
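
For context on what these settings trade away, the following is a rough sketch of a conditional update (compare-and-swap) on a toy in-memory store. It is not Icechunk's storage code: `ToyStore`, `put_if_match`, the key name, and the integer version counter are all assumptions made for illustration. With the condition enforced, a writer working from a stale version is rejected; with it disabled, the stale write silently overwrites the newer value.

```python
class ToyStore:
    """Hypothetical in-memory object store with a per-key version counter."""

    def __init__(self, conditional: bool = True) -> None:
        self.conditional = conditional
        self._data: dict[str, tuple[int, bytes]] = {}

    def get(self, key: str) -> tuple[int, bytes]:
        # Missing keys are reported as version 0 with empty contents.
        return self._data.get(key, (0, b""))

    def put_if_match(self, key: str, value: bytes, expected_version: int) -> bool:
        current_version, _ = self.get(key)
        # When conditional updates are enforced, reject writers that
        # based their change on an out-of-date version.
        if self.conditional and current_version != expected_version:
            return False
        self._data[key] = (current_version + 1, value)
        return True


def race(store: ToyStore) -> tuple[bool, bool, bytes]:
    # Two writers read the same version, then both try to write.
    version, _ = store.get("branch.main")
    first = store.put_if_match("branch.main", b"snapshot-1", version)
    second = store.put_if_match("branch.main", b"snapshot-2", version)
    return first, second, store.get("branch.main")[1]


safe = race(ToyStore(conditional=True))
unsafe = race(ToyStore(conditional=False))
print(safe)    # (True, False, b'snapshot-1')
print(unsafe)  # (True, True, b'snapshot-2')
```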

## Python Icechunk Library 0.2.0

This release is focused on stabilizing Icechunk's on-disk serialization format. It's a non-backwards
compatible change, hopefully the last one. Data written with previous versions must be reingested to be read with
Icechunk 0.2.0.

### Features

- `Repository.ancestry` now returns an iterator, allowing interrupting the traversal of the version tree at any point.
- New on-disk format using [flatbuffers](https://flatbuffers.dev/) makes it easier to document and implement
(de-)serialization. This enables the creation of alternative readers and writers for the Icechunk format.
- `Repository.readonly_session` interprets its first positional argument as a branch name:

```python
# before:
repo.readonly_session(branch="dev")

# after:
repo.readonly_session("dev")

# still possible:
repo.readonly_session(tag="v0.1")
repo.readonly_session(branch="foo")
repo.readonly_session(snapshot_id="NXH3M0HJ7EEJ0699DPP0")
```

- Icechunk is now more resilient to changes in Zarr metadata spec, and can handle Zarr extensions.
- More documentation.

### Performance

- We have improved our benchmarks, making them more flexible and effective at finding possible regressions.
- New `Store.set_virtual_refs` method allows setting multiple virtual chunks for the same array. This
significantly speeds up the creation of virtual datasets.

### Fixes

- Fix a bug in clean prefix detection

## Python Icechunk Library 0.1.3

### Features
@@ -22,10 +97,13 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
- Icechunk generates logs now. Set the environment variable `ICECHUNK_LOG=icechunk=debug` to print debug logs to stdout. Available "levels" in order of increasing verbosity are `error`, `warn`, `info`, `debug`, `trace`. The default level is `error`. Example log:
![image](https://private-user-images.githubusercontent.com/20792/411051729-7e6de243-73f4-4863-ba79-2dde204fe6e5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTY3NTQsIm5iZiI6MTczODk1NjQ1NCwicGF0aCI6Ii8yMDc5Mi80MTEwNTE3MjktN2U2ZGUyNDMtNzNmNC00ODYzLWJhNzktMmRkZTIwNGZlNmU1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDE5MjczNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ1MzdmMDY2MDA2YjdiNzUzM2RhMGE5ZDAxZDA2NWI4ZWU3MjcyZTE0YjRkY2U0ZTZkMTcxMzQzMDVjOGQ0NGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.LnILQIXxOjkR1y6P5w6k9UREm0zOH1tIzt2vrjVcRKM)
- Icechunk can now be installed using `conda`:

```shell
conda install -c conda-forge icechunk
```

- Optionally delete branches and tags that point to expired snapshots:

```python
def expire_snapshots(
self,
@@ -35,36 +113,35 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
delete_expired_tags: bool = False,
) -> set[str]: ...
```
- More documentation. See [the Icechunk website](https://icechunk.io/)

- More documentation. See [the Icechunk website](https://icechunk.io/)

### Performance

- Faster `exists` zarr `Store` method.
- Implement `Store.getsize_prefix` method. This significantly speeds up `info_complete`.


### Fixes

- Default regular expression to preload manifests.


## Python Icechunk Library 0.1.1

### Fixes

- Session deserialization error when using distributed writes


## Python Icechunk Library 0.1.0

### Features

- Expiration and garbage collection. It's now possible to maintain only recent versions of the repository, reclaiming the storage used exclusively by expired versions.
- Allow an arbitrary map of properties to commits. Example:

```
session.commit("some message", metadata={"author": "icechunk-team"})
```

These properties can be retrieved via `ancestry`.
- New `chunk_coordinates` function to list all initialized chunks in an array.
- It's now possible to delete tags. To preserve the immutability of snapshots pointed to by a tag, creating a new tag with the same name is not allowed.
@@ -89,7 +166,6 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
- Bad manifest split in unmodified arrays
- Documentation was updated to the latest API.


## Python Icechunk Library 0.1.0a15

### Fixes
@@ -104,6 +180,7 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
- The snapshot now keeps track of the chunk space bounding box for each manifest
- Configuration settings can now be overridden on a field-by-field basis.
Example:

```python
config = icechunk.RepositoryConfig(inline_chunk_threshold_byte=0)
storage = ...
@@ -113,6 +190,7 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
config=config,
)
```

will use 0 for `inline_chunk_threshold_byte` but all other configuration fields will come from
the repository persistent config. If persistent config is not set, configuration defaults will
take its place.
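
The layering described above can be sketched in plain Python. This is only an illustration of the precedence order (user-supplied field, then persistent config, then defaults), not Icechunk's implementation: `ToyConfig` and `resolve` are hypothetical, the field names mirror the ones used in this changelog, and the default values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyConfig:
    # None means "not set at this layer". Hypothetical class, not icechunk.RepositoryConfig.
    inline_chunk_threshold_byte: Optional[int] = None
    get_partial_values_concurrency: Optional[int] = None


def resolve(
    user: ToyConfig, persistent: Optional[ToyConfig], defaults: ToyConfig
) -> ToyConfig:
    """Pick each field from the first layer that sets it."""
    layers = [user]
    if persistent is not None:
        layers.append(persistent)
    layers.append(defaults)

    resolved = {}
    for f in fields(ToyConfig):
        values = (getattr(layer, f.name) for layer in layers)
        # Compare against None explicitly so that 0 counts as a real value.
        resolved[f.name] = next((v for v in values if v is not None), None)
    return ToyConfig(**resolved)


# Made-up defaults, for illustration only.
defaults = ToyConfig(inline_chunk_threshold_byte=512, get_partial_values_concurrency=10)
persistent = ToyConfig(get_partial_values_concurrency=4)
user = ToyConfig(inline_chunk_threshold_byte=0)

merged = resolve(user, persistent, defaults)
no_persistent = resolve(user, None, defaults)
print(merged)
print(no_persistent)
```

The explicit `is not None` check matters here: a falsy override such as `0` must still win over the lower layers.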
@@ -147,6 +225,7 @@ on what happened, and what was Icechunk doing when the exception was raised. Exa
config=config,
)
- `ancestry` function can now receive a branch/tag name or a snapshot id

- `set_virtual_ref` can now validate the virtual chunk container exists

```
Binary file added docs/docs/assets/storage/tigris-region-set.png
8 changes: 8 additions & 0 deletions docs/docs/contributing.md
@@ -16,6 +16,11 @@ Icechunk is an open source (Apache 2.0) project and welcomes contributions in th
## Development

### Python Development Workflow
The Python code is developed in the `icechunk-python` subdirectory. To make changes, first enter that directory:

```bash
cd icechunk-python
```

Create / activate a virtual environment:

@@ -43,6 +48,9 @@ Build the project in dev mode:

```bash
maturin develop

# or with the optional dependencies
maturin develop --extras=test,benchmark
```

or build the project in editable mode:
4 changes: 2 additions & 2 deletions docs/docs/icechunk-python/cheatsheets/git-users.md
@@ -81,7 +81,7 @@ At this point, the tip of the branch is now the snapshot `198273178639187` and a
In Icechunk, you can view the history of a branch by using the [`repo.ancestry()`](../reference/#icechunk.Repository.ancestry) command, similar to the `git log` command.

```python
repo.ancestry(branch="my-new-branch")
[ancestor for ancestor in repo.ancestry(branch="my-new-branch")]

#[Snapshot(id='198273178639187', ...), ...]
```
@@ -156,7 +156,7 @@ We can also view the history of a tag by using the [`repo.ancestry()`](../refere
repo.ancestry(tag="my-new-tag")
```

This will return a list of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.
This will return an iterator of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.

```python
repo.lookup_tag("my-new-tag")
4 changes: 0 additions & 4 deletions docs/docs/icechunk-python/configuration.md
@@ -22,10 +22,6 @@ It allows you to configure the following parameters:

The threshold for when to inline a chunk into a manifest instead of storing it as a separate object in the storage backend.

### [`unsafe_overwrite_refs`](./reference.md#icechunk.RepositoryConfig.unsafe_overwrite_refs)

Whether to allow overwriting references in the repository.

### [`get_partial_values_concurrency`](./reference.md#icechunk.RepositoryConfig.get_partial_values_concurrency)

The number of concurrent requests to make when getting partial values from storage.
4 changes: 2 additions & 2 deletions docs/docs/icechunk-python/quickstart.md
@@ -135,7 +135,7 @@ snapshot_id_2 = session_2.commit("overwrite some values")
We can see the full version history of our repo:

```python
hist = repo.ancestry(snapshot=snapshot_id_2)
hist = repo.ancestry(snapshot_id=snapshot_id_2)
for ancestor in hist:
print(ancestor.id, ancestor.message, ancestor.written_at)

@@ -151,7 +151,7 @@ for ancestor in hist:
# latest version
assert array[0] == 2
# check out earlier snapshot
earlier_session = repo.readonly_session(snapshot=hist[1].id)
earlier_session = repo.readonly_session(snapshot_id=hist[1].id)
store = earlier_session.store

# get the array
78 changes: 8 additions & 70 deletions docs/docs/icechunk-python/version-control.md
@@ -27,7 +27,7 @@ repo = icechunk.Repository.create(icechunk.in_memory_storage())
On creating a new [`Repository`](../reference/#icechunk.Repository), it will automatically create a `main` branch with an initial snapshot. We can take a look at the ancestry of the `main` branch to confirm this.

```python
repo.ancestry(branch="main")
[ancestor for ancestor in repo.ancestry(branch="main")]

# [SnapshotInfo(id="A840RMN5CF807CM66RY0", parent_id=None, written_at=datetime.datetime(2025,1,30,19,52,41,592998, tzinfo=datetime.timezone.utc), message="Repository...")]
```
@@ -36,7 +36,7 @@ repo.ancestry(branch="main")

The [`ancestry`](./reference/#icechunk.Repository.ancestry) method can be used to inspect the ancestry of any branch, snapshot, or tag.

We get back a list of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
We get back an iterator of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
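
Because the result is an iterator rather than a list, it is consumed lazily and cannot be indexed directly. The sketch below shows the general pattern with a stand-in generator; `toy_ancestry` and the `PARENTS` table are hypothetical and only mimic a walk through parent links, and the concrete iterator type Icechunk returns may behave differently in its details.

```python
from itertools import islice
from typing import Iterator, Optional

# Hypothetical parent links standing in for a snapshot history.
PARENTS: dict[str, Optional[str]] = {"C": "B", "B": "A", "A": None}


def toy_ancestry(tip: str) -> Iterator[str]:
    """Yield snapshot ids from the tip back to the root, one at a time."""
    current: Optional[str] = tip
    while current is not None:
        yield current
        current = PARENTS[current]


# Stop early: only the first two ancestors are ever visited.
first_two = list(islice(toy_ancestry("C"), 2))

# Materialize the whole history when random access is needed.
everything = list(toy_ancestry("C"))

# A generator cannot be indexed, so this raises TypeError.
try:
    toy_ancestry("C")[1]
    indexable = True
except TypeError:
    indexable = False

print(first_two, everything, indexable)
```

Collecting into a list first, as the updated snippets in this commit do with a list comprehension, is the simplest way to keep code that relies on indexing working.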

## Creating a snapshot

@@ -270,20 +270,16 @@ session2 = repo.writable_session("main")

root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)
```

First, we'll modify the attributes of the root group from both sessions.

```python
root1.attrs["foo"] = "bar"
root2.attrs["foo"] = "baz"
root1["data"][0,0] = 1
root2["data"][0,:] = 2
```

and then try to commit the changes.

```python
session1.commit(message="Update foo attribute on root group")
session2.commit(message="Update foo attribute on root group")
session1.commit(message="Update first element of data array")
session2.commit(message="Update first row of data array")

# AE9XS2ZWXT861KD2JGHG
# ---------------------------------------------------------------------------
@@ -327,65 +323,7 @@ session2.rebase(icechunk.ConflictDetector())
# RebaseFailedError: Rebase failed on snapshot AE9XS2ZWXT861KD2JGHG: 1 conflicts found
```

This however fails because both sessions modified the `foo` attribute on the root group. We can use the `ConflictError` to get more information about the conflict.

```python
try:
session2.rebase(icechunk.ConflictDetector())
except icechunk.RebaseFailedError as e:
print(e.conflicts)

# [Conflict(UserAttributesDoubleUpdate, path=/)]
```

This tells us that the conflict is caused by the two sessions modifying the user attributes of the root group (`/`). In this case we have decided that the second session set the `foo` attribute to the correct value, so we can now try to rebase by instructing the `rebase` method to use the second session's changes with the [`BasicConflictSolver`](../reference/#icechunk.BasicConflictSolver).

```python
session2.rebase(icechunk.BasicConflictSolver(on_user_attributes_conflict=icechunk.VersionSelection.UseOurs))
```

Success! We can now try and commit the changes again.

```python
session2.commit(message="Update foo attribute on root group")

# 'SY4WRE8A9TVYMTJPEAHG'
```

This same process can be used to resolve conflicts with arrays. Let's try to modify the `data` array from both sessions.

```python
session1 = repo.writable_session("main")
session2 = repo.writable_session("main")

root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)

root1["data"][0,0] = 1
root2["data"][0,:] = 2
```

We have now created a conflict, because the first session modified the first element of the `data` array, and the second session modified the first row of the `data` array. Let's commit the changes from the second session first, then see what conflicts are reported when we try to commit the changes from the first session.

```python
print(session2.commit(message="Update first row of data array"))
print(session1.commit(message="Update first element of data array"))

# ---------------------------------------------------------------------------
# ConflictError Traceback (most recent call last)
# Cell In[15], line 2
# 1 print(session2.commit(message="Update first row of data array"))
# ----> 2 print(session1.commit(message="Update first element of data array"))

# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:224, in Session.commit(self, message, metadata)
# 222 return self._session.commit(message, metadata)
# 223 except PyConflictError as e:
# --> 224 raise ConflictError(e) from None

# ConflictError: Failed to commit, expected parent: Some("SY4WRE8A9TVYMTJPEAHG"), actual parent: Some("5XRDGZPSG747AMMRTWT0")
```

Okay! We have a conflict. Let's see what conflicts are reported.
This however fails because both sessions modified metadata. We can use the `RebaseFailedError` to get more information about the conflict.

```python
try:
@@ -470,4 +408,4 @@ root["data"][:,:]

#### Limitations

At the moment, the rebase functionality is limited to resolving conflicts with attributes on arrays and groups, and conflicts with chunks in arrays. Other types of conflicts are not able to be resolved by icechunk yet and must be resolved manually.
At the moment, the rebase functionality is limited to resolving conflicts with chunks in arrays. Other types of conflicts are not able to be resolved by icechunk yet and must be resolved manually.
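
As a conceptual aid, a chunk-level conflict and its resolution can be modelled with plain dictionaries. This is a toy model, not Icechunk's rebase logic: `Selection`, `find_conflicts`, and `toy_rebase` are hypothetical names, and the example assumes every array element is its own chunk so that the writes from the walkthrough above (`[0,0] = 1` and `[0,:] = 2` on a three-column row) overlap in exactly one chunk.

```python
from enum import Enum


class Selection(Enum):
    # Hypothetical stand-in for a "use ours / use theirs" policy.
    USE_OURS = "ours"
    USE_THEIRS = "theirs"


Chunk = tuple[int, int]


def find_conflicts(ours: dict[Chunk, int], theirs: dict[Chunk, int]) -> set[Chunk]:
    """Chunks written by both sessions are in conflict."""
    return set(ours) & set(theirs)


def toy_rebase(
    ours: dict[Chunk, int], theirs: dict[Chunk, int], selection: Selection
) -> dict[Chunk, int]:
    """Replay our writes on top of theirs, applying the policy to overlaps."""
    merged = dict(theirs)
    for coord, value in ours.items():
        if coord in theirs and selection is Selection.USE_THEIRS:
            continue  # keep the value that was committed first
        merged[coord] = value
    return merged


theirs = {(0, 0): 1}                       # already committed by the other session
ours = {(0, 0): 2, (0, 1): 2, (0, 2): 2}   # our pending writes to the first row

conflicts = find_conflicts(ours, theirs)
keep_ours = toy_rebase(ours, theirs, Selection.USE_OURS)
keep_theirs = toy_rebase(ours, theirs, Selection.USE_THEIRS)
print(conflicts, keep_ours, keep_theirs)
```

Non-overlapping writes are always kept in this model; only the shared chunk `(0, 0)` depends on the chosen policy.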
2 changes: 1 addition & 1 deletion docs/docs/icechunk-python/xarray.md
@@ -154,7 +154,7 @@ xr.open_zarr(session.store, consolidated=False)
We can also read data from previous snapshots by checking out prior versions:

```python
session = repo.readonly_session(snapshot=first_snapshot)
session = repo.readonly_session(snapshot_id=first_snapshot)

xr.open_zarr(session.store, consolidated=False)
# <xarray.Dataset> Size: 9MB