- `Repository` can now be pickled.
- `icechunk.print_debug_info()` now prints out relevant information about the installed version of icechunk and its dependencies.
- `icechunk.Storage` now supports `__repr__`. Only configuration values will be printed, no credentials.
- Fixes a missing export for Google Cloud Storage credentials.
- Added the ability to checkout a session `as_of` a specific time. This is useful for replaying what the repo looked like at a specific point in time.
- Support for refreshable Google Cloud Storage credentials.
- Fix a bug where the clean prefix detection was hiding other errors when creating repositories.
- The API now consistently uses `snapshot_id` instead of `snapshot`.
- Only write `content-type` to metadata files if the target object store supports it.
- Users can now override consistency defaults. With this, Icechunk is usable in a larger set of object stores, including those without support for conditional updates. In this setting, Icechunk loses some of its consistency guarantees. These configuration variables are for advanced users only, and should only be changed if necessary for compatibility.

```python
class StorageSettings:
    ...

    @property
    def unsafe_use_conditional_update(self) -> bool | None: ...

    @property
    def unsafe_use_conditional_create(self) -> bool | None: ...

    @property
    def unsafe_use_metadata(self) -> bool | None: ...
```
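For example, a repository hosted on an object store without compare-and-swap support could relax those checks. This is a minimal sketch only; it assumes `StorageSettings` accepts these flags as constructor arguments and that `RepositoryConfig` exposes a mutable `storage` field:

```python
import icechunk

# Sketch, not a verified recipe: relax consistency checks for an object
# store that lacks conditional updates. Constructor arguments and the
# mutable `storage` field are assumptions.
config = icechunk.RepositoryConfig.default()
config.storage = icechunk.StorageSettings(
    unsafe_use_conditional_update=False,
    unsafe_use_conditional_create=False,
)

storage = ...  # your object store configuration
repo = icechunk.Repository.open(storage=storage, config=config)
```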
This release is focused on stabilizing Icechunk's on-disk serialization format. It's a non-backwards compatible change, hopefully the last one. Data written with previous versions must be reingested to be read with Icechunk 0.2.0.
- `Repository.ancestry` now returns an iterator, allowing the traversal of the version tree to be interrupted at any point (see the sketch below).
- New on-disk format using flatbuffers makes it easier to document and implement (de-)serialization. This enables the creation of alternative readers and writers for the Icechunk format.
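A minimal sketch of interrupting the `Repository.ancestry` iterator early; the `branch` keyword and the snapshot attribute names are assumptions:

```python
# Sketch: walk the version history lazily and stop after a few snapshots.
# The `branch` keyword and the `id`/`message` attribute names are assumptions.
for i, snapshot in enumerate(repo.ancestry(branch="main")):
    print(snapshot.id, snapshot.message)
    if i >= 4:
        break  # no need to fetch the rest of the history
```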
- `Repository.readonly_session` now interprets its first positional argument as a branch name:

```python
# before:
repo.readonly_session(branch="dev")
# after:
repo.readonly_session("dev")
# still possible:
repo.readonly_session(tag="v0.1")
repo.readonly_session(branch="foo")
repo.readonly_session(snapshot_id="NXH3M0HJ7EEJ0699DPP0")
```
- Icechunk is now more resilient to changes in the Zarr metadata spec, and can handle Zarr extensions.
- More documentation.
- We have improved our benchmarks, making them more flexible and effective at finding possible regressions.
- New `Store.set_virtual_refs` method allows setting multiple virtual chunks for the same array. This significantly speeds up the creation of virtual datasets.
- Fix a bug in clean prefix detection
- Repositories can now evaluate the `diff` between two snapshots (see the sketch below).
- Sessions can show the current `status` of the working copy.
- Adds the ability to specify bearer tokens for authenticating with Google Cloud Storage.
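A hedged sketch of the `diff` and `status` calls above; the keyword names for `diff` and the exact shape of `status` are assumptions, not confirmed signatures:

```python
# Sketch: compare two snapshots and inspect uncommitted changes.
# Keyword names (from_snapshot_id / to_snapshot_id) are assumptions.
changes = repo.diff(
    from_snapshot_id="OLDER_SNAPSHOT_ID",
    to_snapshot_id="NEWER_SNAPSHOT_ID",
)
print(changes)

session = repo.writable_session("main")
print(session.status())  # what the working copy would commit
```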
- Don't write `dimension_names` to the Zarr metadata if no dimension names are set. Previously, `null` was written.
- Improved error messages. Exceptions raised by Icechunk now include much more information about what happened and what Icechunk was doing when the exception was raised.
- Icechunk now generates logs. Set the environment variable `ICECHUNK_LOG=icechunk=debug` to print debug logs to stdout. Available levels, in order of increasing verbosity, are `error`, `warn`, `info`, `debug`, and `trace`. The default level is `error`.
- Icechunk can now be installed using `conda`:

```
conda install -c conda-forge icechunk
```
- Optionally delete branches and tags that point to expired snapshots:

```python
def expire_snapshots(
    self,
    older_than: datetime.datetime,
    *,
    delete_expired_branches: bool = False,
    delete_expired_tags: bool = False,
) -> set[str]: ...
```
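A usage sketch against the signature above, assuming the method is called on a `Repository` instance:

```python
import datetime

# Sketch: expire snapshots older than 30 days and also drop branches and
# tags that only point at expired snapshots.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
expired_ids = repo.expire_snapshots(
    cutoff,
    delete_expired_branches=True,
    delete_expired_tags=True,
)
print(f"expired {len(expired_ids)} snapshots")
```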
- More documentation. See the Icechunk website.
- Faster `exists` zarr `Store` method.
- Implement `Store.getsize_prefix` method. This significantly speeds up `info_complete`.
- Default regular expression to preload manifests.
- Session deserialization error when using distributed writes
- Expiration and garbage collection. It's now possible to maintain only recent versions of the repository, reclaiming the storage used exclusively by expired versions.
- Allow attaching an arbitrary map of properties to commits. These properties can be retrieved via `ancestry` (see the sketch below). Example:

```python
session.commit("some message", metadata={"author": "icechunk-team"})
```
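One possible way to read those commit properties back through `ancestry`; the attribute names on ancestry entries are assumptions:

```python
# Sketch: list commit messages together with their attached properties.
# The `message` and `metadata` attribute names are assumptions.
for snapshot in repo.ancestry(branch="main"):
    print(snapshot.message, snapshot.metadata)
```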
- New `chunk_coordinates` function to list all initialized chunks in an array.
- It's now possible to delete tags. New tags with the same name won't be allowed, in order to preserve the immutability of snapshots pointed to by a tag.
- Safety checks on distributed writes via opt-in pickling of the store.
- More safety around snapshot timestamps, blocking commits if there is too much clock drift.
- Don't allow creating repositories in dirty prefixes.
- Experimental support for the Tigris object store: it currently requires the bucket to be restricted to a single region to obtain the Icechunk consistency guarantees.
- This version is the first candidate for a stable on-disk format. At the moment, we are not planning to change the on-disk format prior to releasing v1, but we reserve the right to do so.
- Users must now opt in to pickling and unpickling of `Session` and `IcechunkStore` using the `Session.allow_pickling` context manager (see the sketch below).
- `to_icechunk` now accepts a `Session` instead of an `IcechunkStore`.
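A minimal sketch of the opt-in, assuming a writable session obtained from the repository:

```python
import pickle

session = repo.writable_session("main")

# Sketch: pickling a Session or its store now fails unless explicitly allowed.
with session.allow_pickling():
    payload = pickle.dumps(session.store)
    restored_store = pickle.loads(payload)
```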
- Preload small manifests that look like coordinate arrays on session creation.
- Faster `ancestry` in an async context via `async_ancestry` (see the sketch below).
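A sketch of the async variant; that `async_ancestry` lives on the repository and yields snapshots asynchronously is an assumption:

```python
import asyncio

# Sketch: iterate the version history from async code without blocking.
async def list_history(repo):
    async for snapshot in repo.async_ancestry(branch="main"):
        print(snapshot.id)

asyncio.run(list_history(repo))
```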
- Bad manifest split in unmodified arrays
- Documentation was updated to the latest API.
- Add a constructor to `RepositoryConfig`.
- Each array now has its own chunk manifest, speeding up reads for large repositories.
- The snapshot now keeps track of the chunk-space bounding box for each manifest.
- Configuration settings can now be overridden on a field-by-field basis. Example:

```python
config = icechunk.RepositoryConfig(inline_chunk_threshold_byte=0)
storage = ...
repo = icechunk.Repository.open(
    storage=storage,
    config=config,
)
```

This will use 0 for `inline_chunk_threshold_byte`, but all other configuration fields will come from the repository's persistent config. If persistent config is not set, configuration defaults will take their place.
- In preparation for on-disk format stability, all metadata files include extensive format information, including a set of magic bytes, file type, spec version, compression format, etc.
- Zarr's `getsize` is now orders of magnitude faster because it's implemented natively, with no need for any I/O.
- We added several performance benchmarks to the repository.
- Better configuration for metadata asset caches, now based on their sizes instead of their number
- `from icechunk import *` no longer fails.
- New `Repository.reopen` function to open a repo again, overwriting its configuration and/or virtual chunk container credentials.
- Configuration classes are now mutable and easier to use:

```python
storage = ...
config = icechunk.RepositoryConfig.default()
config.storage.concurrency.ideal_concurrent_request_size = 1_000_000
repo = icechunk.Repository.open(
    storage=storage,
    config=config,
)
```
- The `ancestry` function can now receive a branch/tag name or a snapshot id.
- `set_virtual_ref` can now validate that the virtual chunk container exists.
- Better concurrent download of big chunks, both native and virtual
- We no longer allow the `main` branch to be deleted.
- Adds support for Azure Blob Storage
- Manifests now load faster, due to an improved serialization format
- The store now releases the GIL appropriately in multithreaded contexts
- Large chunks are fetched concurrently
- `IcechunkStore.list_dir` is now significantly faster.
- Support for Zarr 3.0 and xarray 2025.1.1.
- Transaction logs and snapshot files are compressed
- Manifests are compressed using Zstd
- Large manifests are fetched using multiple parallel requests
- Functions to fetch and store repository config
- Faster `list_dir` and `delete_dir` implementations in the Zarr store.
- Credentials from environment in GCS
- New Python API using `Repository`, `Session` and `Store` as separate entities (see the sketch below).
- New Python API for configuring and opening Repositories.
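A sketch of how the three layers fit together; the method names (`writable_session`, `session.store`) are assumptions based on later entries in this changelog:

```python
import zarr
import icechunk

# Sketch: a Repository manages versions, a Session scopes one unit of work,
# and the Session's Store is what zarr talks to. Method names are assumptions.
storage = ...  # object store configuration
repo = icechunk.Repository.open(storage=storage)

session = repo.writable_session("main")
root = zarr.group(store=session.store)
root.create_array("temperature", shape=(100, 100), dtype="f4")

session.commit("add temperature array")
```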
- Added support for object store credential refresh
- Persistent repository config
- Commit conflict resolution and rebase support
- Added experimental support for Google Cloud Storage
- Add optional checksums for virtual chunks, either using Etag or last-updated-at
- Support for multiple virtual chunk locations using the virtual chunk containers concept
- Added function `all_virtual_chunk_locations` to `Session` to retrieve all locations where the repo has data (see the sketch below).
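A small sketch of using it from a session; that it returns a plain iterable of location strings is an assumption:

```python
# Sketch: list every external location referenced by virtual chunks.
for location in session.all_virtual_chunk_locations():
    print(location)
```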
- Refs were stored in the wrong prefix
- Allow overwriting existing groups and arrays in Icechunk stores
- Fixed an error during commits where chunks would get mixed between different arrays
- Sync with zarr 3.0b2. The biggest change is that the `mode` param on `IcechunkStore` methods has been simplified to `read_only`.
- Changed `IcechunkStore::distributed_commit` to `IcechunkStore::merge`, which no longer commits but attempts to merge the changes from another store back into the current store.
- Added a new `icechunk.dask.store_dask` method to write a dask array to an icechunk store. This is required for safely writing dask arrays to an icechunk store.
- Added a new `icechunk.xarray.to_icechunk` method to write an xarray dataset to an icechunk store. This is required for safely writing xarray datasets with dask arrays to an icechunk store in a distributed or multi-processing context (see the sketch below).
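A hedged sketch of writing a dask-backed dataset with `to_icechunk`; the positional argument order is an assumption, and `store` is assumed to be a writable `IcechunkStore` (dask must be installed for `.chunk()`):

```python
import numpy as np
import xarray as xr
from icechunk.xarray import to_icechunk

# Sketch: a small dask-backed dataset written to an Icechunk store.
# Argument order is an assumption; `store` is assumed to be a writable
# IcechunkStore obtained from the repository.
ds = xr.Dataset({"t": (("x",), np.arange(10))}).chunk({"x": 5})
to_icechunk(ds, store)
```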
- The `StorageConfig` methods have been correctly typed.
- `IcechunkStore` instances are now set to `read_only` by default after pickling.
- When checking out a snapshot or tag, the `IcechunkStore` will be set to read-only. If you want to write to the store, you must call `IcechunkStore::set_writable()`.
- An error will now be raised if you try to check out a snapshot that does not exist.
- Added `IcechunkStore::reset_branch` and `IcechunkStore::async_reset_branch` methods to point the head of the current branch to another snapshot, changing the history of the branch.
- Zarr metadata will now only include the attributes key when the attributes dictionary of the node is not empty, aligning Icechunk with the python-zarr implementation.
- Initial release