Migrate AL concepts docs to icechunk #762

Draft · wants to merge 5 commits into `main`
33 changes: 33 additions & 0 deletions docs/docs/concepts/best-practices.md
@@ -0,0 +1,33 @@
# Best Practices

## Scope of Repos and Commits

### How much data should I store in one repo?

The root of a Repository is a Zarr group, and you can put whatever you want in it.
At one extreme, you could put all of your organization's data in one single Repository.
At the other extreme, you could have hundreds of Repositories, each containing just one single array.
How to decide?

The important principle is that all data within a Repository share a single commit history.
Transactions are scoped to the Repository level. So you should
**keep together in one Repository data that need to be updated in a coordinated way
and tracked with a single version history**.
Beyond this, we recommend that Repositories should be as small as possible, for simplicity.
Data that are not related or interdependent should be kept in separate Repositories.

For example, when ingesting data from a weather prediction model, all the variables from the model
(e.g. temperature, humidity, wind speed) are generated by the same underlying process and
are physically related and interdependent. So all of this data should be versioned together in a single Repository.
However, if ingesting data from multiple different weather prediction models, we recommend each model
be kept in a separate Repository.

### How big should commits be?

Commits should represent a significant update to the state of the Repository which occurs as part of a single "job."
For example, if you maintain a nightly cron job which ingests data into an Icechunk Repository, the entire job should
occur within a single session and make a single commit. Avoid making many small commits within a single job.
Likewise when updating an array in a planned, coordinated way, all of the updates should be part of a single
session and make a single commit.

Distinct jobs managed by separate individuals should not share a session and should make separate commits.
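
As a rough sketch of this pattern (assuming the icechunk Python API, with `s3_storage`, `Repository.open`, `writable_session`, and `Session.commit`; adapt the names and storage configuration to your setup), a nightly ingest job would use one session and make one commit:

```python
import icechunk
import zarr

# Open the existing repository (the storage configuration here is illustrative).
storage = icechunk.s3_storage(bucket="my-bucket", prefix="weather-model")
repo = icechunk.Repository.open(storage)

# One session for the entire job.
session = repo.writable_session("main")
group = zarr.open_group(session.store)

# ... ingest tonight's data into the arrays in `group` ...

# A single commit records the whole job as one atomic update.
snapshot_id = session.commit("nightly ingest 2025-01-01")
```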
144 changes: 144 additions & 0 deletions docs/docs/concepts/concurrency.md
@@ -0,0 +1,144 @@
# Concurrency Modes

Concurrency means multiple operations performing I/O on Icechunk at the same time.
Concurrency is desirable for large-scale data processing, since parallel reading and writing
greatly increases the throughput of a workflow. Concurrency can take many forms:

- Asynchronous operations
- Multithreading (multiple threads within a single machine)
- Multiprocessing (multiple processes within a single machine)
- Distributed processing (multiple machines working together)

Concurrent reading is never a problem with Icechunk and works well in any scenario.
Concurrent writing is more complicated.
Icechunk allows for two distinct approaches to concurrent writing to repos:

## Cooperative Mode

In **cooperative mode**, the concurrent writes occur as part of a single job in which the user
can plan how each worker will act. Specifically, the software doing the writing should take
care not to overwrite chunks written by other workers, e.g. by aligning writes with
chunk boundaries. These concurrent writes can happen within a single session
and be committed via a single commit. Cooperative mode is simpler and less expensive,
but requires more planning by the user.

Cooperative writes often occur in jobs managed by a workflow scheduling system like
Dask, Apache Airflow, Prefect, Apache Beam, etc. All of these systems use
directed acyclic graphs (DAGs) to represent computations. A DAG for a cooperative write to an
array with chunk size 10 would look something like this:

```mermaid
graph TD;
CO[check out repo];
CO-->|session ID| A[write region 0:10]
CO-->|session ID| B[write region 10:20]
CO-->|session ID| C[write region 20:30]
CO-->|session ID| D[write region 30:40]
A-->CM
B-->CM
C-->CM
D-->CM
CM([commit session])
```

(Note that each worker's write evenly aligns with the chunk boundaries.)

For frameworks that allow serialization of Python objects and passing them directly between tasks
(e.g. Dask, Prefect), the `arraylake.Repo` object can be passed directly between tasks to maintain
a single session.

Cooperative mode is the recommended way to make large updates to arrays in Icechunk.
Non-cooperative mode should only be used when cooperative mode is not feasible.
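
Below is a minimal sketch of cooperative mode in a single process, using threads in place of a full workflow scheduler. It assumes the icechunk Python API (`local_filesystem_storage`, `Repository.open_or_create`, `writable_session`) and that a single session can be shared safely across threads; the array name and sizes are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import icechunk
import numpy as np
import zarr

storage = icechunk.local_filesystem_storage("/tmp/coop-demo")
repo = icechunk.Repository.open_or_create(storage)

# A single session is shared by all workers.
session = repo.writable_session("main")
array = zarr.create_array(
    store=session.store, name="data", shape=(40,), chunks=(10,), dtype="f8"
)

def write_region(start: int, stop: int) -> None:
    # Each region is aligned with the chunk size of 10, so no two
    # workers ever touch the same chunk.
    array[start:stop] = np.arange(start, stop, dtype="f8")

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(write_region, s, s + 10) for s in range(0, 40, 10)]
    for f in futures:
        f.result()  # surface any worker errors before committing

# One commit for the whole cooperative job.
session.commit("cooperative write of regions 0:40")
```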

## Non-cooperative mode

In **non-cooperative mode**, the concurrent writes come from distinct, uncoordinated processes
which may potentially conflict with each other. In this case, it may be better to have one process
fail rather than make an inconsistent change to the repository.

## Examples
### Example 1: No Conflicts

As an example, consider two sessions writing to the same array with chunk size 10.
In the first example, the two processes write to different regions of the array (`0:20` and `20:30`)
in a way that evenly aligns with the chunk boundaries:

```mermaid
sequenceDiagram
actor S1 as Session 1 (`0:20`)
participant A as Icechunk
actor S2 as Session 2 (`20:30`)
Note over A: main: C1
S1->>A: check out main
A->>S1: commit C1
S2->>A: check out main
A->>S2: commit C1
S1->>A: write array region 0:20
S2->>A: write array region 20:30
S1->>A: commit session
Note over A: main: C2
A->>S1: success: new commit C2
S2->>A: commit session
A->>S2: sorry, your branch is out of date
S2->>A: can I fast forward?
A->>S2: yes
S2->>A: fast forward and commit session
Note over A: main: C3
A->>S2: success: new commit C3
```

In this case, the optimistic concurrency system is able to resolve the situation and both sessions
can commit successfully.
However, this is significantly more complex and expensive in terms of communication with Icechunk
than using cooperative mode. This approach would not scale well to hundreds of simultaneous commits,
since the fast-forward block would have to loop over and over until finding a consistent state it can commit.
_If possible, it would have been better to use cooperative mode for this update,_
since the writes were aligned with chunk boundaries.
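
In code, Session 2's fast-forward path might look like the following sketch (the `ConflictError` exception and `Session.rebase` with a `ConflictDetector` are assumed from the icechunk Python API; treat the exact names as illustrative):

```python
import icechunk

# `session` has already written array region 20:30.
try:
    session.commit("write array region 20:30")
except icechunk.ConflictError:
    # The branch moved since we checked it out. Rebasing replays our
    # changes on top of the new tip; ConflictDetector succeeds only if
    # no chunk or metadata was modified by both sessions.
    session.rebase(icechunk.ConflictDetector())
    session.commit("write array region 20:30")
```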

### Example 2: Chunk Conflicts

In a second example, let's consider what happens when the two sessions write to _the same chunk_.
One session writes to `0:20` while the other writes to `15:30`.
They overlap writes on the chunk spanning the range `10:20`.
In this case, only one commit will succeed, and the other will raise an error.
_This is good!_ It means Icechunk helped us avoid a potentially inconsistent update to the array
that would have produced an incorrect end state.

!!! tip
    This sort of consistency problem is not possible to detect when using Zarr directly on object storage.

It is now up to the user to decide what to do next.
In the example below, the user's code implements a manual retry by checking out the repo in its
latest state and re-applying the update.

```mermaid
sequenceDiagram
actor S1 as Session 1 (`0:20`)
participant A as Icechunk
actor S2 as Session 2 (`15:30`)
Note over A: main: C1
S1->>A: check out main
A->>S1: commit C1
S2->>A: check out main
A->>S2: commit C1
S1->>A: write array region 0:20
S2->>A: write array region 15:30
S1->>A: commit session
Note over A: main: C2
A->>S1: success: new commit C2
S2->>A: commit session
A->>S2: sorry, your branch is out of date
S2->>A: can I fast forward?
A->>S2: no: conflict detected
Note over S2: At this point, Icechunk raises an error.
Note over S2: User can implement a manual retry.
S2->>A: check out main
A->>S2: commit C2
Note over S2: Now Session 2 can see Session 1 changes.
S2->>A: rewrite array region 15:30
S2->>A: commit session
Note over A: main: C3
A->>S2: success: new commit C3
```

It would not have been possible to perform these two updates within a single session,
since the updates from one session would overwrite the updates from the other in an
unpredictable way.
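
A manual retry along the lines of this diagram might look like the sketch below (`writable_session` and `ConflictError` are assumed from the icechunk Python API; `update_fn` is a hypothetical callback that re-applies the application's update, e.g. rewriting region `15:30`):

```python
import icechunk

def write_with_retry(repo, update_fn, message, max_attempts=5):
    """Re-apply `update_fn` against a fresh session until the commit lands."""
    for _ in range(max_attempts):
        session = repo.writable_session("main")  # check out the latest main
        update_fn(session)                       # e.g. rewrite array region 15:30
        try:
            return session.commit(message)
        except icechunk.ConflictError:
            continue  # another session committed first; retry from the new tip
    raise RuntimeError(f"could not commit after {max_attempts} attempts")
```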
117 changes: 117 additions & 0 deletions docs/docs/concepts/data-model.md
@@ -0,0 +1,117 @@
# Data Model

Icechunk is designed around the Zarr data model, widely used in scientific computing, data science, and AI / ML.
The Zarr high-level data model is effectively the same as HDF5.


## Zarr Data Model
### Groups, Arrays, and Chunks

The core data structure in this data model is the **array**.
Arrays have two fundamental properties:

- **shape** - a tuple of integers which specify the dimensions of each axis of the array. A 10 x 10 square array would have shape (10, 10)
- **data type** - a specification of what type of data is found in each element, e.g. integer, float, etc. Different data types have different precision (e.g. 16-bit integer, 64-bit float, etc.)

In Zarr / Icechunk, arrays are split into **chunks**.
A chunk is the minimum unit of data that must be read / written from storage, so choices about chunking have strong implications for performance.
Zarr leaves chunking completely up to the user.
Chunk shape should be chosen based on the anticipated data access pattern for each array.
An Icechunk array is not bounded by an individual file and is effectively unlimited in size.

For further organization of data, Icechunk supports **groups** within a single repo.
Groups are like folders which contain multiple arrays and/or other groups.
Groups enable data to be organized into hierarchical trees.
A common usage pattern is to store multiple arrays in a group representing a NetCDF-style dataset.

Arbitrary JSON-style key-value metadata can be attached to both arrays and groups.
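
For example, a small NetCDF-style dataset maps onto this model as follows (a plain zarr-python sketch using an in-memory store; all names are illustrative):

```python
import numpy as np
import zarr

# A group is a folder-like container for arrays and other groups.
root = zarr.open_group(zarr.storage.MemoryStore(), mode="w")
model = root.create_group("mygroup")

# An array has a shape, a data type, and a chunk shape.
temp = model.create_array(
    "temperature", shape=(365, 180, 360), chunks=(1, 180, 360), dtype="f4"
)
temp[0] = np.zeros((180, 360), dtype="f4")

# Arbitrary JSON-style metadata attaches to groups and arrays alike.
model.attrs["source"] = "example weather model"
temp.attrs["units"] = "kelvin"
```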

### How Zarr Stores Data

[Zarr V3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) works by storing both metadata and chunk data into a physical storage device
according to a specified system of "keys".
For example, a Zarr array called `myarray`, within a group called `mygroup`, might generate
the following keys in the storage device:

```
# metadata for mygroup
mygroup/zarr.json
# metadata for myarray
mygroup/myarray/zarr.json
# chunks of data
mygroup/myarray/c/0/0
mygroup/myarray/c/0/1
```

In standard Zarr usage, these keys are filenames in a filesystem or object keys in an object storage system.

When writing data, a Zarr implementation will create these keys and populate them with data.

!!! tip
An important point is that **the state of a Zarr dataset is spread over many different keys**, both metadata and chunks.

## Icechunk Data Model

### Snapshots

Every update to an Icechunk store creates a new **snapshot** with a unique ID.
Icechunk users must organize their updates into groups of related operations called **transactions**.
For example, appending a new time slice to multiple arrays should be done as a single transaction, comprising the following steps:

1. Update the array metadata to resize the array to accommodate the new elements.
2. Write new chunks for each array in the group.

While the transaction is in progress, none of these changes will be visible to other users of the store.
Once the transaction is committed using `Repository.commit`, a new snapshot is generated.
Readers can only see and use committed snapshots.
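
A sketch of such a transaction, continuing from an open repository `repo` (icechunk API names assumed; the resize-then-write pattern is standard Zarr):

```python
import numpy as np
import zarr

# One transaction: resize, then write, then commit.
session = repo.writable_session("main")
group = zarr.open_group(session.store)
temp = group["temperature"]

# Step 1: resize the array to accommodate the new time slice.
nt = temp.shape[0]
temp.resize((nt + 1,) + temp.shape[1:])

# Step 2: write the new chunks. None of this is visible to readers yet.
temp[nt] = np.zeros(temp.shape[1:], dtype=temp.dtype)

# Committing produces a new snapshot with a unique ID.
snapshot_id = session.commit("append one time slice")
```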

### Branches and Tags

Additionally, snapshots occur in a specific linear (i.e. serializable) order within a **branch**.
A branch is a mutable reference to a snapshot--a pointer that maps the branch name to a snapshot ID.
The default branch is `main`.
Every commit to the main branch updates this reference.
Icechunk's design protects against the race condition in which two uncoordinated sessions attempt to update the branch at the same time; only one can succeed.

Icechunk also defines **tags**--_immutable_ references to a snapshot.
Tags are appropriate for publishing specific releases of a repository or for any application which requires a persistent, immutable identifier for the store state.
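
For instance (assuming the icechunk `create_tag` and `readonly_session` methods):

```python
# Branches are mutable pointers: each commit moves them forward.
session = repo.writable_session("main")
# ... make changes ...
snapshot_id = session.commit("prepare release")

# Tags are immutable pointers to one specific snapshot.
repo.create_tag("v1.0.0", snapshot_id=snapshot_id)

# Readers can open either reference; the tag will never change.
latest = repo.readonly_session(branch="main")
release = repo.readonly_session(tag="v1.0.0")
```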

### Chunk References

Chunk references are "pointers" to chunks that exist in other files--HDF5, NetCDF, GRIB, etc.
Icechunk can store these references alongside native Zarr chunks as "virtual datasets".
You can then update these virtual datasets incrementally (overwrite chunks, change metadata, etc.) without touching the underlying files.
Chunk references are stored in "chunk manifest" files.
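
As a heavily simplified sketch, a chunk reference points a Zarr chunk key at a byte range inside an existing file. The `set_virtual_ref` call and its signature below are assumptions based on icechunk's virtual-dataset support; in practice, tools like VirtualiZarr generate these references for you:

```python
session = repo.writable_session("main")

# Point chunk (0, 0) of myarray at a byte range inside an existing
# NetCDF file, instead of copying the bytes into the repository.
# (Hypothetical call; check your icechunk version for the actual API.)
session.store.set_virtual_ref(
    "mygroup/myarray/c/0/0",
    location="s3://my-bucket/legacy/data.nc",
    offset=8192,
    length=4_194_304,
)

session.commit("add virtual references to legacy NetCDF data")
```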

### How Does It Work?

!!! note
    For a more detailed explanation, have a look at the [Icechunk spec](./spec.md).

Zarr itself works by storing both metadata and chunk data into an abstract store according to a specified system of "keys".
For example, a 2D Zarr array called `myarray`, within a group called `mygroup`, would generate the following keys:

```
mygroup/zarr.json
mygroup/myarray/zarr.json
mygroup/myarray/c/0/0
mygroup/myarray/c/0/1
```

In standard Zarr stores, these keys map directly to filenames in a filesystem or object keys in an object storage system.
When writing data, a Zarr implementation will create these keys and populate them with data. When modifying existing arrays or groups, a Zarr implementation will potentially overwrite existing keys with new data.

This is generally not a problem, as long as there is only one person or process coordinating access to the data.
However, when multiple uncoordinated readers and writers attempt to access the same Zarr data at the same time, [various consistency problems](./concepts/version-control-system#consistency-problems-with-zarr) emerge.
These consistency problems can occur in both file storage and object storage; they are particularly severe in a cloud setting where Zarr is being used as an active store for data that are frequently changed while also being read.

With Icechunk, we keep the same core Zarr data model, but add a layer of indirection between the Zarr keys and the on-disk storage.
The Icechunk library translates between the Zarr keys and the actual on-disk data given the particular context of the user's state.
Icechunk defines a series of interconnected metadata and data files that together enable efficient isolated reading and writing of metadata and chunks.
Once written, these files are immutable.
Icechunk keeps track of every single chunk explicitly in a "chunk manifest".

```mermaid
flowchart TD
zarr-python[Zarr Library] <-- key / value--> icechunk[Icechunk Library]
icechunk <-- data / metadata files --> storage[(Object Storage)]
```