Start of executable docs #777

Open
wants to merge 21 commits into
base: main
2 changes: 2 additions & 0 deletions docs/.gitignore
@@ -76,3 +76,5 @@ ENV/

# MkDocs documentation
site*/

icechunk-local
7 changes: 6 additions & 1 deletion docs/.readthedocs.yaml
@@ -4,6 +4,7 @@ build:
os: ubuntu-24.04
tools:
python: "3"
rust: "latest"

jobs:
post_create_environment:
@@ -14,7 +15,11 @@ build:
- poetry config virtualenvs.create false
post_install:
# Install deps and build using poetry
- . "$READTHEDOCS_VIRTUALENV_PATH/bin/activate" && cd docs && poetry install
- . "$READTHEDOCS_VIRTUALENV_PATH/bin/activate" && cd docs && poetry install && cd ../icechunk-python && maturin develop && cd ../docs
@dcherian (Contributor), Feb 25, 2025

we can directly install from github with pip too if we know the commit ID. example:

pip install git+https://github.com/earth-mover/icechunk.git@COMMIT#subdirectory=icechunk-python

This will require maturin in the env, which we seem to have.
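
A minimal sketch of how the requirement string in the comment above is put together, in case it is useful to script the install. The helper name `pip_git_requirement` is hypothetical, and `COMMIT` remains a placeholder rather than a real commit ID.

```python
def pip_git_requirement(
    commit: str,
    repo_url: str = "https://github.com/earth-mover/icechunk.git",
    subdirectory: str = "icechunk-python",
) -> str:
    """Build a pip VCS requirement pinned to one commit (hypothetical helper)."""
    if not commit:
        raise ValueError("a commit ID is required to pin the install")
    # pip's VCS syntax: git+<url>@<ref>#subdirectory=<path inside the repo>
    return f"git+{repo_url}@{commit}#subdirectory={subdirectory}"


# "COMMIT" is the same placeholder used in the review comment.
requirement = pip_git_requirement("COMMIT")
print(f"pip install {requirement}")
```

Pinning to a commit keeps the docs build reproducible, at the cost of updating the ID whenever the docs need a newer build.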

# python:
# install:
# - method: pip
# path: icechunk-python

mkdocs:
configuration: docs/mkdocs.yml
24 changes: 12 additions & 12 deletions docs/docs/icechunk-python/parallel.md
@@ -16,7 +16,7 @@ including those executed remotely in a multi-processing or any other remote exec
Here is how you can execute such writes with Icechunk, illustrated with a `ThreadPoolExecutor`.
First read some example data, and create an Icechunk Repository.

```python
```python exec="on" session="parallel" source="material-block"
import xarray as xr
import tempfile
from icechunk import Repository, local_filesystem_storage
@@ -29,25 +29,25 @@ session = repo.writable_session("main")
We will orchestrate so that each task writes one timestep.
This is an arbitrary choice but determines what we set for the Zarr chunk size.

```python
```python exec="on" session="parallel" source="material-block"
# build a tuple (not a set) so the chunk shape follows the dimension order
chunks = tuple(1 if dim == "time" else ds.sizes[dim] for dim in ds.Tair.dims)
```
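
As a rough illustration of this chunking choice, the sketch below uses made-up dimension names and sizes in place of the real dataset (the `sizes` and `dims` values are assumptions for illustration only). Each chunk holds a single timestep and spans the full extent of every other dimension, so each write task touches exactly one chunk.

```python
import math

# Hypothetical dimension sizes standing in for ds.sizes; not taken from the real dataset.
sizes = {"time": 4, "lat": 25, "lon": 53}
dims = ("time", "lat", "lon")

# One chunk per timestep; a tuple keeps the chunk shape in dimension order.
chunks = tuple(1 if dim == "time" else sizes[dim] for dim in dims)

# Number of chunks along each dimension, and in total.
chunks_per_dim = tuple(math.ceil(sizes[dim] / c) for dim, c in zip(dims, chunks))
n_chunks = math.prod(chunks_per_dim)

print(chunks, chunks_per_dim, n_chunks)
```

With this layout the number of chunks equals the number of timesteps, which is what lets the tasks write independently of each other.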

Initialize the dataset using [`Dataset.to_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html)
with `compute=False`. This will NOT write any chunked array data, but will write all array metadata and any
in-memory arrays (only `time` in this case).

```python
```python exec="on" session="parallel" source="material-block"
ds.to_zarr(session.store, compute=False, encoding={"Tair": {"chunks": chunks}}, mode="w")
# this commit is optional, but may be useful in your workflow
session.commit("initialize store")
print(session.commit("initialize store"))
```

## Multi-threading

First define a function that constitutes one "write task".

```python
```python exec="on" session="parallel" source="material-block"
from icechunk import Session

def write_timestamp(*, itime: int, session: Session) -> None:
@@ -59,7 +59,7 @@ def write_timestamp(*, itime: int, session: Session) -> None:

Now execute the writes.

```python
```python exec="on" session="parallel" source="material-block"
from concurrent.futures import ThreadPoolExecutor, wait
from icechunk.distributed import merge_sessions

@@ -69,12 +69,12 @@ with ThreadPoolExecutor() as executor:
    futures = [executor.submit(write_timestamp, itime=i, session=session) for i in range(ds.sizes["time"])]
    wait(futures)

session.commit("finished writes")
print(session.commit("finished writes"))
```

Verify that the writes worked as expected:

```python
```python exec="on" session="parallel" source="material-block"
ondisk = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)
xr.testing.assert_identical(ds, ondisk)
```
@@ -96,7 +96,7 @@ There are three key points to keep in mind:

First we modify `write_timestamp` to return the `Session`:

```python
```python exec="on" session="parallel" source="material-block"
from icechunk import Session

def write_timestamp(*, itime: int, session: Session) -> Session:
@@ -110,7 +110,7 @@ def write_timestamp(*, itime: int, session: Session) -> Session:
Now we issue write tasks within the [`session.allow_pickling()`](./reference.md#icechunk.Session.allow_pickling) context, gather the Sessions from individual tasks,
merge them, and make a successful commit.

```python
```python exec="on" session="parallel" source="material-block"
from concurrent.futures import ProcessPoolExecutor
from icechunk.distributed import merge_sessions

@@ -128,12 +128,12 @@ with ProcessPoolExecutor() as executor:

# manually merge the remote sessions into the local session
session = merge_sessions(session, *sessions)
session.commit("finished writes")
print(session.commit("finished writes"))
```
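
To see why the sessions have to be returned and merged, here is a toy sketch of the same pattern using only the standard library. It is not Icechunk's implementation: the `state` dictionaries, `write_task`, and `merge_states` are hypothetical stand-ins, and a pickle round-trip is used to imitate sending state to another process.

```python
import pickle


def write_task(payload: bytes, itime: int) -> bytes:
    """Imitate a remote worker: unpickle its own copy, record one write, send it back."""
    state = pickle.loads(payload)
    state["changes"][itime] = f"chunk-{itime}"
    return pickle.dumps(state)


def merge_states(base: dict, *others: dict) -> dict:
    """Combine the changes recorded by each worker copy (hypothetical helper)."""
    merged = {"changes": dict(base["changes"])}
    for other in others:
        merged["changes"].update(other["changes"])
    return merged


local = {"changes": {}}
payload = pickle.dumps(local)

# Each task works on an independent copy, so the local state sees none of the writes.
returned = [pickle.loads(write_task(payload, i)) for i in range(3)]
assert local["changes"] == {}

# Only after merging the returned copies does the local state know about every write.
local = merge_states(local, *returned)
print(local["changes"])
```

The point of the sketch is only that changes made to a pickled copy never flow back on their own, which is why the tasks return their copy and the caller merges before committing.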

Verify that the writes worked as expected:

```python
```python exec="on" session="parallel" source="material-block"
ondisk = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)
xr.testing.assert_identical(ds, ondisk)
```
42 changes: 24 additions & 18 deletions docs/docs/icechunk-python/quickstart.md
@@ -31,6 +31,16 @@ To get started, let's create a new Icechunk repository.
We recommend creating your repo on a cloud storage platform to get the most out of Icechunk's cloud-native design.
However, you can also create a repo on your local filesystem.

```python exec="on"
# remove local path if it already exists to prevent errors
# this is hidden in the rendered docs
from shutil import rmtree
try:
    rmtree("./icechunk-local")
except FileNotFoundError:
    pass
```

=== "S3 Storage"

```python
@@ -57,7 +67,7 @@ However, you can also create a repo on your local filesystem.

=== "Local Storage"

```python
```python exec="on" session="quickstart" source="above"
import icechunk
storage = icechunk.local_filesystem_storage("./icechunk-local")
Contributor

How about

storage = icechunk.local_filesystem_storage(tempfile.mkdtemp())

so we don't have to clear the local folder above?
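
A small sketch of why this suggestion removes the need for the cleanup step, using only the standard library. The helper name `fresh_repo_path` and the `icechunk-` prefix are hypothetical. In the docs, the returned path would be what gets passed to `icechunk.local_filesystem_storage`.

```python
import tempfile
from pathlib import Path


def fresh_repo_path(prefix: str = "icechunk-") -> str:
    """Return a new, empty, uniquely named directory (hypothetical helper)."""
    # mkdtemp creates the directory itself and never reuses an existing one,
    # so a repeated docs build cannot collide with a previous run.
    return tempfile.mkdtemp(prefix=prefix)


path_a = fresh_repo_path()
path_b = fresh_repo_path()

# Two calls give two distinct, empty directories.
print(path_a != path_b, Path(path_a).is_dir(), list(Path(path_a).iterdir()))
```

One trade-off to keep in mind: `mkdtemp` does not delete the directory afterwards, so the temporary repositories remain on disk until something removes them.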

repo = icechunk.Repository.create(storage)
@@ -68,13 +78,13 @@ However, you can also create a repo on your local filesystem.
Once the repository is created, we can use `Session`s to read and write data. Since there is no data in the repository yet,
let's create a writable session on the default `main` branch.

```python
```python exec="on" session="quickstart" source="material-block"
session = repo.writable_session("main")
```

Now that we have a session, we can access the `IcechunkStore` from it to interact with the underlying data using `zarr`:

```python
```python exec="on" session="quickstart" source="material-block"
store = session.store # A zarr store
```

@@ -83,22 +93,23 @@ store = session.store # A zarr store
We can now use our Icechunk `store` with Zarr.
Let's first create a group and an array within it.

```python
```python exec="on" session="quickstart" source="material-block"
import zarr
group = zarr.group(store)
array = group.create("my_array", shape=10, dtype='int32', chunks=(5,))
```

Now let's write some data

```python
```python exec="on" session="quickstart" source="material-block"
array[:] = 1
```

Now let's commit our update using the session

```python
session.commit("first commit")
```python exec="on" session="quickstart" source="material-block"
snapshot_id_1 = session.commit("first commit")
print(snapshot_id_1)
```

🎉 Congratulations! You just made your first Icechunk snapshot.
@@ -111,7 +122,7 @@ session.commit("first commit")

At this point, we have already committed using our session, so we need to get a new session and store to make more changes.

```python
```python exec="on" session="quickstart" source="material-block"
session_2 = repo.writable_session("main")
store_2 = session_2.store
group = zarr.open_group(store_2)
@@ -120,38 +131,33 @@ array = group["my_array"]

Let's now put some new data into our array, overwriting the first five elements.

```python
```python exec="on" session="quickstart" source="material-block"
array[:5] = 2
```

...and commit the changes

```python
```python exec="on" session="quickstart" source="material-block"
snapshot_id_2 = session_2.commit("overwrite some values")
```

## Explore version history

We can see the full version history of our repo:

```python
```python exec="on" session="quickstart" source="material-block"
hist = repo.ancestry(snapshot_id=snapshot_id_2)
for ancestor in hist:
    print(ancestor.id, ancestor.message, ancestor.written_at)

# Output:
# AHC3TSP5ERXKTM4FCB5G overwrite some values 2024-10-14 14:07:27.328429+00:00
# Q492CAPV7SF3T1BC0AA0 first commit 2024-10-14 14:07:26.152193+00:00
# T7SMDT9C5DZ8MP83DNM0 Repository initialized 2024-10-14 14:07:22.338529+00:00
```

...and we can go back in time to the earlier version.

```python
```python exec="on" session="quickstart" source="material-block"
# latest version
assert array[0] == 2
# check out earlier snapshot
earlier_session = repo.readonly_session(snapshot_id=hist[1].id)
earlier_session = repo.readonly_session(snapshot_id=snapshot_id_1)
store = earlier_session.store

# get the array
49 changes: 49 additions & 0 deletions docs/docs/icechunk-python/quickstart2.ipynb
@@ -0,0 +1,49 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Quickstart\n",
"\n",
"Icechunk is designed to be mostly in the background.\n",
"As a Python user, you'll mostly be interacting with Zarr.\n",
"If you're not familiar with Zarr, you may want to start with the [Zarr Tutorial](https://zarr.readthedocs.io/en/latest/tutorial.html)\n",
"\n",
"## Installation\n",
"\n",
"Icechunk can be installed using pip or conda:\n",
"\n",
"=== \"pip\"\n",
"\n",
" ```bash\n",
" python -m pip install icechunk\n",
" ```\n",
"\n",
"=== \"conda\"\n",
"\n",
" ```bash\n",
" conda install -c conda-forge icechunk\n",
" ```\n",
"\n",
"!!! note\n",
"\n",
" Icechunk is currently designed to support the [Zarr V3 Specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html).\n",
" Using it today requires installing Zarr Python 3.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
13 changes: 9 additions & 4 deletions docs/mkdocs.yml
@@ -7,6 +7,7 @@ repo_url: https://github.com/earth-mover/icechunk
repo_name: earth-mover/icechunk
copyright: Earthmover PBC # @see overrides/partials/footer.html

strict: true
site_dir: ./.site

extra_css:
@@ -136,10 +137,12 @@ plugins:
options:
docstring_style: numpy
paths: [../icechunk-python/python]

- mkdocs-jupyter:
include_source: True
#include:
# include: ["*.ipynb"] # Default: ["*.py", "*.ipynb"]
- markdown-exec
# - "icechunk-python/docs/**/*.ipynb"
# - "icechunk-python/docs/docs/icechunk-python/*.ipynb"
# - "icechunk-python/notebooks/*.ipynb"
# - "icechunk-python/examples/*.py"

@@ -165,6 +168,8 @@ markdown_extensions:
- pymdownx.emoji:
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- toc:
permalink: "#"

nav:
- Home: index.md
@@ -185,8 +190,8 @@ nav:
- Icechunk for Git Users: icechunk-python/cheatsheets/git-users.md
# - Examples:
# - ... | flat | icechunk-python/examples/*.py
# - Notebooks:
# - ... | flat | icechunk-python/notebooks/*.ipynb
# - Notebooks:
# - ... | flat | icechunk-python/notebooks/*.ipynb
- Icechunk Rust: icechunk-rust.md
- Contributing: contributing.md
- Sample Datasets: sample-datasets.md