Explore integration with Icechunk data engine #5

Open
aufdenkampe opened this issue Feb 17, 2025 · 2 comments
My vision for this package is that it would work seamlessly with a local and/or remote high-performance data catalog and store (i.e., a data engine). Presently, the Icechunk cloud-native transactional tensor storage engine is the most promising option, as it was recently open-sourced by Earthmover as the source code behind their Arraylake services.

An ideal workflow would be:

  • User requests a dataset from a well-known data repository for a specific area of interest.
    • These well-known data repos will be cataloged here in a yaml file, and optionally referenced with Kerchunk or VirtualiZarr.
  • This package first checks if the specific dataset has already been fetched and saved to a local Icechunk instance.
  • If not, it fetches the specific dataset from the source repository, saving it locally in its native format.
  • If the user expects to reuse the data, they can choose to convert the dataset into an analysis-ready, cloud-optimized (ARCO) Zarr v3 dataset within Icechunk.
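
The cache-first part of this workflow could be sketched roughly as follows. This is only an illustration under stated assumptions: `DatasetRequest`, `get_dataset`, and the `fetch` callback are hypothetical names, and a plain local directory stands in for the local Icechunk instance, so no Icechunk API is shown here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class DatasetRequest:
    """Hypothetical request: a catalog entry plus an area of interest."""
    repo: str                                  # key of a well-known repository in the catalog
    dataset: str                               # dataset name within that repository
    bbox: tuple[float, float, float, float]    # (west, south, east, north)

    def cache_key(self) -> str:
        # Build a relative path that identifies this exact request.
        west, south, east, north = self.bbox
        return f"{self.repo}/{self.dataset}/{west}_{south}_{east}_{north}"


def get_dataset(
    request: DatasetRequest,
    cache_root: Path,
    fetch: Callable[[DatasetRequest, Path], None],
) -> tuple[Path, bool]:
    """Return (local path, was_cached), fetching only on a cache miss."""
    target = cache_root / request.cache_key()
    # Treat a non-empty directory as "already fetched and saved locally".
    if target.is_dir() and any(target.iterdir()):
        return target, True
    target.mkdir(parents=True, exist_ok=True)
    # Cache miss: download from the source repository in its native format.
    fetch(request, target)
    return target, False
```

The `fetch` callback is where a real downloader for a given source repository would plug in. The optional conversion to an ARCO Zarr v3 dataset within Icechunk would be a separate step after a successful fetch.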
aufdenkampe added a commit that referenced this issue Feb 17, 2025
First step toward #5. It was interesting to see that Icechunk itself is rather lightweight, with minimal dependencies.
aufdenkampe commented Feb 17, 2025

UPDATE: After a little reading, it appears that Icechunk's Virtual Datasets are superior to Kerchunk references or VirtualiZarr datasets if a dataset will get updated, because Icechunk has "transactional updates, version controlled history, and faster access speeds."

For VirtualiZarr, "you should not change or add to any of the files comprising the store once created." However, "VirtualiZarr allows you to ingest data as virtual references and write those references into an Icechunk Store." So you can start with VirtualiZarr, then hand the references over to Icechunk before making updates.

aufdenkampe commented Feb 17, 2025

Icechunk presently supports only HDF5, NetCDF4, and NetCDF3 files for use in virtual references with VirtualiZarr.
https://icechunk.io/en/latest/icechunk-python/virtual/#virtual-reference-file-format-support

VirtualiZarr leverages Kerchunk, as an optional dependency, to create references to COG, FITS, and HDF4 file types, although COG and GRIB support are in the works.

Kerchunk has a wider array of supported file types, including GRIB2, Zarr2, etc.

All this might improve very soon, as the main issue was with the following, which just got merged!
