
VirtualiZarr #91

Open · oruebel opened this issue Aug 13, 2024 · 13 comments

@oruebel commented Aug 13, 2024

@rly @magland I just came across VirtualiZarr, which looks related. The effort seems to have grown out of kerchunk and aims to provide similar functionality in a more Zarr-native form.

The Pangeo / climate-modeling community also seems to be looking at this: https://github.com/esgf2-us/esgf-virtual-zarr-data-access

@magland (Collaborator) commented Aug 14, 2024

Thanks @oruebel, this does seem relevant. I started looking at those links but haven't really grasped how it works yet.

@oruebel (Author) commented Aug 20, 2024

Another possibly interesting resource: https://docs.hdfgroup.org/hdf5/v1_14/_h5_f__u_g.html#title11. In particular, the family, split, and multi drivers allow splitting data across multiple files under the hood.
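
For illustration, a minimal h5py sketch of the family driver (file names are hypothetical):

```python
import h5py

# The family driver splits one logical HDF5 file into fixed-size member
# files; the "%d" in the name is replaced with the member index.
with h5py.File("data_%d.h5", "w", driver="family", memb_size=100 * 1024**2) as f:
    f.create_dataset("x", shape=(10_000_000,), dtype="f8")
```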

@TomNicholas commented:

VirtualiZarr author here - I was about to raise this exact issue! (thanks for making me aware of this repo @alxmrs)

Naively I would have thought you all could simply write a virtualizarr reader, and hence add your data formats to the list of formats that can be represented as virtual zarr (zarr-developers/VirtualiZarr#218).

For on-disk storage of the byte-range references, VirtualiZarr can write out either to Kerchunk's format (so you can basically achieve what kerchunk does using VirtualiZarr's API instead of Kerchunk's) or to Icechunk, which is itself a valid Zarr store.
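
A hedged sketch of that workflow (assuming VirtualiZarr's reader can handle the file; the file names are hypothetical):

```python
import virtualizarr as vz

# Build a "virtual" xarray.Dataset holding byte-range references
# rather than loading the actual chunk data
vds = vz.open_virtual_dataset("example.nwb")  # hypothetical HDF5/NWB file

# Persist the references as Kerchunk JSON...
vds.virtualize.to_kerchunk("refs.json", format="json")
# ...or commit them to an Icechunk store (itself a valid Zarr store)
# vds.virtualize.to_icechunk(icechunk_store)
```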

I'm not sure about the rest of the scope of this project, but it's clearly so closely related to VirtualiZarr that we should all talk. We have bi-weekly meetings and you're more than welcome to join!

@TomNicholas commented:

From your readme:

> In the JSON format, the hierarchical group structure, attributes, and small datasets are stored in a JSON structure, with references to larger data chunks stored in external files (inspired by kerchunk). This format is human-readable and easily inspected and edited.

We would call this a "virtual dataset".

> The binary format is a .tar file that contains the JSON file (lindi.json) along with optional internal data chunks referenced by the JSON file, in addition to external chunks.

Icechunk would call internal chunks "native chunks" and the chunks in external files "virtual chunks".

> This format can be used to create a new NWB file that builds on an existing NWB file without duplicating it and adds new data objects (see below).

This is exactly how Icechunk works, except that Icechunk achieves full ACID transactions and version control entirely within the bounds of object storage, i.e. it's "serverless".

> By downloading a condensed JSON file, the entire group structure can be retrieved in a single request, facilitating efficient loading of NWB files.

Zarr-python has recent optimizations along these lines.
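
For reference, the pattern this refers to is consolidated metadata (a sketch using zarr-python 2's API; the store path is hypothetical):

```python
import zarr

store = zarr.DirectoryStore("example.zarr")  # or an fsspec-backed remote store
# Copy every group's and array's metadata into a single .zmetadata key...
zarr.consolidate_metadata(store)
# ...so opening the whole hierarchy later costs a single read
root = zarr.open_consolidated(store)
```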

> When comparing LINDI to Zarr it should be noted that LINDI files are in fact valid Zarr archives that can be accessed via the Zarr API.

So is Icechunk.

> HDF5 is not well-suited for cloud environments because accessing a remote HDF5 file often requires a large number of small requests to retrieve metadata before larger data chunks can be downloaded. LINDI addresses this by storing the entire group structure in a single JSON file, which can be downloaded in one request.

This is exactly what VirtualiZarr + Icechunk/Kerchunk together do.

> Finally, Zarr does not natively support certain features utilized by NWB, such as compound data types and references.

This seems to be the only blocker - we should find a way to work on this in Zarr together!!

cc @rabernat

@magland (Collaborator) commented Feb 5, 2025

Hi @TomNicholas , thanks for reaching out and for that helpful comparison!

One thing that's important to us is being able to create h5py-like objects seamlessly for the lindi/zarr files, so that they can be used with pynwb. I think this is related to what you said was the "only blocker".

Generally I'm happy to switch to using VirtualiZarr or icechunk if it can be made to work.

Just to give an idea of what we're using this for, here's an example view of an HDF5 file in the cloud:
https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/37ca1798-b14c-4224-b8f0-037e27725336/download/&dandisetId=000409&dandisetVersion=draft

And internally, for efficiency, it loads a pre-computed lindi file from
https://lindi.neurosift.org/dandi/dandisets/000409/assets/37ca1798-b14c-4224-b8f0-037e27725336/nwb.lindi.json

That json file can be opened using h5py.
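
Concretely, opening that JSON with lindi's h5py-compatible wrapper looks roughly like this (a sketch assuming lindi's public API):

```python
import lindi

# One request fetches the whole hierarchy; chunk reads are forwarded
# to the original HDF5 file on DANDI
f = lindi.LindiH5pyFile.from_lindi_file(
    "https://lindi.neurosift.org/dandi/dandisets/000409/assets/"
    "37ca1798-b14c-4224-b8f0-037e27725336/nwb.lindi.json"
)
print(list(f.keys()))  # behaves like an h5py.File
```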

So my question is, to what extent can this system specifically be replaced by VirtualiZarr + Icechunk/Kerchunk?

There is at least one other concern... URLs in these lindi files are often pointers to DANDI Archive files... and for embargoed datasets these require authentication. Lindi has functionality that can handle this.

@jhamman commented Feb 5, 2025

👋 - new to this thread. Zarr/Icechunk dev here.

> And internally, for efficiency, it loads a pre-computed lindi file from https://lindi.neurosift.org/dandi/dandisets/000409/assets/37ca1798-b14c-4224-b8f0-037e27725336/nwb.lindi.json
>
> That json file can be opened using h5py.

Interesting! Can you briefly explain how this works?

> There is at least one other concern... URLs in these lindi files are often pointers to DANDI Archive files... and for embargoed datasets these require authentication. Lindi has functionality that can handle this.

Icechunk supports multiple virtual chunk containers (data outside of the store itself). Each container points to a storage location (e.g., a bucket). Users can supply Icechunk with credentials for each virtual chunk container. So yes, you would be able to include data that requires specific credentials to access.
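
A rough sketch of that configuration using Icechunk's Python API (exact names and signatures vary across Icechunk versions; the bucket, paths, and credentials here are hypothetical):

```python
import icechunk

config = icechunk.RepositoryConfig.default()
# Register a container for virtual chunks that live in an external bucket
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "dandi", "s3://dandi-bucket/", icechunk.s3_store(region="us-east-1")
    )
)
# Supply per-container credentials when opening the repo
repo = icechunk.Repository.open(
    storage=icechunk.local_filesystem_storage("/tmp/repo"),
    config=config,
    virtual_chunk_credentials=icechunk.containers_credentials(
        dandi=icechunk.s3_credentials(
            access_key_id="...", secret_access_key="..."
        )
    ),
)
```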

@magland (Collaborator) commented Feb 5, 2025

> 👋 - new to this thread. Zarr/Icechunk dev here.
>
> > And internally, for efficiency, it loads a pre-computed lindi file from https://lindi.neurosift.org/dandi/dandisets/000409/assets/37ca1798-b14c-4224-b8f0-037e27725336/nwb.lindi.json
> > That json file can be opened using h5py.
>
> Interesting! Can you briefly explain how this works?

Hi @jhamman !

In a github action, all new NWB files on Dandi are pre-indexed using Lindi. That just means we lazy read the remote NWB/HDF5 file (no need to download the whole thing) and produce a lindi.json file. That .json file goes to a cloud bucket. Then when a user visits neurosift and supplies a DANDI url for an nwb file, we first check the cloud bucket to see if the lindi json file has been prepared. If so, we use that. Of course the actual chunks of data come from the original file... but all the zarr hierarchy and attributes comes from the json in one shot. LMK if you want more details.
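
The indexing step is essentially the following (a sketch using lindi's API; the asset URL is a placeholder):

```python
import lindi

# Lazy-read the remote HDF5 file: only metadata is fetched, not the chunks
f = lindi.LindiH5pyFile.from_hdf5_file(
    "https://api.dandiarchive.org/api/assets/<asset_id>/download/"  # placeholder URL
)
# Write the reference file; chunk references point back at the remote HDF5
f.write_lindi_file("nwb.lindi.json")
```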

> > There is at least one other concern... URLs in these lindi files are often pointers to DANDI Archive files... and for embargoed datasets these require authentication. Lindi has functionality that can handle this.
>
> Icechunk supports multiple virtual chunk containers (data outside of the store itself). Each container points to a storage location (e.g., a bucket). Users can supply Icechunk with credentials for each virtual chunk container. So yes, you would be able to include data that requires specific credentials to access.

Are you able to use a custom python function to do the auth? Sometimes it can get complicated, and we may have expiring tokens, etc.

@rly (Contributor) commented Feb 5, 2025

To add to @magland's answer, LINDI defines LindiH5pyFile that extends h5py.File, LindiH5pyGroup that extends h5py.Group, LindiH5pyDataset that extends h5py.Dataset, etc. These classes override most of the common methods in their h5py parent classes to read and write Zarr stores, groups, arrays, etc. instead of HDF5 files, groups, datasets, etc. As a result, the PyNWB software, which accepts an h5py.File and works with h5py classes, can accept a LindiH5pyFile and does not know that it is actually reading/writing a Zarr store in disguise. LindiH5pyFile can be created from a LINDI JSON file, which is similar to the Kerchunk / reference file system JSON file.
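
In practice that means PyNWB can be pointed at a LINDI file directly, roughly like this (a sketch based on lindi's README; the file name is hypothetical):

```python
import lindi
import pynwb

# LindiH5pyFile subclasses h5py.File, so PyNWB accepts it unchanged
f = lindi.LindiH5pyFile.from_lindi_file("nwb.lindi.json")
with pynwb.NWBHDF5IO(file=f, mode="r") as io:
    nwbfile = io.read()
```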

I believe we have tested this only for PyNWB (the actual code that interacts with h5py objects is in HDMF) so LindiH5pyFile may not work 100% for other code that works with h5py objects. But as far as we can tell, it works well for our use cases.

A few of us have been looking into icechunk but not too deeply yet. It looks very powerful. This project was heavily inspired by kerchunk and we iterated on it quickly to support our use cases, adding customizations like support for links and references, DANDI authentication, special handling of scalar datasets, special handling of structured arrays, and not storing too many chunks in one JSON for efficiency (see description of some of those here). There are probably some features that we have implemented that would be useful in kerchunk/icechunk, and features that have been implemented in kerchunk/icechunk, as well as integrations, that would be useful for us. Now that LINDI is relatively stable for our use cases, it would be nice to merge our efforts on LINDI with kerchunk or icechunk where appropriate!

@TomNicholas commented Feb 5, 2025

> So my question is, to what extent can this system specifically be replaced by VirtualiZarr + Icechunk/Kerchunk?

Tentatively I think the whole thing could be? 😁

> In a GitHub action, all new NWB files on DANDI are pre-indexed using Lindi. That just means we lazy-read the remote NWB/HDF5 file (no need to download the whole thing) and produce a lindi.json file. That .json file goes to a cloud bucket. Then when a user visits Neurosift and supplies a DANDI URL for an NWB file, we first check the cloud bucket to see if the lindi JSON file has been prepared. If so, we use that. Of course the actual chunks of data come from the original file... but all the Zarr hierarchy and attributes come from the JSON in one shot. LMK if you want more details.

This is extremely similar to the way @ayushnag and @betolink have been using VirtualiZarr inside NASA Earthaccess (see the functions in earthaccess/dmrpp_zarr.py, which convert a pre-existing NASA metadata format called DMR++ to virtual zarr on the fly when requested by the earthaccess user). We could then load this data directly into memory using zarr-python (but that requires this WIP PR from @ayushnag to VirtualiZarr first).

> To add to @magland's answer, LINDI defines LindiH5pyFile that extends h5py.File, LindiH5pyGroup that extends h5py.Group, LindiH5pyDataset that extends h5py.Dataset, etc. These classes override most of the common methods in their h5py parent classes to read and write Zarr stores, groups, arrays, etc. instead of HDF5 files, groups, datasets, etc. As a result, the PyNWB software, which accepts an h5py.File and works with h5py classes, can accept a LindiH5pyFile and does not know that it is actually reading/writing a Zarr store in disguise.

That's extremely cool, and I don't know that I've seen anyone do that in the Pangeo sphere before. (It's a similar idea to nczarr but at the Python level instead of the C level.) But presumably it would be possible to use the same trick to make an h5py.File-style interface that understands data that actually lives in an Icechunk store? I would call this trick "Virtual HDF5", as it's the inverse of what VirtualiZarr does (at least once you include @ayushnag's PR).

> not storing too many chunks in one JSON for efficiency

There's been a lot of recent work in icechunk on optimizing exactly this.

> support for links and references

Can you say more about what this is?

> There is at least one other concern... URLs in these lindi files are often pointers to DANDI Archive files...

It's on Icechunk's roadmap to support HTTP URLs.

@rly (Contributor) commented Feb 5, 2025

HDF5 supports links and references, and both can be internal or external. HDF5 links are essentially paths to an object (group or dataset) within a file. HDF5 object references are essentially low-level pointers to other objects. The NWB data standard, which was initially designed around HDF5, uses both links and references to represent relationships between data objects explicitly. For example, an HDF5 group containing microscopy images of the brain over time would be linked to a group containing metadata about the imaging plane and the microscope. A dataset may also contain an attribute that is a reference to another dataset, which NWB uses to point to particular indices of that dataset.
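
For readers unfamiliar with these HDF5 features, a minimal h5py sketch (the names are illustrative, not actual NWB schema):

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    f.create_group("imaging_plane")
    series = f.create_group("two_photon_series")
    # Link: a named path pointing at another object in the same file
    series["imaging_plane"] = h5py.SoftLink("/imaging_plane")
    # (external variant: h5py.ExternalLink("other.h5", "/imaging_plane"))

    electrodes = f.create_dataset("electrodes", data=np.arange(10))
    # Object reference: a low-level pointer, here stored as an attribute
    region = f.create_dataset("spike_channels", data=[0, 3, 7])
    region.attrs["table"] = electrodes.ref
```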

This functionality is not supported by Zarr (zarr-developers/zarr-python#389), and as far as I know, not supported by Kerchunk. (Though I see this Kerchunk PR: fsspec/kerchunk#463 which adds support for linked internal datasets only.)

@magland (Collaborator) commented Feb 6, 2025

> It's on Icechunk's roadmap to support HTTP URLs.

So Icechunk only works with local files? Our main use case with LINDI is remote files. And as I mentioned we need to be able to handle auth in a flexible way with arbitrary python functions being passed in.

Another issue I thought of... I have custom javascript code for reading LINDI format in the browser, and it's pretty well tested. Does VirtualiZarr/Icechunk have that?

@TomNicholas commented:

> HDF5 links are essentially paths to an object (group or dataset) within a file.

As you said internal links are not currently part of zarr's data model. Adding support for them is an interesting question... xarray.DataTree has a form of links in the "inherited coordinates" feature, but so far has chosen not to complicate the data model by adding general internal links.

> I see this Kerchunk PR: fsspec/kerchunk#463 which adds support for linked internal datasets only.

If we did want to add internal links to the storage layer then icechunk would be a nice layer to do it in, because it already has a redirection layer (the manifests). I raised earth-mover/icechunk#747 to ask about this idea.

I think we might want to chat about your use cases for internal links synchronously though so that I can understand better. (cc @alxmrs)

> > It's on Icechunk's roadmap to support HTTP URLs.
>
> So Icechunk only works with local files? Our main use case with LINDI is remote files.

Icechunk currently supports remote object storage as well as local files, and soon will also support HTTP URLs. The main use case is remote files.

> And as I mentioned we need to be able to handle auth in a flexible way with arbitrary python functions being passed in.

I'm not quite understanding the requirements here - where do the functions need to be passed in? Why do you need functions and not just sets of credentials?

> Another issue I thought of... I have custom javascript code for reading LINDI format in the browser, and it's pretty well tested. Does VirtualiZarr/Icechunk have that?

Not yet, but it could and should. Icechunk has an open spec so in theory anyone could write a javascript client, but it would be easier if we can bind a javascript API to the icechunk rust implementation. See some discussion on this here: earth-mover/icechunk#356.


I just want to emphasise that all of your questions here are about relatively minor differences between the projects, and mostly asking for things that others have asked for already. I'm confident that we should be trying to collaborate, because the basic requirements here are identical!

@magland (Collaborator) commented Feb 19, 2025

@TomNicholas Thanks for those clarifications. Happy to try to merge efforts. From a practical perspective, I am mostly using lindi in other projects rather than actively developing the framework. For now it meets all of my requirements. I'd be open to switching to icechunk, but it needs to tick all the boxes before I do... that includes the things mentioned above (I'm thinking of javascript support, an h5py-like object for use with pynwb, flexible handling of authentication for DANDI, representation of the HDF5 structures that NWB depends on, etc.). I realize these are obstacles that could be overcome. As I said, right now my focus is on other projects that use lindi, so I'm not looking to spend a lot of development time. But I'll try to help out where I can.
