VirtualiZarr #91
@rly @magland I just came across VirtualiZarr, which looks related. The effort seems to have grown in part out of kerchunk, but aims to provide similar functionality in a more Zarr-native form.
The Pangeo / climate modeling community also seems to be looking at this: https://github.com/esgf2-us/esgf-virtual-zarr-data-access

Comments
Thanks @oruebel, this does seem relevant. I started looking at those links but haven't really grasped how it works yet.
Another possibly interesting thing: https://docs.hdfgroup.org/hdf5/v1_14/_h5_f__u_g.html#title11. In particular, the
VirtualiZarr author here - I was about to raise this exact issue! (Thanks for making me aware of this repo, @alxmrs.) Naively, I would have thought you all could simply write a VirtualiZarr reader, and hence add your data formats to the list of formats that can be represented as virtual Zarr (zarr-developers/VirtualiZarr#218). For on-disk storage of the byte-range references, VirtualiZarr can write out either to Kerchunk's format (so you can achieve basically what kerchunk does, using VirtualiZarr's API instead of Kerchunk's), or to Icechunk, which is itself a valid Zarr store. I'm not sure about the rest of this project's scope, but it's clearly so closely related to VirtualiZarr that we should all talk. We have bi-weekly meetings and you're more than welcome to join!
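For concreteness, that workflow would look roughly like this with VirtualiZarr (a minimal sketch; the file path is hypothetical, and exact signatures may differ between VirtualiZarr versions):

```python
from virtualizarr import open_virtual_dataset

# Scan the file's metadata and build an xarray.Dataset whose variables
# hold chunk manifests (byte-range references) instead of loaded array data.
vds = open_virtual_dataset("s3://some-bucket/file.nwb")  # hypothetical path

# Persist the references as kerchunk-style JSON...
vds.virtualize.to_kerchunk("refs.json", format="json")

# ...or commit them to an Icechunk store (itself a valid Zarr store);
# available in newer VirtualiZarr versions.
# vds.virtualize.to_icechunk(icechunk_session.store)
```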
From your readme (replying point by point):
- We would call this a "virtual dataset".
- Icechunk would call internal chunks "native chunks" and the chunks in external files "virtual chunks".
- This is exactly how Icechunk works, but Icechunk does it in a way that means full ACID transactions and version control are achieved entirely within the bounds of object storage, i.e. it's "serverless".
- Zarr-python has recent optimizations along these lines.
- So is Icechunk.
- This is exactly what VirtualiZarr + Icechunk/Kerchunk together do.
- This seems to be the only blocker - we should find a way to work on this in Zarr together!! cc @rabernat
Hi @TomNicholas, thanks for reaching out and for that helpful comparison! One thing that's important to us is to be able to create h5py-like objects seamlessly for the lindi/zarr files, so that they can be used with pynwb. I think this is related to what you said was the "only blocker". Generally I'm happy to switch to using VirtualiZarr or Icechunk if it can be made to work. Just to give an idea of what we're using this for, here's an example view of an HDF5 file in the cloud. Internally, for efficiency, it loads a pre-computed lindi file; that json file can be opened using h5py. So my question is, to what extent can this system specifically be replaced by VirtualiZarr + Icechunk/Kerchunk? There is at least one other concern: URLs in these lindi files are often pointers to DANDI Archive files, and for embargoed datasets these require authentication. Lindi has functionality that can handle this.
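For reference, the h5py-like access pattern looks roughly like this (a sketch based on LINDI's documented usage; the URL is hypothetical, and API details should be checked against the LINDI docs):

```python
import lindi
import pynwb

# Hypothetical URL of a pre-computed LINDI index for a remote NWB file
url = "https://example.org/some-file.nwb.lindi.json"

# LindiH5pyFile behaves like an h5py.File, so pynwb can read through it
f = lindi.LindiH5pyFile.from_lindi_file(url)
with pynwb.NWBHDF5IO(file=f, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)
```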
👋 - new to this thread. Zarr/Icechunk dev here.
Interesting! Can you briefly explain how this works?
Icechunk supports multiple virtual chunk containers (data outside of the store itself). Each container points to a storage location (e.g. a bucket). Users can supply Icechunk with credentials for each virtual chunk container. So yes, you would be able to include data that requires specific credentials to access.
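A rough sketch of what that configuration looks like in Icechunk's Python API (the identifiers below are assumptions based on a recent icechunk release and have changed between versions, so verify every name against the icechunk docs; the bucket names are hypothetical):

```python
import icechunk

# Where the Icechunk repo itself lives (hypothetical bucket)
storage = icechunk.s3_storage(bucket="my-repo-bucket", prefix="repo")

# Declare a virtual chunk container for data living outside the store
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer("s3://source-bucket/",
                                   icechunk.s3_store(region="us-east-1"))
)

# Supply per-container credentials when opening the repo
repo = icechunk.Repository.open(
    storage,
    config=config,
    authorize_virtual_chunk_access=icechunk.containers_credentials(
        {"s3://source-bucket/": icechunk.s3_credentials(
            access_key_id="...", secret_access_key="...")}
    ),
)
```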
Hi @jhamman! In a GitHub action, all new NWB files on DANDI are pre-indexed using Lindi (roughly the workflow sketched below). That just means we lazily read the remote NWB/HDF5 file (no need to download the whole thing) and produce a lindi.json file. That .json file goes to a cloud bucket. Then when a user visits neurosift and supplies a DANDI URL for an NWB file, we first check the cloud bucket to see whether the lindi json file has been prepared. If so, we use that. Of course the actual chunks of data come from the original file, but all the zarr hierarchy and attributes come from the json in one shot. LMK if you want more details.
Are you able to use a custom Python function to do the auth? Sometimes it can get complicated, and we may have expiring tokens, etc.
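For reference, the pre-indexing step described above looks roughly like this (a sketch based on LINDI's documented usage; the asset URL is hypothetical):

```python
import json
import lindi

# Hypothetical download URL for a remote NWB/HDF5 asset on DANDI
h5_url = "https://api.dandiarchive.org/api/assets/<asset-id>/download/"

# Lazily scan the remote HDF5 file (only metadata and small chunks are
# fetched) and build a kerchunk-style reference file system
store = lindi.LindiH5ZarrStore.from_file(h5_url)
rfs = store.to_reference_file_system()

# Write the index; this is the .lindi.json that gets uploaded to the bucket
with open("file.nwb.lindi.json", "w") as f:
    json.dump(rfs, f, indent=2)
```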
To add to @magland's answer, LINDI defines LindiH5pyFile, an h5py-like interface over the underlying Zarr store. I believe we have tested this only for PyNWB (the actual code that interacts with h5py objects is in HDMF), so LindiH5pyFile may not work 100% for other code that works with h5py objects. But as far as we can tell, it works well for our use cases. A few of us have been looking into Icechunk, but not too deeply yet. It looks very powerful. This project was heavily inspired by kerchunk, and we iterated on it quickly to support our use cases, adding customizations like support for links and references, DANDI authentication, special handling of scalar datasets, special handling of structured arrays, and not storing too many chunks in one JSON for efficiency (see the description of some of those here). There are probably features we have implemented that would be useful in kerchunk/icechunk, and features (and integrations) that have been implemented in kerchunk/icechunk that would be useful for us. Now that LINDI is relatively stable for our use cases, it would be nice to merge our efforts on LINDI with kerchunk or icechunk where appropriate!
- Tentatively I think the whole thing could be? 😁
- This is extremely similar to the way @ayushnag and @betolink have been using VirtualiZarr inside NASA Earthaccess (see the functions in
- That's extremely cool, and I don't know that I've seen anyone do that in the Pangeo sphere before. (It's a similar idea to
- There's been a lot of recent work in icechunk on optimizing exactly this.
- Can you say more about what this is?
- It's on Icechunk's roadmap to support HTTP URLs.
HDF5 supports links and references, and both can be internal or external. HDF5 links are essentially paths to an object (group or dataset) within a file. HDF5 object references are essentially low-level pointers to other objects. The NWB data standard, which was initially designed around HDF5, uses both links and references to represent relationships between data objects explicitly. For example, an HDF5 group containing microscopy images of the brain over time would be linked to a group containing metadata about the imaging plane and the microscope. A dataset may have an attribute that is a reference to another dataset, which NWB uses to point to particular indices of a dataset. This functionality is not supported by Zarr (zarr-developers/zarr-python#389) and, as far as I know, is not supported by Kerchunk. (Though I see this Kerchunk PR: fsspec/kerchunk#463, which adds support for linked internal datasets only.)
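To illustrate the distinction in h5py terms (a self-contained sketch with made-up group names, loosely echoing the NWB example above):

```python
import h5py

with h5py.File("example.h5", "w") as f:
    plane = f.create_group("/general/imaging_plane")
    series = f.create_group("/acquisition/TwoPhotonSeries")

    # Link: a named path from one object to another within the file
    series["imaging_plane"] = h5py.SoftLink("/general/imaging_plane")

    # Object reference: a low-level pointer, here stored as an attribute
    ts = f.create_dataset("timestamps", data=[0.0, 0.1, 0.2])
    series.attrs.create("timestamps_ref", ts.ref, dtype=h5py.ref_dtype)

    # Following the link and dereferencing behave like normal access
    assert f["/acquisition/TwoPhotonSeries/imaging_plane"] == plane
    assert f[series.attrs["timestamps_ref"]][0] == 0.0
```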
So Icechunk only works with local files? Our main use case with LINDI is remote files. And as I mentioned, we need to be able to handle auth in a flexible way, with arbitrary Python functions being passed in. Another issue I thought of: I have custom JavaScript code for reading the LINDI format in the browser, and it's pretty well tested. Does VirtualiZarr/Icechunk have that?
As you said, internal links are not currently part of Zarr's data model. Adding support for them is an interesting question...
If we did want to add internal links to the storage layer then icechunk would be a nice layer to do it in, because it already has a redirection layer (the manifests). I raised earth-mover/icechunk#747 to ask about this idea. I think we might want to chat about your use cases for internal links synchronously though so that I can understand better. (cc @alxmrs)
Icechunk currently supports remote object storage as well as local files, and soon will also support HTTP URLs. The main use case is remote files.
I'm not quite understanding the requirements here - where do the functions need to be passed in? Why do you need functions and not just sets of credentials?
Not yet, but it could and should. Icechunk has an open spec, so in theory anyone could write a JavaScript client, but it would be easier if we can bind a JavaScript API to the icechunk Rust implementation. See some discussion of this here: earth-mover/icechunk#356. I just want to emphasise that all of your questions here are about relatively minor differences between the projects, mostly asking for things that others have asked for already. I'm confident that we should be trying to collaborate, because the basic requirements here are identical!
@TomNicholas Thanks for those clarifications. Happy to try to merge efforts. From a practical perspective, I am mostly using lindi in other projects rather than actively developing the framework; for now it meets all of my requirements. I'd be open to switching to Icechunk, but it needs to tick all the boxes before I do, including the things mentioned above (JavaScript support, an h5py-like object for use with pynwb, flexible handling of authentication for DANDI, representation of the HDF5 structures that NWB depends on, etc.). I realize these are obstacles that could be overcome. As I said, right now my focus is on other projects that use lindi, so I'm not looking to spend a lot of development time, but I'll try to help out where I can.