-
Hi @mdsumner - good question, but to answer it I first need to clarify a few things.
> none of the virtualization schemes are actually in the Zarr spec

No, they aren't. But they are still compatible with the spec! (At least Icechunk is.) That's because the zarr specification is technically not opinionated as to how data is stored on disk. The specification merely states how some arbitrary key-value store interface, in any language, on any type of storage, should behave. That KV store could be the canonical one (a directory of chunk files on a filesystem), or something else entirely. Of course the "native zarr" layout has become pretty important, as there are lots of tools in various languages which expect that layout of chunks on disk. I myself did not fully understand this distinction until recently, and am often sloppy about it.
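To make the KV-store point concrete, here is a minimal sketch using zarr-python's v2 API (everything here is illustrative; a plain in-memory dict already satisfies the store interface):

```python
# The zarr v2 spec's storage model: any mapping from string keys
# (e.g. ".zarray", "0.0") to raw bytes can back a zarr array.
import zarr

store = {}  # a plain dict satisfies zarr-python v2's MutableMapping store interface
z = zarr.open_array(store=store, mode="w", shape=(4, 4), chunks=(2, 2), dtype="i4")
z[:] = 1
print(sorted(store))  # ['.zarray', '0.0', '0.1', '1.0', '1.1'] - metadata key + chunk keys
```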
> it's starting to seem like this is a way of serializing a lazy xarray to a store that borrows structure from Zarr

That's not a great way to think about it, because using xarray is just an implementation detail of the VirtualiZarr python package. It would be perfectly possible to use some other package that doesn't use xarray to take archival files, extract byte-range references, then write them into icechunk. In fact, if someone added icechunk support to the kerchunk package, that's exactly what you would get. Also, whilst the resulting Icechunk store can be read by xarray (via xarray's zarr backend, which calls zarr-python, which in turn calls icechunk's python bindings), it doesn't have to be - that reading stack is another implementation detail.
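Roughly, that flow looks like this in VirtualiZarr's python API (a sketch only; the file path is hypothetical and both APIs are still evolving, so check the current docs):

```python
# A sketch of the virtualization flow: scan an archival file for byte-range
# references, then commit just those references to an Icechunk repository.
import icechunk
import virtualizarr as vz

vds = vz.open_virtual_dataset("archival_data.nc")  # lazy references, no array data loaded

storage = icechunk.local_filesystem_storage("./virtual_store")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)  # writes manifests pointing at the original file
session.commit("add virtual references to archival_data.nc")
```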
> alignment to the v3 standard won't bring those features to other implementations

Zarr-python reading from Icechunk is already aligned with the v3 standard - it's just not "native zarr", because you're using an alternative store implementation. Reading icechunk is a bit harder than reading native zarr because the on-disk representation is more complex. But it could still be done from other languages, ideally by binding to the canonical icechunk rust crate. For example, I have suggested that someone write javascript bindings.
With this context, your question reduces to "given that icechunk exists, should the community still be trying to shoehorn a virtualization layer into native zarr, or instead just use icechunk for anything virtualized?" See Joe's answer to me asking that question. If somebody does create a non-icechunk layout for virtual stores on disk, virtualizarr could write to that too. We actually did have that for one older layout proposal, but got rid of it because no readers supported it.
No - a better way to describe it is that VirtualiZarr uses xarray and kerchunk to help generate (zarr-compatible) Icechunk stores from pre-existing file formats.
-
@TomNicholas gave a great summary of where vzarr fits in here. I should add that fsspec's ReferenceFileSystem doesn't need to be tied to zarr/vzarr/icechunk at all. It really is just an IO indirection layer. There have been discussions around whether creating references to the binary chunks within parquet/feather2 files is useful, or perhaps finding line-endings in massive and possibly gzipped CSV files. Those are plausibly useful workflows, but in practice the ideas have not found an audience, and zarr has been the only consumer of kerchunk output, even before vzarr and icechunk existed.
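A minimal sketch of that indirection, with no zarr involved at all (the URL and offsets are hypothetical): keys map either to inline data or to (url, offset, length) byte ranges inside some target file.

```python
import fsspec

refs = {
    "version": 1,
    "refs": {
        "readme": "plain inline text",                       # stored in the refs themselves
        "csv_part": ["https://example.com/big.csv", 0, 65536],  # first 64 KiB of a remote CSV
    },
}
fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat("readme"))    # b'plain inline text'
data = fs.cat("csv_part")  # fetches only that byte range, via an HTTP range request
```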
-
thank you, lots for me to digest here. I'm still confused about how we can have the manifests abstraction without it being described by the spec. How is an implementor to know that a .json or a directory of parquet is a valid source of zarr metadata? (I'm certainly not an experienced implementor of specifications... fwiw). fsspec is also nice, but it's a de facto standard for one programming language (with analogs in C++ and elsewhere, no grand core spec afaik). I need a lot more experience with it specifically before I can explore much more, I think
-
As far as kerchunk specs go, there is a spec: https://fsspec.github.io/kerchunk/spec.html ; although that alone isn't enough to tell you that the result is readable by zarr. I'm sure this repo has something similar too. Mostly, datasets defined in virtual references are pointed to from other URLs, and indicate what the target URL is, rather than relying on some special name or magic bytes as formats of the past have used.
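For example, what makes a kerchunk reference set zarr-readable is simply that its keys mirror a zarr v2 store layout (a sketch with hypothetical values):

```python
# Zarr metadata documents are inlined as JSON strings, while chunk keys
# point at byte ranges inside the archival file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "x/.zarray": (
            '{"zarr_format": 2, "shape": [2], "chunks": [2], "dtype": "<i4",'
            ' "compressor": null, "fill_value": null, "order": "C", "filters": null}'
        ),
        "x/0": ["https://example.com/archival.nc", 1024, 8],  # (url, offset, length)
    },
}
# Mounted via fsspec.filesystem("reference", fo=refs), this looks to
# zarr-python like any other key-value store.
```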
-
awesome, all these answers really appreciated
-
ok thanks, it's been bouncing around in my head that the spec is about the references to bytes and not about how those are stored... it turns out R's {pizzarr} and GDAL both work with them, so somehow I had a blocker in my head about what could work:
For this one I get complaints about codec "fixedscaleoffset", and for this one about shuffle and fill value:
this gives me a lot more to go on. In R, use pizzarr:

```r
library(pizzarr)  ## https://github.com/keller-mark/pizzarr (also need Bioconductor Rarr for some compression algs)
u <- "https://projects.pawsey.org.au/vzarr/NSIDC_SEAICE_PS_N25km.parquet"
z <- pizzarr::HttpStore$new(u)
z$listdir()
g <- zarr_open_group(z)
g$get_item("ICECON")
#> Error in get_codec(config) : Unknown codec shuffle
```
that also gives me the message about shuffle, but the metadata is fine (I have a lot to explore now):

```r
str(z$get_consolidated_metadata())
List of 2
 $ metadata :List of 12
  ..$ .zattrs :List of 50
  .. ..$ Conventions             : chr "CF-1.6, ACDD-1.3"
  .. ..$ acknowledgment          : chr "These data are produced by the NASA Cryospheric Science Program within the Earth Sciences Division under the Sc"| __truncated__
  .. ..$ cdm_data_type           : chr "grid"
  .. ..$ citation                : chr "DiGirolamo, N. E., C. L. Parkinson, D. J. Cavalieri, P. Gloersen, and H. J. Zwally. 2022, updated yearly. Sea I"| __truncated__
  .. ..$ contributor_name        : chr "Nicolo E. DiGirolamo, Claire Parkinson, Per Gloersen, H. J. Zwally, Donald Cavalieri, Walter Meier, J. Scott St"| __truncated__
  .. ..$ contributor_role        : chr "project_scientist, project_scientist, project_scientist, project_scientist, project_scientist, scientist, scien"| __truncated__
  .. ..$ coverage_content_type   : chr "image"
  .. ..$ date_created            : chr "2022-06-28"
  .. ..$ date_metadata_modified  : chr "2022-06-28"
  .. ..$ date_modified           : chr "2022-06-28"
  .. ..$ geospatial_bounds       : chr "POLYGON ((-3850000 5850000, 3750000 5850000, 3750000 -5350000, -3850000 -5350000, -3850000 5850000))"
  .. ..$ geospatial_bounds_crs   : chr "EPSG:3411"
  .. ..$ geospatial_lat_max      : chr "90.0"
  .. ..$ geospatial_lat_min      : chr "30.980564"
  .. ..$ geospatial_lat_units    : chr "degrees_north"
  .. ..$ geospatial_lon_max      : chr "180.0"
  .. ..$ geospatial_lon_min      : chr "-180.0"
  .. ..$ geospatial_lon_units    : chr "degrees_east"
  .. ..$ geospatial_x_resolution : chr "25000.00 meters"
  .. ..$ geospatial_x_units      : chr "meters"
  .. ..$ geospatial_y_resolution : chr "25000.00 meters"
  .. ..$ geospatial_y_units      : chr "meters"
  .. ..$ id                      : chr "10.5067/MPYG15WAA4WX"
  ...
```
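(For the record, both codecs pizzarr complains about have python reference implementations in numcodecs, which would be one place to crib from when porting them; the values below are illustrative, not from this dataset.)

```python
# fixedscaleoffset quantizes floats to small ints via round((x - offset) * scale);
# shuffle reorders bytes so same-significance bytes are adjacent, aiding compression.
import numpy as np
from numcodecs import FixedScaleOffset, Shuffle

data = np.array([100.1, 100.2, 100.3])
fso = FixedScaleOffset(offset=100.0, scale=10, dtype="f8", astype="u1")
enc = fso.encode(data)            # array([1, 2, 3], dtype=uint8)
print(fso.decode(enc))            # [100.1 100.2 100.3]

shuf = Shuffle(elementsize=4)
buf = np.arange(4, dtype="<u4").tobytes()
assert bytes(shuf.decode(shuf.encode(buf))) == buf  # lossless round-trip
```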
-
GDAL has gained the capacity to read parquet reference stores from zip urls, and the shuffle filter. I'll see what can be done for pizzarr; fixedscaleoffset seems to be something I could handle too 😃
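Something like this is how that might look via GDAL's python bindings and its Zarr driver (a hedged sketch: the URL is hypothetical and the handler chaining may need tweaking; requires a recent GDAL built with the Zarr driver):

```python
# Chain /vsizip/ over /vsicurl/ to reach a parquet reference store inside a
# remote zip, then open it with the multidimensional Zarr driver.
from osgeo import gdal

path = 'ZARR:"/vsizip/{/vsicurl/https://example.com/refs.parquet.zip}"'
ds = gdal.OpenEx(path, gdal.OF_MULTIDIM_RASTER)
rg = ds.GetRootGroup()
print(rg.GetMDArrayNames())  # list the arrays exposed by the reference store
```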
-
(I'm re-reading and catching up some more as I learn, with notes as I go) thanks!
-
As far as I understand, none of the virtualization schemes are actually in the Zarr spec, so it's starting to seem like this is a way of serializing a lazy xarray to a store that borrows structure from Zarr.
Is there scope, or ongoing discussion, about where this will go? It seems that only Zarr-python and zarrs support these stores, and alignment to the v3 standard won't bring those features to other implementations. Any thoughts? @martindurant interested in your take especially 🙏