
zarr-python v3 compatibility #516

Merged
merged 54 commits into from
Jan 30, 2025

Conversation

@mpiannucci (Contributor) commented Oct 10, 2024

So far the only tested file type is HDF5. That's the only module that currently works with the new zarr-python in some way.

@martindurant (Member) left a comment

There's much less change here than I might have thought.

kerchunk/hdf.py Outdated
@@ -496,6 +516,8 @@ def _translator(
if h5obj.fletcher32:
logging.info("Discarding fletcher32 checksum")
v["size"] -= 4
key = str.removeprefix(h5obj.name, "/") + "/" + ".".join(map(str, k))
martindurant (Member):

This is the same as what _chunk_key did? Maybe make it a function with a comment saying it's a copy/reimplementation.

By the way, is h5obj.name not actually a string, so you could have done h5obj.name.removeprefix()?
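For reference, the key construction being discussed comes down to plain string operations. A minimal sketch (the function name and arguments are illustrative stand-ins for `h5obj.name` and the chunk tuple `k`, not kerchunk's API):

```python
def chunk_key(h5_name: str, chunk_index: tuple) -> str:
    # h5py dataset names carry a leading "/", e.g. "/depth";
    # zarr v2 store keys must not, so strip it first.
    prefix = h5_name.removeprefix("/")
    # A chunk index like (0, 1) becomes the zarr v2 chunk key "0.1".
    return prefix + "/" + ".".join(map(str, chunk_index))
```

So `chunk_key("/depth", (0, 1))` yields `"depth/0.1"`, matching the inline expression above.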

kerchunk/hdf.py Outdated
shape=h5obj.shape,
dtype=dt or h5obj.dtype,
chunks=h5obj.chunks or False,
fill_value=fill,
compression=None,
compressor=None,
martindurant (Member):

So here, you could reintroduce the compressor

compressor = filters[-1]
filters = filters[:-1]

but obviously it depends on whether there are indeed any filters at all.

It would still need back compat, since filters-only datasets definitely exist.
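A hedged sketch of that split, including the empty-filters case the comment raises (the function name is illustrative, not kerchunk's code):

```python
def split_filters(filters: list):
    """Split a filter chain into (filters, compressor) as zarr v2
    expects, treating the last filter as the compressor.  Illustrative
    only; real code must also decide whether the last codec is
    legitimately a compressor rather than, say, a shuffle filter."""
    if not filters:
        # No filters at all: no compressor either (back-compat path).
        return [], None
    # Last entry becomes the compressor; keep the rest as filters.
    return filters[:-1], filters[-1]
```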

mpiannucci (Contributor, Author):

Yeah, the big issue is that v3 cares about what type of operation it is, and v2 doesn't, so moving them around doesn't necessarily fix that bug.

martindurant (Member):

So there needs to be a change upstream?


kerchunk/hdf.py Outdated
Comment on lines 167 to 173
for k, v in self.store_dict.items():
if isinstance(v, zarr.core.buffer.cpu.Buffer):
key = str.removeprefix(k, "/")
new_keys[key] = v.to_bytes()
keys_to_remove.append(k)
for k in keys_to_remove:
del self.store_dict[k]
martindurant (Member):

This is the hacky bit and could use some explanations. Even when requesting "v2", zarr makes Buffer objects, and the keys are also wrong?

mpiannucci (Contributor, Author):

Yeah, so two issues here:

  1. the keys we get from hdf are, for example, /depth/.zarray when they need to be depth/.zarray
  2. we can't jsonify buffers, which is how the internal MemoryStore in v3 stores its data. So we need to convert the buffers to bytes to be serialized
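The two fixes can be illustrated with a stand-in Buffer class (zarr's real Buffer type lives under zarr.core.buffer; everything below is a sketch, not kerchunk's code):

```python
class FakeBuffer:
    """Stand-in for zarr's Buffer, which wraps raw chunk bytes."""
    def __init__(self, data: bytes):
        self._data = data
    def to_bytes(self) -> bytes:
        return self._data

def normalize_store_dict(store_dict: dict) -> dict:
    out = {}
    for k, v in store_dict.items():
        # Fix 1: keys arrive as "/depth/.zarray" but must be "depth/.zarray".
        key = k.removeprefix("/")
        # Fix 2: Buffer objects are not JSON-serializable; convert to bytes.
        out[key] = v.to_bytes() if isinstance(v, FakeBuffer) else v
    return out
```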

martindurant (Member):

OK - would appreciate comments on the code saying this.

@mpiannucci (Contributor, Author) commented Oct 15, 2024

Also worth noting: zarr 3 doesn't support numcodecs codecs out of the box. There is a PR to help with this, zarr-developers/numcodecs#524 (see also zarr-developers/numcodecs#597 for an updated version), but it would mean a change to codec names, which causes an incompatibility. For the initial icechunk examples we handle this in virtualizarr, but long term it probably belongs here so kerchunk works standalone.

@mpiannucci (Contributor, Author) commented Oct 21, 2024

When zarr-developers/zarr-python#2425 goes in, it should unblock this to work fully with zarr-python v3.

We will also need to create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs, so that a given dataset can be loaded in either a v2 or v3 context (say, if you kerchunk a grib file and then want to convert those references to an icechunk store).

@martindurant (Member)

create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs

Is that a subclass in each case, specifying that it is to be bytes->bytes?

@mpiannucci (Contributor, Author) commented Oct 21, 2024

Sorry, the numcodecs versions exist as they are today. Yes, the v3 version would basically be subclassing zarr.abc.codec using the numcodecs implementations of the kerchunk codecs.

Although the grib decoder should really be bytes-to-array, I think.
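The delegation pattern being proposed might look roughly like this. All names here are hypothetical; the real zarr v3 interface (zarr.abc.codec) is async and operates on Buffer objects rather than raw bytes, so this only illustrates the wrapping idea:

```python
class ReverseCodec:
    """Toy stand-in for an existing numcodecs codec: bytes in, bytes out."""
    def encode(self, buf: bytes) -> bytes:
        return buf[::-1]
    def decode(self, buf: bytes) -> bytes:
        return buf[::-1]

class V3BytesBytesWrapper:
    """Hypothetical adapter: exposes a v3-style bytes-to-bytes codec
    by delegating to an existing numcodecs implementation."""
    def __init__(self, inner):
        self.inner = inner
    def encode(self, data: bytes) -> bytes:
        return self.inner.encode(data)
    def decode(self, data: bytes) -> bytes:
        return self.inner.decode(data)
```

A bytes-to-array codec (as suggested for grib) would instead decode straight to an ndarray-like object, which is a different base class in the v3 model.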

@mpiannucci (Contributor, Author)

This is required upstream: fsspec/filesystem_spec#1734

@mpiannucci (Contributor, Author) commented Oct 23, 2024

Also of note: to get this to work with zarr 3, we pass an fsspec ReferenceFileSystem to a zarr RemoteStore. This works fine when the remote filesystems (where the data files live) have async implementations (s3, http, etc.) but does not work when the data files are on a filesystem with only a sync implementation (local, etc.). The tests depend heavily on the local filesystem, and I'm sure many others do as well, so we need to figure out how sync filesystems interact with the async zarr RemoteStore requirement.
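The general shape of the sync/async mismatch can be sketched with a toy shim that pushes blocking calls onto a worker thread. This is only an illustration of the problem, not fsspec's or zarr's actual API:

```python
import asyncio

class AsyncShim:
    """Toy sketch: bridge a sync-only filesystem into an async store
    interface by delegating blocking calls to a worker thread, so the
    event loop that zarr's async store runs on is not stalled."""
    def __init__(self, sync_fs):
        self.sync_fs = sync_fs
    async def cat_file(self, path: str) -> bytes:
        # asyncio.to_thread runs the blocking read off the event loop.
        return await asyncio.to_thread(self.sync_fs.cat_file, path)
```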

@mpiannucci (Contributor, Author)

Grib works as long as it is read in zarr 2 format.

To be used with zarr 3, the codec needs to be ported over to the new abc.Codec class from zarr.

@martindurant (Member)

OK, the only failures now are in test__grib_idx.py, which uses datatree. I think this is to do with a leading "/" in the paths...

@maxrjones (Contributor)

OK, only failures now are in test__grib_idx.py , which use datatree. I think this is to do with a leading "/" in the paths...

🎉 Exciting progress! DataTree support is not yet complete for Zarr-Python 3; see pydata/xarray#9984. I think a conditional xfail would be the way to go here.
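A sketch of the conditional-xfail idea: mark the DataTree tests as expected failures only when running against Zarr-Python 3. The helper is illustrative, and the pytest usage is shown in comments so the snippet stands alone:

```python
def is_zarr_v3(zarr_version: str) -> bool:
    """True when the installed zarr major version is 3 or later."""
    return int(zarr_version.split(".")[0]) >= 3

# In the test module, something like:
#   import pytest, zarr
#   pytestmark = pytest.mark.xfail(
#       is_zarr_v3(zarr.__version__),
#       reason="DataTree over Zarr-Python 3: pydata/xarray#9984",
#   )
```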

@martindurant martindurant marked this pull request as ready for review January 29, 2025 22:10
@martindurant (Member)

I don't know how much effort we want to spend finding the deadlock triggered by the HDF5 embed test that is now commented out.

@moradology (Contributor)

I'd been bashing my head against those grib tests all day and couldn't figure out why, even in a debugger and doing things manually, none of the DataTree nodes had any data...

In the process of messing with this, I uncovered what may be an FsspecStore bug in its list method: if the path it is given is "", it will replace all "/" with "" and return some very odd results, e.g.
Store keys: ['.zattrs', '.zgroup', 'u.zattrs', 'u.zgroup', 'uinstant.zattrs', 'uinstant.zgroup', 'uinstantheightAboveGround.zattrs', 'uinstantheightAboveGround.zgroup', 'uinstantheightAboveGroundheightAboveGround.zarray', 'uinstantheightAboveGroundheightAboveGround.zattrs', 'uinstantheightAboveGroundheightAboveGround0', 'uinstantheightAboveGroundlatitude.zarray', 'uinstantheightAboveGroundlatitude.zattrs', 'uinstantheightAboveGroundlatitude0.0', 'uinstantheightAboveGroundlongitude.zarray', 'uinstantheightAboveGroundlongitude.zattrs', 'uinstantheightAboveGroundlongitude0.0', 'uinstantheightAboveGroundstep.zarray', 'uinstantheightAboveGroundstep.zattrs', 'uinstantheightAboveGroundstep0', 'uinstantheightAboveGroundtime.zarray', 'uinstantheightAboveGroundtime.zattrs', 'uinstantheightAboveGroundtime0', 'uinstantheightAboveGroundu.zarray', 'uinstantheightAboveGroundu.zattrs', 'uinstantheightAboveGroundu0.0.0.0', 'uinstantheightAboveGroundvalid_time.zarray', 'uinstantheightAboveGroundvalid_time.zattrs', 'uinstantheightAboveGroundvalid_time0.0']
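The suspected behaviour is easy to reproduce in isolation: stripping every "/" from listed keys flattens the hierarchy into run-together names like those above. This mimics the described bug, not FsspecStore's actual code:

```python
# Hypothetical sample keys; the real store lists many more.
keys = ["u/.zattrs", "u/instant/.zgroup", "u/instant/heightAboveGround/.zattrs"]

# Replacing the separator with the empty string concatenates the path
# components, producing keys like "uinstantheightAboveGround.zattrs".
flattened = [k.replace("/", "") for k in keys]
```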

@moradology (Contributor)

Might want to add this small update as well, since it better captures the intent of the version check so far as I can tell: mpiannucci#6

@martindurant (Member)

In the process of messing with this I uncovered what may be a FsspecStore bug in its list method: if the path it is given is "", it will replace all "/" with "" and return some very odd results

!!! Please file this as an issue. Maybe it helps with the datatree thing too.


@TomNicholas

Great to see this progress!

The leading "/" bug was discussed at length in the Xarray meeting earlier today. We will fix it, but if you can xfail it for now, that would suit us in VirtualiZarr.

@martindurant (Member)

@TomNicholas , I am skipping that whole test module - all is green

@TomNicholas

Wow! Okay please let us know when you release this then ❤️

@martindurant martindurant merged commit d2b00a9 into fsspec:main Jan 30, 2025
4 checks passed
@moradology (Contributor)

@martindurant Not a huge deal, but I noticed that this PR didn't make it in: mpiannucci#6

Should I cut it against kerchunk main or nah?

@martindurant (Member)

Actually, it looks like that's now dead code. Following this PR, kerchunk only works with zarr 3, for better or worse.
