I know the package is very much still under development and I understand that not all features are implemented yet, e.g. the ability to open reference files with inlined references.
What I did
I created a kerchunk reference dataset, wrote all inlined references out to files on disk (except the .za* and .zgroup metadata keys, which stay inlined), and tried opening those references again with virtualizarr.
import xarray as xr
import tempfile
import json
from pathlib import Path
import virtualizarr as vz
import os

# Create xarray dataset
ds1 = xr.Dataset(
    {
        "a": (("x", "y"), [[1, 2], [3, 4]]),
        "b": (("x", "y"), [[10, 20], [30, 40]]),
    },
    coords={"x": [10, 20], "y": [1, 2]},
)
ref1 = ds1.virtualize.to_kerchunk()
tempdir1 = Path(tempfile.TemporaryDirectory().name)


def outline_references(ref: dict, folder: Path = None) -> dict:
    """
    Virtualizarr currently does not support inlined references.
    To open references with virtualizarr, the references must be written to a file.
    Except the .zarray, .zattrs and .zgroup files, all references are written to disk.
    """
    refs = ref["refs"]
    for k, v in refs.items():
        if os.path.basename(k).startswith('.'):
            continue
        elif isinstance(v, str):
            file = folder / k
            if not os.path.exists(os.path.dirname(file)):
                os.makedirs(os.path.dirname(file))
            with open(folder / k, "w") as f:
                f.write(v)
            refs[k] = [str(file), 0, v.__sizeof__()]
    return ref


ref1 = outline_references(ref1, tempdir1)

## Write references to disk (open_virtual_dataset expects a string)
with open("ref1.json", "w") as f:
    json.dump(ref1, f)

vds1 = vz.open_virtual_dataset("ref1.json", filetype='kerchunk')
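A side note on the length field written into each reference: the snippet above stores `v.__sizeof__()`, which is the in-memory size of the Python string object rather than the number of bytes written to the file. This is unrelated to the error reported below, but it may matter once the references load. A minimal sketch of the difference, where `byte_length` is a hypothetical helper and UTF-8 encoding is an assumption:

```python
def byte_length(value: str) -> int:
    """Number of bytes that f.write(value) puts on disk, assuming UTF-8 text."""
    return len(value.encode("utf-8"))


value = "hello"
print(byte_length(value))   # bytes actually written to the file
print(value.__sizeof__())   # size of the str object, including interpreter overhead
```

The two numbers differ because `__sizeof__` includes the object header, so a reference built from it would describe a byte range longer than the file.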
What happened
I get several errors when doing vds1 = vz.open_virtual_dataset("ref1.json", filetype='kerchunk'):
File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/manifests/manifest.py:100, in validate_and_normalize_path_to_uri(path, fs_root)
97 _path = PosixPath(path)
99 if not _path.suffix:
--> 100 raise ValueError(
101 f"entries in the manifest must be paths to files, but this path has no file suffix: {path}"
102 )
104 # only posix paths can possibly not be absolute
105 if not _path.is_absolute():
ValueError: entries in the manifest must be paths to files, but this path has no file suffix: /var/folders/fj/g0x4n_f15tb6zfwjhzc8gvzr0000gn/T/tmpcz41hy4y/x/0
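To illustrate why this particular path is rejected, here is a small sketch of the suffix test shown in the traceback. It uses `PurePosixPath` instead of `PosixPath` so that it runs on any platform (an assumption for portability; the suffix semantics are the same), and `has_file_suffix` is a hypothetical helper name, not part of virtualizarr:

```python
from pathlib import PurePosixPath


def has_file_suffix(path: str) -> bool:
    """Mirror the `_path.suffix` test from the traceback above."""
    return bool(PurePosixPath(path).suffix)


print(has_file_suffix("/tmp/refs/x/0"))      # False: a 1D chunk key has no dot
print(has_file_suffix("/tmp/refs/x/0.bin"))  # True
print(has_file_suffix("/tmp/refs/a/0.0"))    # True: ".0" is treated as a suffix
```

Chunk keys of multi-dimensional arrays such as a/0.0 contain a dot and therefore pass the check, which may be why the error surfaces for the 1D coordinate x/0.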
Full traceback
In [10]: vds1 = vz.open_virtual_dataset("ref1.json", filetype='kerchunk')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 vds1 = vz.open_virtual_dataset("ref1.json", filetype='kerchunk')

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/backend.py:203, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
    200 if backend_cls is None:
    201     raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 203 vds = backend_cls.open_virtual_dataset(
    204     filepath,
    205     group=group,
    206     drop_variables=drop_variables,
    207     loadable_variables=loadable_variables,
    208     decode_times=decode_times,
    209     indexes=indexes,
    210     virtual_backend_kwargs=virtual_backend_kwargs,
    211     reader_options=reader_options,
    212 )
    214 return vds

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/readers/kerchunk.py:75, in KerchunkVirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
     72     with fs.open_file() as of:
     73         refs = ujson.load(of)
---> 75     vds = dataset_from_kerchunk_refs(KerchunkStoreRefs(refs), fs_root=fs_root)
     77 else:
     78     raise ValueError(
     79         "The input Kerchunk reference did not seem to be in Kerchunk's JSON or Parquet spec: https://fsspec.github.io/kerchunk/spec.html. If your Kerchunk generated references are saved in parquet format, make sure the file extension is `.parquet`. The Kerchunk format autodetection is quite flaky, so if your reference matches the Kerchunk spec feel free to open an issue: https://github.com/zarr-developers/VirtualiZarr/issues"
     80     )

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/translators/kerchunk.py:136, in dataset_from_kerchunk_refs(refs, drop_variables, virtual_array_class, indexes, fs_root)
    119 def dataset_from_kerchunk_refs(
    120     refs: KerchunkStoreRefs,
    121     drop_variables: list[str] = [],
   (...)
    124     fs_root: str | None = None,
    125 ) -> Dataset:
    126     """
    127     Translate a store-level kerchunk reference dict into an xarray Dataset containing virtualized arrays.
    128
   (...)
    133         Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
    134     """
--> 136     vars = virtual_vars_from_kerchunk_refs(
    137         refs, drop_variables, virtual_array_class, fs_root=fs_root
    138     )
    139     ds_attrs = fully_decode_arr_refs(refs["refs"]).get(".zattrs", {})
    140     coord_names = ds_attrs.pop("coordinates", [])

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/translators/kerchunk.py:110, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class, fs_root)
    105     drop_variables = []
    106 var_names_to_keep = [
    107     var_name for var_name in var_names if var_name not in drop_variables
    108 ]
--> 110 vars = {
    111     var_name: variable_from_kerchunk_refs(
    112         refs, var_name, virtual_array_class, fs_root=fs_root
    113     )
    114     for var_name in var_names_to_keep
    115 }
    116 return vars

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/translators/kerchunk.py:111, in <dictcomp>(.0)
    105     drop_variables = []
    106 var_names_to_keep = [
    107     var_name for var_name in var_names if var_name not in drop_variables
    108 ]
    110 vars = {
--> 111     var_name: variable_from_kerchunk_refs(
    112         refs, var_name, virtual_array_class, fs_root=fs_root
    113     )
    114     for var_name in var_names_to_keep
    115 }
    116 return vars

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/translators/kerchunk.py:169, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class, fs_root)
    167 dims = zattrs.pop("_ARRAY_DIMENSIONS")
    168 if chunk_dict:
--> 169     manifest = manifest_from_kerchunk_chunk_dict(chunk_dict, fs_root=fs_root)
    170     varr = virtual_array_class(zarray=zarray, chunkmanifest=manifest)
    171 elif len(zarray.shape) != 0:
    172     # empty variables don't have physical chunks, but zarray shows that the variable
    173     # is at least 1D

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/translators/kerchunk.py:200, in manifest_from_kerchunk_chunk_dict(kerchunk_chunk_dict, fs_root)
    198     elif not isinstance(v, (tuple, list)):
    199         raise TypeError(f"Unexpected type {type(v)} for chunk value: {v}")
--> 200     chunk_entries[k] = chunkentry_from_kerchunk(v, fs_root=fs_root)
    201 return ChunkManifest(entries=chunk_entries)

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/translators/kerchunk.py:217, in chunkentry_from_kerchunk(path_and_byte_range_info, fs_root)
    215 else:
    216     path, offset, length = path_and_byte_range_info
--> 217 return ChunkEntry.with_validation(  # type: ignore[attr-defined]
    218     path=path, offset=offset, length=length, fs_root=fs_root
    219 )

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/manifests/manifest.py:52, in ChunkEntry.with_validation(cls, path, offset, length, fs_root)
     40 """
     41 Constructor which validates each part of the chunk entry.
     42
   (...)
     47     Required if any (likely kerchunk-generated) paths are relative in order to turn them into absolute paths (which virtualizarr requires).
     48 """
     50 # note: we can't just use `__init__` or a dataclass' `__post_init__` because we need `fs_root` to be an optional kwarg
---> 52 path = validate_and_normalize_path_to_uri(path, fs_root=fs_root)
     53 validate_byte_range(offset=offset, length=length)
     54 return ChunkEntry(path=path, offset=offset, length=length)

File ~/virtualizarr/lib/python3.10/site-packages/virtualizarr/manifests/manifest.py:100, in validate_and_normalize_path_to_uri(path, fs_root)
     97 _path = PosixPath(path)
     99 if not _path.suffix:
--> 100     raise ValueError(
    101         f"entries in the manifest must be paths to files, but this path has no file suffix: {path}"
    102     )
    104 # only posix paths can possibly not be absolute
    105 if not _path.is_absolute():

ValueError: entries in the manifest must be paths to files, but this path has no file suffix: /var/folders/fj/g0x4n_f15tb6zfwjhzc8gvzr0000gn/T/tmpcz41hy4y/x/0
I had to modify manifest.py at both line 88 and line 100 and deactivate these suffix checks to be able to load the data:
elif any(path.startswith(prefix) for prefix in VALID_URI_PREFIXES):
    #if not PosixPath(path).suffix:
    #    raise ValueError(
    #        f"entries in the manifest must be paths to files, but this path has no file suffix: {path}"
    #    )
    return path  # path is already in URI form

#if not _path.suffix:
#    raise ValueError(
#        f"entries in the manifest must be paths to files, but this path has no file suffix: {path}"
#
This is obviously not a permanent fix and ignores other cases, but this is how it currently works for me.
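A possible alternative that avoids patching the installed package would be to give every written chunk file a suffix, so the paths satisfy the check quoted in the error message. The sketch below is untested against virtualizarr itself: `outline_references_with_suffix` and the `.bin` suffix are hypothetical names, and it assumes that inlined string values are either plain text or `base64:`-prefixed data, as described in the kerchunk reference spec.

```python
import base64
import tempfile
from pathlib import Path


def outline_references_with_suffix(ref: dict, folder: Path, suffix: str = ".bin") -> dict:
    """Write inlined chunk values to `folder`, one file per chunk, each with `suffix`."""
    refs = ref["refs"]
    for key, value in list(refs.items()):
        # Leave metadata keys (.zarray, .zattrs, .zgroup) and existing
        # [path, offset, length] references untouched.
        if Path(key).name.startswith(".") or not isinstance(value, str):
            continue
        # Assumption: inlined data is plain text or "base64:"-prefixed.
        if value.startswith("base64:"):
            payload = base64.b64decode(value[len("base64:"):])
        else:
            payload = value.encode("utf-8")
        target = folder / (key + suffix)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        # Record the real byte length of what was written.
        refs[key] = [str(target), 0, len(payload)]
    return ref


# Small stand-in reference dict (not real virtualizarr output).
demo = {
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "x/.zarray": "{}",
        "x/0": "base64:" + base64.b64encode(b"\x0a\x00\x14\x00").decode(),
    }
}
with tempfile.TemporaryDirectory() as tmp:
    out = outline_references_with_suffix(demo, Path(tmp))
    path, offset, length = out["refs"]["x/0"]
    print(Path(path).name, offset, length)
```

Whether the resulting references then open cleanly with open_virtual_dataset would still need to be confirmed; the sketch only addresses the missing-suffix condition.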
(This issue was originally posted in a modified version at fsspec/kerchunk#536 (comment))