Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zarr-python cannot read arrays saved by tensorstore using the zstd compressor #2056

Open
mkitti opened this issue Jul 26, 2024 · 4 comments · May be fixed by zarr-developers/numcodecs#707
Open
Labels
bug Potential issues with the zarr-python library V2 Affects the v2 branch

Comments

@mkitti
Copy link

mkitti commented Jul 26, 2024

Zarr version

v2.18.2

Numcodecs version

v0.12.1

Python Version

3.12.4

Operating System

Linux

Installation

using conda

Description

I get the following error when trying to open a dataset compressed with tensorstore using the zstd compressor.

RuntimeError: Zstd decompression error: invalid input data

Steps to reproduce

In [8]: ds = ts.open({
   ...:     'driver': 'zarr',
   ...:     'kvstore': {
   ...:         'driver': 'file',
   ...:         'path': 'tmp/zarr_zstd_dataset',
   ...:     },
   ...:     'metadata': {
   ...:         'compressor': {
   ...:             'id': 'zstd',
   ...:             'level': 3,
   ...:         },
   ...:         'shape': [1024, 1024],
   ...:         'chunks': [64, 64],
   ...:         'dtype': '|u1',
   ...:         'dimension_separator': '/',
   ...:     },
   ...:     'create': True,
   ...:     'delete_existing': True,
   ...: }).result()

In [9]: ds[:,:] = 5

In [10]: import zarr

In [11]: arr = zarr.open_array("tmp/zarr_zstd_dataset")

In [12]: arr[:,:]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[12], line 1
----> 1 arr[:,:]

File ~/review_temp/conda/3/x86_64/envs/zarr_python/lib/python3.12/site-packages/zarr/core.py:798, in Array.__getitem__(self, selection)
    796     result = self.vindex[selection]
    797 elif is_pure_orthogonal_indexing(pure_selection, self.ndim):
--> 798     result = self.get_orthogonal_selection(pure_selection, fields=fields)
    799 else:
    800     result = self.get_basic_selection(pure_selection, fields=fields)

File ~/review_temp/conda/3/x86_64/envs/zarr_python/lib/python3.12/site-packages/zarr/core.py:1080, in Array.get_orthogonal_selection(self, selection, out, fields)
   1077 # setup indexer
   1078 indexer = OrthogonalIndexer(selection, self)
-> 1080 return self._get_selection(indexer=indexer, out=out, fields=fields)

File ~/review_temp/conda/3/x86_64/envs/zarr_python/lib/python3.12/site-packages/zarr/core.py:1343, in Array._get_selection(self, indexer, out, fields)
   1340 if math.prod(out_shape) > 0:
   1341     # allow storage to get multiple items at once
   1342     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
-> 1343     self._chunk_getitems(
   1344         lchunk_coords,
   1345         lchunk_selection,
   1346         out,
   1347         lout_selection,
   1348         drop_axes=indexer.drop_axes,
   1349         fields=fields,
   1350     )
   1351 if out.shape:
   1352     return out

File ~/review_temp/conda/3/x86_64/envs/zarr_python/lib/python3.12/site-packages/zarr/core.py:2183, in Array._chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection, drop_axes, fields)
   2181 for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
   2182     if ckey in cdatas:
-> 2183         self._process_chunk(
   2184             out,
   2185             cdatas[ckey],
   2186             chunk_select,
   2187             drop_axes,
   2188             out_is_ndarray,
   2189             fields,
   2190             out_select,
   2191             partial_read_decode=partial_read_decode,
   2192         )
   2193     else:
   2194         # check exception type
   2195         if self._fill_value is not None:

File ~/review_temp/conda/3/x86_64/envs/zarr_python/lib/python3.12/site-packages/zarr/core.py:2096, in Array._process_chunk(self, out, cdata, chunk_selection, drop_axes, out_is_ndarray, fields, out_selection, partial_read_decode)
   2094 except ArrayIndexError:
   2095     cdata = cdata.read_full()
-> 2096 chunk = self._decode_chunk(cdata)
   2098 # select data from chunk
   2099 if fields:

File ~/review_temp/conda/3/x86_64/envs/zarr_python/lib/python3.12/site-packages/zarr/core.py:2352, in Array._decode_chunk(self, cdata, start, nitems, expected_shape)
   2350         chunk = self._compressor.decode_partial(cdata, start, nitems)
   2351     else:
-> 2352         chunk = self._compressor.decode(cdata)
   2353 else:
   2354     chunk = cdata

File numcodecs/zstd.pyx:219, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:153, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: invalid input data

Additional output

$ conda env export
name: zarr_python
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - aiohttp=3.9.5=py312h98912ed_0
  - aiosignal=1.3.1=pyhd8ed1ab_0
  - aom=3.9.1=hac33072_0
  - asciitree=0.3.3=py_2
  - asttokens=2.4.1=pyhd8ed1ab_0
  - attrs=23.2.0=pyh71513ae_0
  - blosc=1.21.6=hef167b5_0
  - brotli-python=1.1.0=py312h30efb56_1
  - bzip2=1.0.8=h4bc722e_7
  - c-ares=1.32.3=h4bc722e_0
  - ca-certificates=2024.7.4=hbcca054_0
  - certifi=2024.7.4=pyhd8ed1ab_0
  - cffi=1.16.0=py312hf06ca03_0
  - charset-normalizer=3.3.2=pyhd8ed1ab_0
  - dav1d=1.2.1=hd590300_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - exceptiongroup=1.2.2=pyhd8ed1ab_0
  - executing=2.0.1=pyhd8ed1ab_0
  - fasteners=0.17.3=pyhd8ed1ab_0
  - frozenlist=1.4.1=py312h98912ed_0
  - fsspec=2024.6.1=pyhff2d567_0
  - h2=4.1.0=pyhd8ed1ab_0
  - hpack=4.0.0=pyh9f0ad1d_0
  - hyperframe=6.0.1=pyhd8ed1ab_0
  - idna=3.7=pyhd8ed1ab_0
  - ipython=8.26.0=pyh707e725_0
  - jedi=0.19.1=pyhd8ed1ab_0
  - keyutils=1.6.1=h166bdaf_0
  - krb5=1.21.3=h659f571_0
  - ld_impl_linux-64=2.40=hf3520f5_7
  - libabseil=20240116.2=cxx17_he02047a_1
  - libavif16=1.1.0=h9b56c87_0
  - libblas=3.9.0=23_linux64_openblas
  - libcblas=3.9.0=23_linux64_openblas
  - libcurl=8.9.0=hdb1bdb2_0
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=hd590300_2
  - libexpat=2.6.2=h59595ed_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=14.1.0=h77fa898_0
  - libgfortran-ng=14.1.0=h69a702a_0
  - libgfortran5=14.1.0=hc5f4f2c_0
  - libgomp=14.1.0=h77fa898_0
  - libjpeg-turbo=3.0.0=hd590300_1
  - liblapack=3.9.0=23_linux64_openblas
  - libnghttp2=1.58.0=h47da74e_1
  - libnsl=2.0.1=hd590300_0
  - libopenblas=0.3.27=pthreads_hac2b453_1
  - libpng=1.6.43=h2797004_0
  - libprotobuf=4.25.3=h08a7969_0
  - libsqlite=3.46.0=hde9e2c9_0
  - libssh2=1.11.0=h0841786_0
  - libstdcxx-ng=14.1.0=hc0a3c3a_0
  - libuuid=2.38.1=h0b41bf4_0
  - libwebp-base=1.4.0=hd590300_0
  - libxcrypt=4.4.36=hd590300_1
  - libzlib=1.3.1=h4ab18f5_1
  - lz4-c=1.9.4=hcb278e6_0
  - matplotlib-inline=0.1.7=pyhd8ed1ab_0
  - ml_dtypes=0.4.0=py312h1d6d2e6_1
  - msgpack-python=1.0.8=py312h2492b07_0
  - multidict=6.0.5=py312h98912ed_0
  - ncurses=6.5=h59595ed_0
  - numcodecs=0.12.1=py312h7070661_1
  - numpy=1.26.4=py312heda63a1_0
  - openssl=3.3.1=h4bc722e_2
  - parso=0.8.4=pyhd8ed1ab_0
  - pexpect=4.9.0=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=24.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.47=pyha770c72_0
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.3=pyhd8ed1ab_0
  - pybind11-abi=4=hd8ed1ab_3
  - pycparser=2.22=pyhd8ed1ab_0
  - pygments=2.18.0=pyhd8ed1ab_0
  - pysocks=1.7.1=pyha2e5f31_6
  - python=3.12.4=h194c7f8_0_cpython
  - python_abi=3.12=4_cp312
  - rav1e=0.6.6=he8a937b_2
  - readline=8.2=h8228510_1
  - requests=2.32.3=pyhd8ed1ab_0
  - setuptools=71.0.4=pyhd8ed1ab_0
  - six=1.16.0=pyh6c4a22f_0
  - snappy=1.2.1=ha2e4443_0
  - stack_data=0.6.2=pyhd8ed1ab_0
  - svt-av1=2.1.2=hac33072_0
  - tensorstore=0.1.62=py312h7e2185d_0
  - tk=8.6.13=noxft_h4845f30_101
  - traitlets=5.14.3=pyhd8ed1ab_0
  - typing_extensions=4.12.2=pyha770c72_0
  - tzdata=2024a=h0c530f3_0
  - urllib3=2.2.2=pyhd8ed1ab_1
  - wcwidth=0.2.13=pyhd8ed1ab_0
  - wheel=0.43.0=pyhd8ed1ab_1
  - xz=5.2.6=h166bdaf_0
  - yarl=1.9.4=py312h98912ed_0
  - zarr=2.18.2=pyhd8ed1ab_0
  - zstandard=0.23.0=py312h3483029_0
  - zstd=1.5.6=ha6fb4c9_0
prefix: /home/mkitti/review_temp/conda/3/x86_64/envs/zarr_python

xref: google/tensorstore#182

@mkitti mkitti added the bug Potential issues with the zarr-python library label Jul 26, 2024
@mkitti
Copy link
Author

mkitti commented Jul 26, 2024

I previously discussed the root cause of this here:
zarr-developers/numcodecs#519 (comment)

@mkitti
Copy link
Author

mkitti commented Feb 13, 2025

Here's a more compact reproducer. Error exists with zarr-python version 3.0.2.

Reproducer

import zarr
import tensorstore as ts

zarr_path = "reproduce_zarr-python_issue_2056.zarr"

arr = ts.open({
    "driver": "zarr",
    "kvstore": {
        "driver": "file",
        "path": zarr_path
    },
    "key_encoding": "/",
    "metadata": {
        "shape": [1024, 1024],
        "chunks": [128, 128],
        "dtype": "|u1",
        "compressor": {
            "id": "zstd",
            "level": 5
        }
    }
}, create=True, delete_existing=True).result()

arr.write(1).result()

# open with tensorstore
print(f"Opening {zarr_path} with tensorstore")
arr2 = ts.open({
    "driver": "zarr",
    "kvstore": {
        "driver": "file",
        "path": zarr_path
    }
}).result()

# read first chunk with tensorstore
print(f"Reading first chunk with tensorstore")
print(arr2[:128,:128].read().result())

# open with zarr-python
print(f"Opening {zarr_path} with zarr-python")
arr3 = zarr.open(zarr_path)

# read first chunk with zarr-python
print(f"Reading the first chunk with zarr-python")
print(arr3[:128,:128])
# File "numcodecs/zstd.pyx", line 184, in numcodecs.zstd.decompress
# RuntimeError: Zstd decompression error: invalid input data

Output

Opening reproduce_zarr-python_issue_2056.zarr with tensorstore
Reading first chunk with tensorstore
[[1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]
Opening reproduce_zarr-python_issue_2056.zarr with zarr-python
Reading the first chunk with zarr-python
Traceback (most recent call last):
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/reproduce.py", line 46, in <module>
    print(arr3[:128,:128])
          ~~~~^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/array.py", line 2424, in __getitem__
    return self.get_orthogonal_selection(pure_selection, fields=fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/_compat.py", line 43, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/array.py", line 2866, in get_orthogonal_selection
    return sync(
           ^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/sync.py", line 142, in sync
    raise return_result
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/array.py", line 1286, in _get_selection
    await self.codec_pipeline.read(
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 453, in read
    await concurrent_map(
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 68, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 66, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 270, in read_batch
    chunk_array_batch = await self.decode_batch(
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 177, in decode_batch
    chunk_array_batch = await ab_codec.decode(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/abc/codec.py", line 129, in decode
    return await _batching_helper(self._decode_single, chunks_and_specs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/abc/codec.py", line 407, in _batching_helper
    return await concurrent_map(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 68, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 66, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/abc/codec.py", line 420, in wrap
    return await func(chunk, chunk_spec)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/site-packages/zarr/codecs/_v2.py", line 36, in _decode_single
    chunk = await asyncio.to_thread(self.compressor.decode, cdata)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/numcodecs/reproducer/.pixi/envs/default/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "numcodecs/zstd.pyx", line 253, in numcodecs.zstd.Zstd.decode
  File "numcodecs/zstd.pyx", line 184, in numcodecs.zstd.decompress
RuntimeError: Zstd decompression error: invalid input data

pixi.toml

[project]
name = "reproducer"
version = "0.1.0"
description = "Add a short description here"
authors = ["Mark Kittisopikul <markkitt@gmail.com>"]
channels = ["conda-forge"]
platforms = ["linux-64"]

[tasks]

[dependencies]
zarr = ">=3.0.2,<4"
tensorstore = ">=0.1.65,<0.2"

@mkitti
Copy link
Author

mkitti commented Feb 13, 2025

Non-reproduction

The problem does not occur if Tensorstore writes a Zarr v3 array because the frame content header contains a known frame size.

import zarr
import tensorstore as ts

zarr_path = "nonreproduce_zarr-python_issue_2056.zarr"

arr = ts.open({
    "driver": "zarr3",
    "kvstore": {
        "driver": "file",
        "path": zarr_path
    },
    "metadata": {
        "shape": [1024, 1024],
        "chunk_grid": {
            "name": "regular",
            "configuration": {
                "chunk_shape": [128, 128]
            }
        },
        "data_type": "uint8",
        "codecs": [{
            "name": "zstd",
            "configuration": {
                "level": 5
            }
        }]
    }
}, create=True, delete_existing=True).result()

arr.write(1).result()

# open with tensorstore
print(f"Opening {zarr_path} with tensorstore")
arr2 = ts.open({
    "driver": "zarr3",
    "kvstore": {
        "driver": "file",
        "path": zarr_path
    }
}).result()

# read first chunk with tensorstore
print(f"Reading first chunk with tensorstore")
print(arr2[:128,:128].read().result())

# open with zarr-python
print(f"Opening {zarr_path} with zarr-python")
arr3 = zarr.open(zarr_path)

# read first chunk with zarr-python
print(f"Reading the first chunk with zarr-python")
print(arr3[:128,:128])

Output

Opening nonreproduce_zarr-python_issue_2056.zarr with tensorstore
Reading first chunk with tensorstore
[[1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]
Opening nonreproduce_zarr-python_issue_2056.zarr with zarr-python
Reading the first chunk with zarr-python
[[1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]

@mkitti
Copy link
Author

mkitti commented Feb 13, 2025

One indication of the difference between the reproducer and non-reproducer is inforamtion about the compressed file from the zstd command line utility. The -l option shows that the chunk that reproduces the issue has an unknown uncompressed size. The chunk that does not reproduce the issue has a known size.

$ zstd -l reproduce_zarr-python_issue_2056.zarr/0/0
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0      21   B                        None  reproduce_zarr-python_issue_2056.zarr/0/0

$ zstd -l nonreproduce_zarr-python_issue_2056.zarr/c/0/0 
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0      19   B      16.0 KiB  862.316   None  nonreproduce_zarr-python_issue_2056.zarr/c/0/0

Note that the command line utility can decompress either.

$ zstd -d reproduce_zarr-python_issue_2056.zarr/0/0 -o 0.raw
reproduce_zarr-python_issue_2056.zarr/0/0: 16384 bytes                         

$ zstd -d nonreproduce_zarr-python_issue_2056.zarr/c/0/0 -o 0.z3.raw
nonreproduce_zarr-python_issue_2056.zarr/c/0/0: 16384 bytes                    

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library V2 Affects the v2 branch
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants