
single binary lindi using tar #89

Merged: 40 commits merged into main from lindi-tar on Sep 17, 2024

Conversation

@magland (Collaborator) commented on Aug 2, 2024:

A second attempt at a single file binary lindi format (the first attempt was #88). This one uses tar and a special index inspired by https://pypi.org/project/indexedtar/

See examples/write_lindi_binary.py

Will describe more later.

@codecov-commenter commented on Aug 2, 2024:

Codecov Report

Attention: Patch coverage is 43.55263% with 429 lines in your changes missing coverage. Please review.

Project coverage is 71.07%. Comparing base (ff05704) to head (1c495e5).

Files Patch % Lines
lindi/tar/lindi_tar.py 12.65% 207 Missing ⚠️
lindi/LindiH5pyFile/LindiH5pyFile.py 57.96% 66 Missing ⚠️
lindi/tar/create_tar_header.py 3.70% 52 Missing ⚠️
lindi/tar/LindiTarStore.py 29.82% 40 Missing ⚠️
...ndi/LindiH5pyFile/LindiReferenceFileSystemStore.py 62.85% 39 Missing ⚠️
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py 85.03% 19 Missing ⚠️
...ndi/conversion/create_zarr_dataset_from_h5_data.py 81.25% 3 Missing ⚠️
...indi/LindiH5pyFile/writers/LindiH5pyGroupWriter.py 0.00% 2 Missing ⚠️
lindi/LindiH5ZarrStore/_util.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #89      +/-   ##
==========================================
- Coverage   79.47%   71.07%   -8.40%     
==========================================
  Files          30       33       +3     
  Lines        2265     2939     +674     
==========================================
+ Hits         1800     2089     +289     
- Misses        465      850     +385     


@magland magland mentioned this pull request Aug 2, 2024
@magland (Collaborator, Author) commented on Aug 2, 2024:

The idea here is to support a lindi.tar format in addition to lindi.json, the advantage being that the .tar file can contain binary chunks -- see the intro comment of #88

The tar contains a top-level file called lindi.json, which is the usual JSON format, but it can refer to other files in the .tar with relative paths. So, if you extract the tar, then you can load the lindi.json without any problem.

But there are a couple of challenges to overcome.

First, tar format does not allow deleting or resizing of files. This is a problem when we write to the lindi.tar, because the lindi.json needs to get updated.

The second issue is that .tar is not cloud-friendly, in that the index cannot be pulled down in one request. The files are stored sequentially, each preceded by a header. So if there are thousands of files, you may need to make thousands of network requests just to locate them all.

So, to address these issues (inspired by https://pypi.org/project/indexedtar/), here's what I did:

There are two additional files in the .tar: .tar_entry.json and .tar_index.json. The first is a fixed-size file that must always be the first entry in the archive. It records the location within the file of .tar_index.json, and that index contains the byte offsets of every file in the archive (including lindi.json). So once we download the index, we're good to go.
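The indexed-tar idea can be sketched with the Python standard library alone. This is a hypothetical illustration of random access via a byte-offset index -- the actual field names and layout of .tar_entry.json / .tar_index.json in lindi may differ:

```python
import io
import json
import tarfile

def add_file(tf: tarfile.TarFile, name: str, data: bytes) -> None:
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

# Build a tiny archive in memory (a stand-in for a .lindi.tar).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    add_file(tf, "lindi.json", json.dumps({"refs": {}}).encode())
    add_file(tf, "blobs/chunk0", b"\x00" * 100)

# Build an index of byte offsets, playing the role of .tar_index.json.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tf:
    index = {
        m.name: {"offset": m.offset_data, "size": m.size}
        for m in tf.getmembers()
    }

# Random access: with the index in hand, a single seek+read fetches one
# file's bytes -- no walking of sequential tar headers required. In the
# cloud setting this seek+read would be one HTTP range request.
entry = index["blobs/chunk0"]
buf.seek(entry["offset"])
chunk = buf.read(entry["size"])
```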

Now let's say we're creating this lindi.tar locally. What happens when we need to update the lindi.json or the .tar_index.json? Well, we can pad them with extra whitespace so that they can grow to a certain extent without changing any of the tar headers (we can just overwrite the content in place). If one of these files grows too large, we need to replace it. To do that, we just rename it in its header to ./trash/xxxxx, effectively deleting it, and then append a new JSON file to the end of the archive. Every time a new file (e.g., a blob) is appended to the archive, the index content needs to get updated. And if the index is ever replaced, the .tar_entry.json content needs to get updated as well, but that file never changes size (and remember, it's always first in the archive).
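The in-place update trick relies on JSON parsers ignoring trailing whitespace: a padded JSON file can be rewritten within its allocation without touching the tar header's recorded size. A minimal sketch, assuming a hypothetical 4096-byte allocation (not lindi's actual code or constants):

```python
import json

# Illustrative padded allocation for a JSON file inside the tar.
ALLOC = 4096

def pad_json(obj) -> bytes:
    """Serialize obj and pad with spaces to exactly ALLOC bytes, so the
    file can be rewritten in place as long as it fits. If it outgrows
    the allocation, the caller must relocate it in the archive instead
    (rename the old entry to ./trash/... and append a new one)."""
    data = json.dumps(obj).encode()
    if len(data) > ALLOC:
        raise ValueError("content too large; relocate within the archive")
    return data + b" " * (ALLOC - len(data))

blob = pad_json({"refs": {"a": 1}})
```

Since `len(blob)` is always `ALLOC`, overwriting the old padded content with the new one never changes the tar header, and `json.loads(blob)` still parses cleanly because trailing spaces are ignored.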

@rly @oruebel

@magland (Collaborator, Author) commented on Aug 2, 2024:

Also see this example/test, showing how you can treat the nwb.lindi.tar just like an nwb (HDF5) file -- no staging area needed.

https://github.com/NeurodataWithoutBorders/lindi/blob/lindi-tar/examples/example_tar_nwb.py

@magland magland changed the title [WIP] Lindi tar [WIP] single binary lindi using tar Aug 2, 2024
@magland (Collaborator, Author) commented on Aug 2, 2024:

And here's an example of amending a remote h5 nwb file. The generated .lindi file contains the new dataset chunks and also references the remote chunks.

https://github.com/NeurodataWithoutBorders/lindi/blob/lindi-tar/examples/example_ammend_remote_nwb.py

@magland magland changed the title [WIP] single binary lindi using tar single binary lindi using tar Aug 5, 2024
@magland (Collaborator, Author) commented on Aug 9, 2024:

Following our discussion, I decided to require the .tar extension, so that it's not just file.lindi with tar implied. You need to explicitly include the .tar extension. This leaves room for .lindi to mean something else in the future (perhaps a directory).

Important to note: if you extract a file.lindi.tar you will get a lindi.json and other files. You can then open that lindi.json using the lindi Python package OR using fsspec, and it is valid zarr.

@magland magland merged commit 1c495e5 into main Sep 17, 2024
6 checks passed
@magland magland deleted the lindi-tar branch September 17, 2024 17:25