Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement LocalCache (1) #54

Merged
merged 12 commits into from
May 9, 2024
Merged

implement LocalCache (1) #54

merged 12 commits into from
May 9, 2024

Conversation

magland
Copy link
Collaborator

@magland magland commented Apr 19, 2024

Not sure if this is the best way to do it, but this is a candidate.

@rly

@codecov-commenter
Copy link

codecov-commenter commented Apr 19, 2024

Codecov Report

Attention: Patch coverage is 76.63230% with 68 lines in your changes are missing coverage. Please review.

Project coverage is 82.34%. Comparing base (162f19d) to head (4cd9c1e).

Files Patch % Lines
lindi/LindiRemfile/LindiRemfile.py 73.33% 48 Missing ⚠️
lindi/LindiH5pyFile/LindiH5pyFile.py 66.66% 11 Missing ⚠️
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py 53.33% 7 Missing ⚠️
lindi/LocalCache/LocalCache.py 94.11% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #54      +/-   ##
==========================================
- Coverage   83.42%   82.34%   -1.09%     
==========================================
  Files          28       28              
  Lines        1907     2101     +194     
==========================================
+ Hits         1591     1730     +139     
- Misses        316      371      +55     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@magland magland changed the title implement LocalCache implement LocalCache (1) May 9, 2024
@magland
Copy link
Collaborator Author

magland commented May 9, 2024

Changes made:

Implemented a lindi.LocalCache(...) that can be passed in to the LindiH5pyFile constructor. It uses a sqlite database on disk to cache chunks of remote files.

Remove dependency on remfile and instead use an internal LindiRemfile. This is necessary so we can use the same sqlite local cache method as the rest of the package (remfile uses a different system). Recall that when EXTERNAL_ARRAY_LINK is used, an h5py client is used to load dataset chunks (this happens when the number of chunks is too large to include in the .lindi.json file). In this case, the lindi.LindiRemfile() is used, and the optional local_cache object is passed in to lindi.LindiRemfile(). There are some other minor differences between Remfile and LindiRemfile as well.

Replace lindi.LindiH5pyFile.from_reference_file_system(...) with lindi.LindiH5pyFile.from_lindi_file(...) (more intuitive)

Add a convenience function f.write_lindi_file(fname) where f is a lindi.LindiH5pyFile

@@ -79,8 +96,8 @@ import lindi
# URL of the remote .nwb.lindi.json file
url = 'https://kerchunk.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client for the reference file system
client = lindi.LindiH5pyFile.from_reference_file_system(url)
# Load the h5py-like client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Load the h5py-like client
# Load the h5py-like client with the URL of the remote LINDI file (or the path of a local LINDI file)

@rly
Copy link
Contributor

rly commented May 9, 2024

How large can this cache grow? If it is unbounded, then I think it would be useful to set a configurable maximum size on the cache so that it does not accidentally swamp the filesystem, especially since it is stored in a hidden directory, and users could be working with very large datasets.

@magland
Copy link
Collaborator Author

magland commented May 9, 2024

How large can this cache grow? If it is unbounded, then I think it would be useful to set a configurable maximum size on the cache so that it does not accidentally swamp the filesystem, especially since it is stored in a hidden directory, and users could be working with very large datasets.

@rly That's a good idea. How do you think we should manage the configuration? This shouldn't be a global setting since different scripts can utilize different cache directories.

@rly
Copy link
Contributor

rly commented May 9, 2024

It could be a parameter on LocalCache but if might become unintuitive if you use the same cache directory for multiple applications. Let's punt the size limit as a nice-to-have feature.

Caching in fsspec has a lot of different options, such as expiry time, compression, cache chunks or whole files: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/cached.html
But they do not yet support a cache size limit: fsspec/filesystem_spec#510

Since the LocalCache here is pretty simple, I'm fine with rolling our own, but would it make sense to use any of the fsspec CachingFileSystem classes?

@magland
Copy link
Collaborator Author

magland commented May 9, 2024

Since the LocalCache here is pretty simple, I'm fine with rolling our own, but would it make sense to use any of the fsspec CachingFileSystem classes?

The lindi.LocalCache does something very specific... it caches specific chunks of data of remote files... and that's it. I imagine fsspec's system has a much grander scope, but I haven't looked at it.

@rly
Copy link
Contributor

rly commented May 9, 2024

fsspec's CachingFileSystem has a similar purpose but has a lot of code that hooks into the rest of fsspec. The default approach is different, but interesting - it creates a memory-mapped sparse file that is filled in as blocks are accessed. I think the sqlite approach is more flexible because I think we can more easily manipulate it, but memory-mapped data will probably be faster to load and more amenable to large chunks than blobs in a sqlite db. I haven't carefully examined the details of their approach. In any case, I think this local cache will work great until we need to optimize further.

CachingFileSystem

This class implements chunk-wise local storage of remote files, for quick
access after the initial download. The files are stored in a given
directory with hashes of URLs for the filenames. If no directory is given,
a temporary one is used, which should be cleaned up by the OS after the
process ends. The files themselves are sparse (as implemented in
:class:~fsspec.caching.MMapCache), so only the data which is accessed
takes up space.

@magland magland merged commit 5640e17 into main May 9, 2024
6 checks passed
@magland magland deleted the local-cache branch May 9, 2024 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants