Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support async FSMap objects in zarr.open #2774

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

maxrjones
Copy link
Member

@maxrjones maxrjones commented Jan 28, 2025

This is rough code, but I made some progress on supporting FSMap types and wanted to open a PR for early feedback. This isn't a priority for me, so I'd welcome anyone to take over this PR and/or close it and work in a different PR.

Addresses #2706

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 28, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Jan 28, 2025

I recall there being issues with fsmap before, but I confess I don't really know what an fsmap is -- can someone explain what an fsmap is, how it differs from an fsspec filesystem, and why people would use one over the other?

cc @martindurant

I have a vague feeling that it could be useful to have a Store class that wraps generic MutableMapping instances, and maybe fsmaps could go there, but that requires me knowing more about the user context for fsmap.

@martindurant
Copy link
Member

FSMap was created specifically for the needs of Zarr, and it could have been essentially the same as the v2 FSStore, but was much quicker to get out and working within dask/fsspec.

FSMap is a dict-compatible interface (mutable-mapping) on top of a FS instance, which zarr worked with since forever and ignores some FS functionality like the file-like API.

To make it work with v3 might be complex, because zarr makes async calls, and the FSMap interface is blocking, even if the underlying FS is async. That means that there will be sync() within sync(), which might still work as zarr maintains its own event loop in a thread separate from fsspec's.

@martindurant
Copy link
Member

To be clear: this PR does not use the mapper, but constructs a normal store from the mapper's attributes. I support this path.

@maxrjones
Copy link
Member Author

thanks for the questions and answers, Davis and Martin!

IIUC there are three cases that we'd need to account for:

  1. FSMap wraps an async instance of an async-compatible filesystem
    Solution - as implemented in this PR, extract the wrapped filesystem and use it to open an FsspecStore
  2. FSMap wraps a non-async instance of a non-async-compatible filesystem
    Solution option 1 - extract the wrapped filesystem and wrap in AsyncFileSystemWrapper to open an FsspecStore as in Wrap sync fs for xarray.to_zarr #2533
    Solution option 2 - if it's a "file" protocol, extract the wrapped filesystem and open a LocalStore rather than FsspecStore, for other protocols, wrap in AsyncFileSystemWrapper to open an FsspecStore
  3. FSMap wraps a non-async instance of an async-compatibile filesystem
    Solution - this is the case I'm not sure about. Is there a way to convert from a sync instance to an async instance without needing to wrap it in AsyncFileSystemWrapper? @martindurant could you please offer guidance here?

@martindurant
Copy link
Member

Is there a way to convert from a sync instance to an async instance without needing to wrap it in AsyncFileSystemWrapper

The instance has all the arguments it was made with as attributes, so you can make a new instance with asynchronous=True from those.

if it's a "file" protocol, extract the wrapped filesystem and open a LocalStore rather than FsspecStore

Is there a reason to bother doing this?

@maxrjones
Copy link
Member Author

maxrjones commented Jan 28, 2025

if it's a "file" protocol, extract the wrapped filesystem and open a LocalStore rather than FsspecStore

Is there a reason to bother doing this?

I was thinking based on https://filesystem-spec.readthedocs.io/en/latest/async.html#limitations that LocalStore may be faster since it was designed around providing an async interface.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 28, 2025

IMO it's simpler if the path is always fsmap -> fsspecstore

@martindurant
Copy link
Member

LocalStore may be faster since it was designed around providing an async interface.

I doubt it. The disk is not really async (at least with the standard syscalls python uses), so none of the async code should make any difference for local reads at all.

@maxrjones
Copy link
Member Author

maxrjones commented Feb 7, 2025

I ran into #2808 when trying to wrap sync filesystems. My preference for this PR would be to bring the codecov up to 100% and request that this be considered for review, with the plan that the conversion of sync instances to async and the wrapping of sync filesystems could be handled separately. My reasoning is that some people could use this current implementation and I'm not sure how quickly I'll be able to find time to add the other missing features.

Update: I added conversion of sync instances to async

@maxrjones maxrjones changed the title WIP: Support fsspec mutable mapping objects in zarr.open Support async FSMap objects in zarr.open Feb 13, 2025
@maxrjones maxrjones marked this pull request as ready for review February 13, 2025 01:44
@dcherian
Copy link
Contributor

@martindurant are you able to give this a review today? We'd like to get out a release with this bugfix asap.

@maxrjones
Copy link
Member Author

maxrjones commented Feb 14, 2025

FYI I believe zarr.open should now work with both async and sync instances of async filesystem implementations that are created using get_mapper (e.g., s3, gcs). However, local filesystems seem broken with details in #2808. TBH I'm not sure if there are other sync implementations and whether they work. This bug already exists for the from_url approach as far as I can tell (I'm not sure if xarray gets around this or it just wasn't caught by tests). I'm not sure how to approach the fact that the bug would be exposed more to users with the new from_mapper function. Perhaps we should raise NotImplementedError for sync implementations in from_wrapper to prevent more people from running into the bug?

@dcherian
Copy link
Contributor

Perhaps we should raise NotImplementedError for sync implementations in from_wrapper to prevent more people from running into the bug?

This sounds good to me. It's better to raise informative error messages when we can.

@martindurant
Copy link
Member

how to approach the fact that the bug would be exposed more to users with the new from_mapper function

These people are currently broken, so I don't think that giving them a "doesn't work" error is any better.

@maxrjones
Copy link
Member Author

@martindurant I get some of these warnings if I only run the tests in test/test_store/test_fsspec.py, but not if I run the full test suite. Do you have advice on whether/how this should be avoided?

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x11f2b5d10>
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x118259a90>
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x107b0ec10>

@maxrjones
Copy link
Member Author

how to approach the fact that the bug would be exposed more to users with the new from_mapper function

These people are currently broken, so I don't think that giving them a "doesn't work" error is any better.

Your advice in #2808 (comment) works - in this PR we now raise an informative error if auto_mkdir=True is not set in a LocalFilesystem.

@martindurant
Copy link
Member

I suppose that's OK, but why doesn't the store make directories? Of course, for object stores it doesn't matter...

@maxrjones
Copy link
Member Author

FYI I believe this is ready for review, ideally by @martindurant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants