
DOC: create_array(..., data=,...) #2809

Open

DerWeh opened this issue Feb 9, 2025 · 2 comments
Labels
documentation: Improvements to the documentation
help wanted: Issue could use help from someone with familiarity on the topic

Comments

DerWeh commented Feb 9, 2025

Describe the issue linked to the documentation

I am very confused about the data argument in create_array. A common use case is to simply serialize an in-memory array, in which case I tend to pass it as data=in_memory_array. However, I cannot find the data argument in the documentation.

Using IPython, on the other hand, zarr.create_array clearly has the argument, while zarr.Group.create_array doesn't seem to expose it. I am quite confused by the discrepancy; if it is intentional, please document it.

LLMs also suggest that

zarr.create_array("store.zarr", data=in_memory_data)

is more efficient than

arr = zarr.create_array("store.zarr", shape=in_memory_data.shape, dtype=in_memory_data.dtype)
arr[...] = in_memory_data

I have no idea whether this is true. zarr.create_array(..., data=in_memory_data) might indeed be more efficient, as the data seems to be written asynchronously. But the documentation is quite lacking on what the best practice is.
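
For concreteness, the one-step variant I mean is something like the following (my own untested sketch; I assume shape and dtype are inferred from data):

import numpy as np
import zarr

in_memory_data = np.arange(12, dtype="float64").reshape(3, 4)

# create the array and write the values in a single call;
# shape and dtype are presumably taken from `data`
arr = zarr.create_array("store.zarr", data=in_memory_data)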


This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound); a rough sketch follows below. Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.
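
What I have in mind looks roughly like this (names and sizes made up, and the zarr call details may well be off):

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr


def compute_block(i):
    # stand-in for a CPU-bound computation
    return i, np.full((1000, 1000), float(i))


if __name__ == "__main__":
    arr = zarr.create_array(
        "results.zarr", shape=(8000, 1000), chunks=(1000, 1000), dtype="float64"
    )
    with ProcessPoolExecutor() as pool:
        for i, block in pool.map(compute_block, range(8)):
            # each block is written back from the main process; the write
            # itself looks synchronous from my point of view
            arr[i * 1000:(i + 1) * 1000, :] = block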

Suggested fix for documentation

No response

DerWeh added the documentation and help wanted labels on Feb 9, 2025

d-v-b (Contributor) commented Feb 9, 2025

thanks for this issue @DerWeh. the data keyword argument for create_array was added relatively recently and it looks like I forgot to add it to the Group.create_array method. This should be a simple fix.

This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.

I agree that the documentation should say more about this. Basically all of the non-async functions (like Array.__setitem__) are designed to take advantage of concurrency. But this concurrency only helps performance if your underlying storage layer is actually async. I don't think Python's interface to the local file system is asynchronous, so there's nothing for you to leverage there. But if you were writing to cloud storage like S3, then you would gain a lot of performance from the async layer, even without accessing it directly.
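
For example, something like the following sketch (assuming s3fs/fsspec is installed and my-bucket is a hypothetical bucket you can write to) would dispatch the chunk uploads concurrently from a single __setitem__ call:

import numpy as np
import zarr

data = np.random.rand(4096, 4096)

arr = zarr.create_array(
    "s3://my-bucket/example.zarr",  # hypothetical remote store URL
    shape=data.shape,
    chunks=(512, 512),
    dtype=data.dtype,
)
# one assignment; the writes to the individual chunks are issued
# concurrently against the async store under the hood
arr[...] = data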

DerWeh (Author) commented Feb 9, 2025

Thanks for the clarification. Adding the data keyword seems simple enough. You're also right about it being recent: in version 3.0.2 it is in fact documented on rtfd, while in 3.0.1 it's not available yet. Sorry for not making sure that I was reading the latest documentation.

As far as I know, Python's standard library uses synchronous operations for files. There are, however, libraries like aiofiles (which I haven't tried so far). If I understand you correctly, we could expect performance improvements when using such a library as the storage backend?
