Deadlock when trying to open same zarr file with multiple processes #2868
Unanswered · relativityhd asked this question in Q&A · Replies: 1 comment, 3 replies
-
Zarr by itself is not capable of providing safe concurrent modification of metadata from multiple uncoordinated processes, as in your example. There are inevitable race conditions and deadlocks. It's up to the user's code to avoid these situations. I would highly recommend exploring Icechunk for this scenario. Icechunk augments Zarr with a transactional storage engine. With Icechunk as your store, each process can commit its changes in a safe way via an ACID transaction.
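A minimal sketch of that pattern, assuming Icechunk's Python API (names such as `local_filesystem_storage`, `open_or_create`, and `writable_session` are taken from its docs and may differ between versions; the repo path and attr key are placeholders):

```python
import icechunk
import zarr

# Each process opens the repository and works in its own session.
storage = icechunk.local_filesystem_storage("/tmp/aux_cube_repo")
repo = icechunk.Repository.open_or_create(storage)

session = repo.writable_session("main")
root = zarr.open_group(store=session.store, mode="a")
root.attrs["downloaded_ids"] = ["tile_1"]  # changes are staged in this session

# The commit either succeeds atomically or raises a conflict error,
# which the caller can resolve by rebasing and retrying.
session.commit("register downloaded tile_1")
```

If two processes commit concurrently, one commit fails with a conflict instead of silently corrupting metadata, which is exactly the guarantee plain Zarr stores cannot give you.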
-
Hello everyone, I have a question about a somewhat specific multiprocessing problem.
Long story short, I run into deadlocks when trying to open the same zarr file from multiple processes. Here is some minimal code:
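(A minimal reconstruction of the kind of code I mean; file name, shape, and process count are placeholders, and on Linux this uses the default "fork" start method:)

```python
import multiprocessing as mp

import zarr


def worker(i: int) -> None:
    # The child process hangs inside this call.
    z = zarr.open("data.zarr", mode="a")
    print(f"process {i} opened {z}")


if __name__ == "__main__":
    # The parent touches the store first, which starts zarr's background IO machinery.
    zarr.open("data.zarr", mode="a", shape=(10,), chunks=(5,), dtype="i4")
    print("parent opened the array")

    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```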
This produces some initial output and then stops doing anything. The reason seems to be the `concurrent.futures.wait` call in zarr's `sync` function, which blocks while trying to acquire a threading lock. By the way, the same thing happens when I create (and start) only a single process.
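For context: as far as I can tell, zarr v3 runs its async I/O on a background event-loop thread and blocks on it roughly like this (a simplified paraphrase of the idea, not zarr's actual code):

```python
import asyncio
import concurrent.futures


def sync(coro, loop: asyncio.AbstractEventLoop):
    # Schedule the coroutine on the background IO loop thread...
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    # ...and block the calling thread until it finishes.
    done, _ = concurrent.futures.wait([future])
    return future.result()
```

My suspicion is that after a fork the child process inherits the loop object but not the thread that actually runs it, so the wait can never complete.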
Is this intentional behavior? What are possible workarounds or better solutions / approaches?
About my concrete use case
I want to build a multiprocessing AND multithreading capable pipeline (the user should choose; ideally this also makes the pipeline compatible with e.g. dask or ray) that uses the same underlying datacube as auxiliary data.
This datacube should be procedurally downloaded and filled. The original auxiliary data is stored as e.g. GeoTiffs on a server, hence each download has an id. Each GeoTiff is then stored into the relevant chunks, roughly as in the sketch below. It could be seen as a very well organized (thanks to zarr and xarray! <3) and reusable download cache.
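(Hypothetical sketch of storing one downloaded tile; it assumes rioxarray, an already initialized and spatially aligned cube, and a `region` that lines up with the cube's chunk boundaries:)

```python
import rioxarray  # assumed dependency for reading GeoTiffs


def store_tile(tiff_path: str, cube_path: str, region: dict) -> None:
    # Read the downloaded GeoTiff as an xarray DataArray...
    tile = rioxarray.open_rasterio(tiff_path)
    # ...and write it into the matching slice of the existing zarr cube.
    # The region keys must be dimension names of the cube, e.g.
    # {"x": slice(0, 512), "y": slice(0, 512)}.
    tile.to_dataset(name="aux").to_zarr(cube_path, region=region)
```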
I want to limit the number of concurrent downloads to one, but still let processes that don't rely on the current (or queued) download continue processing the data. For that I catalogue the queued, currently downloading, and already downloaded ids in the zarr attrs, as sketched below. And to have access to these, I need to open the zarr array in multiple processes.
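(A rough sketch of the cataloguing idea; function and key names are made up for illustration, and the read-modify-write on the shared attrs is precisely the step that needs coordination between processes:)

```python
import zarr


def mark_queued(cube_path: str, tile_id: str) -> None:
    root = zarr.open_group(cube_path, mode="a")
    # Read-modify-write on the shared attrs: racy if two
    # uncoordinated processes do this at the same time.
    queued = set(root.attrs.get("queued", []))
    queued.add(tile_id)
    root.attrs["queued"] = sorted(queued)
```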
Maybe this helps to understand my vision step by step: imagine I want to process three images a, b, and c simultaneously. a depends on geotiffs 1 and 2, b depends only on 1, and c depends on geotiffs 2 and 3.
Geotiff 1 is already stored in the datacube.
This approach works very well in a non-multiprocessed version, and I was already able to get it running with multiple threads. However, since threads in Python are only useful for (network) IO-bound tasks and not for compute-bound tasks, I also want to support multiprocessing.
I currently plan to turn this functionality into a library; of course I will share it when it's ready. :)