Update to zarrs 0.20 #87

Draft
wants to merge 1 commit into
base: main

Conversation

flying-sheep
Collaborator

No description provided.

Comment on lines +274 to +276
// TODO: Is the following correct?
// can we guarantee that when this function is called from Python with arbitrary arguments?
// SAFETY: chunks represent disjoint array subsets
Collaborator Author

this is the big question: is it a good idea to blindly trust subsets coming from Python?

Collaborator Author

@flying-sheep flying-sheep Feb 18, 2025

Hm, one option would be to add a _unsafe_skip_validation parameter (default False, obviously).

When we call the function with chunks from the zarr Python lib, we set _unsafe_skip_validation=True because we know we can trust zarr, but users that are tempted to use the CodecPipeline directly need to set it to get the speed boost of not validating the chunks.

When they do set a parameter called _unsafe_..., it’s on them to use it correctly.

But I don’t think anyone should offer a regular Python API that can cause segfaults when used wrong.
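
A minimal sketch of the opt-out validation idea, written in plain Rust. `Subset`, `validate_disjoint`, `check_chunks`, and the `unsafe_skip_validation` flag are hypothetical names for illustration, not the zarrs or zarrs-python API; subsets are modelled as one half-open range per dimension.

```rust
use std::ops::Range;

/// Hypothetical stand-in for an array subset: one half-open range per dimension.
#[derive(Debug, Clone, PartialEq)]
pub struct Subset(pub Vec<Range<u64>>);

impl Subset {
    /// Two hyperrectangles overlap iff their ranges intersect in every dimension.
    /// Subsets of different dimensionality are treated as non-overlapping here;
    /// a real implementation would more likely reject them outright.
    pub fn overlaps(&self, other: &Subset) -> bool {
        self.0.len() == other.0.len()
            && self
                .0
                .iter()
                .zip(&other.0)
                .all(|(a, b)| a.start.max(b.start) < a.end.min(b.end))
    }
}

/// O(n^2) pairwise check; on failure returns the indices of the first overlapping pair.
pub fn validate_disjoint(subsets: &[Subset]) -> Result<(), (usize, usize)> {
    for i in 0..subsets.len() {
        for j in (i + 1)..subsets.len() {
            if subsets[i].overlaps(&subsets[j]) {
                return Err((i, j));
            }
        }
    }
    Ok(())
}

/// Validation runs unless the caller explicitly opts out (assumed flag name).
pub fn check_chunks(subsets: &[Subset], unsafe_skip_validation: bool) -> Result<(), String> {
    if unsafe_skip_validation {
        return Ok(());
    }
    validate_disjoint(subsets).map_err(|(i, j)| format!("chunk subsets {i} and {j} overlap"))
}
```

The pairwise check is quadratic in the number of chunks, which is the cost a trusted caller would avoid by opting out.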

Owner

> but users that are tempted to use the CodecPipeline

If I understand you right, I think we explicitly say not to instantiate your own pipeline class or use it as an object.

Collaborator

> this is the big question: is it a good idea to blindly trust subsets coming from Python?

Looks like we can't rely on the subsets being disjoint: see zarr-developers/zarr-python#2851 (comment). Based on that comment, I suppose we would just have to iterate over overlapping subsets sequentially to match numpy-like behaviour.
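
A rough sketch of that sequential fallback, simplified to a 1-D byte buffer. `Write`, `Strategy`, `any_overlap`, `choose_strategy`, and `apply_sequential` are hypothetical names, not zarrs API; the point is only that applying writes strictly in order makes a later write win over an earlier one.

```rust
use std::ops::Range;

/// Hypothetical write request: fill `range` of a 1-D buffer with `value`.
pub struct Write {
    pub range: Range<usize>,
    pub value: u8,
}

#[derive(Debug, PartialEq)]
pub enum Strategy {
    Parallel,
    Sequential,
}

/// True if any two non-empty ranges intersect.
/// After sorting by start, checking neighbouring pairs is sufficient.
pub fn any_overlap(writes: &[Write]) -> bool {
    let mut ranges: Vec<&Range<usize>> = writes
        .iter()
        .map(|w| &w.range)
        .filter(|r| !r.is_empty())
        .collect();
    ranges.sort_by_key(|r| r.start);
    ranges.windows(2).any(|w| w[1].start < w[0].end)
}

/// Parallel processing is only chosen when all writes are disjoint.
pub fn choose_strategy(writes: &[Write]) -> Strategy {
    if any_overlap(writes) {
        Strategy::Sequential
    } else {
        Strategy::Parallel
    }
}

/// Applies writes strictly in order, so a later write overwrites an earlier one.
/// Panics if a range lies outside the buffer.
pub fn apply_sequential(buf: &mut [u8], writes: &[Write]) {
    for w in writes {
        buf[w.range.clone()].fill(w.value);
    }
}
```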

Collaborator Author

So zarrs-python is currently unsound, lovely! Good that we caught that. I added an issue: #89

Should we first merge this PR (which would make fixing the issue easier) or will it take time until zarrs 0.20 is released?

Owner

> So zarrs-python is currently unsound, lovely!

Why is this true? The previous unsafe comments seem to be different than what's being discussed here. Also as Lachlan said in the issue, it's possible that zarr-python will fix this issue so that the safety assumption here would be correct.

Collaborator

Hmm, I could get a release out soon, but maybe hold off on merging for now. Just in case we need any more hotfixes for zarr-python changes in the meantime.

Collaborator Author

@flying-sheep flying-sheep Feb 24, 2025

> Why is this true?

Unless I misunderstood @LDeakin, because of

> Looks like we can't rely on the subsets being disjoint

which I interpreted as “the chunks coming from zarr aren’t necessarily non-overlapping”.

if that’s correct, our current behavior

  1. is unsound, as our parallel writers can end up simultaneously writing the same memory regions, which is UB
  2. even if we used fine-grained locking to avoid UB, we wouldn’t guarantee that the last data written is the one zarr expects us to write, so we’d still be wrong.
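
One possible way to keep some parallelism while addressing both points, sketched under assumptions (nobody in this thread proposed it, and `ordered_batches` is a hypothetical helper, not zarrs API): split the writes into consecutive batches whose members are pairwise disjoint. Each batch could then be written in parallel, with batches processed one after another. Two overlapping writes always land in different batches, in their original order, so the later write still wins.

```rust
use std::ops::Range;

/// Half-open 1-D ranges overlap iff their intersection is non-empty.
fn overlaps(a: &Range<usize>, b: &Range<usize>) -> bool {
    a.start.max(b.start) < a.end.min(b.end)
}

/// Splits `ranges` into consecutive batches of indices such that the ranges
/// within one batch are pairwise disjoint. Input order is preserved both
/// inside each batch and across batches.
pub fn ordered_batches(ranges: &[Range<usize>]) -> Vec<Vec<usize>> {
    let mut batches: Vec<Vec<usize>> = Vec::new();
    for (i, r) in ranges.iter().enumerate() {
        // The range joins the current batch only if it touches none of its members.
        let fits = batches
            .last()
            .map_or(false, |batch| batch.iter().all(|&j| !overlaps(r, &ranges[j])));
        if fits {
            batches.last_mut().unwrap().push(i);
        } else {
            batches.push(vec![i]);
        }
    }
    batches
}
```

The greedy split is not minimal in the number of batches; it only guarantees that ordering between overlapping writes is kept.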

Owner

Right, ok, understood. I remember our discussion in the kitchen a few weeks ago now. Got it.

Comment on lines +279 to +284
// TODO: why is data_type in `item`, it should be derived from `output`, no?
item.representation()
    .data_type()
    .fixed_size()
    .ok_or("variable length data type not supported")
    .map_py_err::<PyTypeError>()?,
Collaborator Author

Another question: Each individual item having its own data type makes no sense.

We should probably pass in the data type only once. If we can rely on the output having the correct one, that would be easy; otherwise, we could make chunk_descriptions a struct containing the dtype and the chunk items.
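
A hedged sketch of what such a struct could look like. `ChunkItem`, `ChunkDescriptions`, and their fields are hypothetical and do not mirror the actual zarrs-python types; the data type is reduced to an optional fixed element size, with `None` standing in for a variable-length type.

```rust
/// Hypothetical per-chunk description: it carries no data type of its own.
pub struct ChunkItem {
    pub key: String,
    pub shape: Vec<u64>,
}

/// Hypothetical batch of chunks with the data type stated once for all items.
pub struct ChunkDescriptions {
    /// Size in bytes of one element; `None` models a variable-length data type.
    pub element_size: Option<usize>,
    pub items: Vec<ChunkItem>,
}

impl ChunkDescriptions {
    /// Single place where the fixed-size requirement is checked.
    pub fn fixed_element_size(&self) -> Result<usize, &'static str> {
        self.element_size
            .ok_or("variable length data type not supported")
    }

    /// Decoded size in bytes of every chunk, using the shared element size.
    pub fn chunk_byte_sizes(&self) -> Result<Vec<usize>, &'static str> {
        let size = self.fixed_element_size()?;
        Ok(self
            .items
            .iter()
            .map(|item| item.shape.iter().product::<u64>() as usize * size)
            .collect())
    }
}
```

With this layout the variable-length check happens once per call rather than once per item.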

Collaborator

See #90
