Update to zarrs 0.20 #87
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,7 +16,8 @@ use rayon_iter_concurrent_limit::iter_concurrent_limit; | |
use unsafe_cell_slice::UnsafeCellSlice; | ||
use zarrs::array::codec::{ArrayToBytesCodecTraits, CodecOptions, CodecOptionsBuilder}; | ||
use zarrs::array::{ | ||
- copy_fill_value_into, update_array_bytes, ArrayBytes, ArraySize, CodecChain, FillValue, | ||
+ copy_fill_value_into, update_array_bytes, ArrayBytes, ArrayBytesFixedDisjointView, ArraySize, | ||
+ CodecChain, FillValue, | ||
}; | ||
use zarrs::array_subset::ArraySubset; | ||
use zarrs::metadata::v3::MetadataV3; | ||
|
@@ -107,7 +108,7 @@ impl CodecPipelineImpl { | |
codec_options: &CodecOptions, | ||
) -> PyResult<()> { | ||
let array_shape = item.representation().shape_u64(); | ||
- if !chunk_subset.inbounds(&array_shape) { | ||
+ if !chunk_subset.inbounds_shape(&array_shape) { | ||
return Err(PyErr::new::<PyValueError, _>(format!( | ||
"chunk subset ({chunk_subset}) is out of bounds for array shape ({array_shape:?})" | ||
))); | ||
|
@@ -127,20 +128,14 @@ impl CodecPipelineImpl { | |
let chunk_bytes_old = self.retrieve_chunk_bytes(item, codec_chain, codec_options)?; | ||
|
||
// Update the chunk | ||
- let chunk_bytes_new = unsafe { | ||
- // SAFETY: | ||
- // - chunk_bytes_old is compatible with the chunk shape and data type size (validated on decoding) | ||
- // - chunk_subset is compatible with chunk_subset_bytes and the data type size (validated above) | ||
- // - chunk_subset is within the bounds of the chunk shape (validated above) | ||
- // - output bytes and output subset bytes are compatible (same data type) | ||
- update_array_bytes( | ||
- chunk_bytes_old, | ||
- &array_shape, | ||
- chunk_subset, | ||
- &chunk_subset_bytes, | ||
- data_type_size, | ||
- ) | ||
- }; | ||
+ let chunk_bytes_new = update_array_bytes( | ||
+ chunk_bytes_old, | ||
+ &array_shape, | ||
+ chunk_subset, | ||
+ &chunk_subset_bytes, | ||
+ data_type_size, | ||
+ ) | ||
+ .map_py_err::<PyRuntimeError>()?; | ||
|
||
// Store the updated chunk | ||
self.store_chunk_bytes(item, codec_chain, chunk_bytes_new, codec_options) | ||
|
@@ -270,42 +265,50 @@ impl CodecPipelineImpl { | |
// For variable length data types, need a codepath with non `_into` methods. | ||
// Collect all the subsets and copy into value on the Python side? | ||
let update_chunk_subset = |item: chunk_item::WithSubset| { | ||
+ let chunk_item::WithSubset { | ||
+ item, | ||
+ subset, | ||
+ chunk_subset, | ||
+ } = item; | ||
+ let mut output_view = unsafe { | ||
+ // TODO: Is the following correct? | ||
+ // can we guarantee that when this function is called from Python with arbitrary arguments? | ||
+ // SAFETY: chunks represent disjoint array subsets | ||
+ ArrayBytesFixedDisjointView::new( | ||
+ output, | ||
+ // TODO: why is data_type in `item`, it should be derived from `output`, no? | ||
+ item.representation() | ||
+ .data_type() | ||
+ .fixed_size() | ||
+ .ok_or("variable length data type not supported") | ||
+ .map_py_err::<PyTypeError>()?, | ||
Comment on lines +279 to +284:
Another question: Each individual item having its own data type makes no sense. We should probably pass in the data type only once. If we can rely on the output having the correct one, that would be easy, otherwise, we could make
See #90
||
+ &output_shape, | ||
+ subset, | ||
+ ) | ||
+ .map_py_err::<PyRuntimeError>()? | ||
+ }; | ||
|
||
// See zarrs::array::Array::retrieve_chunk_subset_into | ||
- if item.chunk_subset.start().iter().all(|&o| o == 0) | ||
- && item.chunk_subset.shape() == item.representation().shape_u64() | ||
+ if chunk_subset.start().iter().all(|&o| o == 0) | ||
+ && chunk_subset.shape() == item.representation().shape_u64() | ||
{ | ||
// See zarrs::array::Array::retrieve_chunk_into | ||
if let Some(chunk_encoded) = self.stores.get(&item)? { | ||
// Decode the encoded data into the output buffer | ||
let chunk_encoded: Vec<u8> = chunk_encoded.into(); | ||
- unsafe { | ||
- // SAFETY: | ||
- // - output is an array with output_shape elements of the item.representation data type, | ||
- // - item.subset is within the bounds of output_shape. | ||
- self.codec_chain.decode_into( | ||
- Cow::Owned(chunk_encoded), | ||
- item.representation(), | ||
- &output, | ||
- &output_shape, | ||
- &item.subset, | ||
- &codec_options, | ||
- ) | ||
- } | ||
+ self.codec_chain.decode_into( | ||
+ Cow::Owned(chunk_encoded), | ||
+ item.representation(), | ||
+ &mut output_view, | ||
+ &codec_options, | ||
+ ) | ||
} else { | ||
// The chunk is missing, write the fill value | ||
- unsafe { | ||
- // SAFETY: | ||
- // - data type and fill value are confirmed to be compatible when the ChunkRepresentation is created, | ||
- // - output is an array with output_shape elements of the item.representation data type, | ||
- // - item.subset is within the bounds of output_shape. | ||
- copy_fill_value_into( | ||
- item.representation().data_type(), | ||
- item.representation().fill_value(), | ||
- &output, | ||
- &output_shape, | ||
- &item.subset, | ||
- ) | ||
- } | ||
+ copy_fill_value_into( | ||
+ item.representation().data_type(), | ||
+ item.representation().fill_value(), | ||
+ &mut output_view, | ||
+ ) | ||
} | ||
} else { | ||
let input_handle = Arc::new(self.stores.decoder(&item)?); | ||
|
@@ -314,19 +317,11 @@ impl CodecPipelineImpl { | |
.clone() | ||
.partial_decoder(input_handle, item.representation(), &codec_options) | ||
.map_py_err::<PyValueError>()?; | ||
- unsafe { | ||
- // SAFETY: | ||
- // - output is an array with output_shape elements of the item.representation data type, | ||
- // - item.subset is within the bounds of output_shape. | ||
- // - item.chunk_subset has the same number of elements as item.subset. | ||
- partial_decoder.partial_decode_into( | ||
- &item.chunk_subset, | ||
- &output, | ||
- &output_shape, | ||
- &item.subset, | ||
- &codec_options, | ||
- ) | ||
- } | ||
+ partial_decoder.partial_decode_into( | ||
+ &chunk_subset, | ||
+ &mut output_view, | ||
+ &codec_options, | ||
+ ) | ||
} | ||
.map_py_err::<PyValueError>() | ||
}; | ||
|
this is the big question: is it a good idea to blindly trust subsets coming from Python?
Hm, one option would be to add a `_unsafe_skip_validation` parameter (default `False`, obviously).
When we call the function with chunks from the `zarr` Python lib, we set `_unsafe_skip_validation=True` because we know we can trust zarr, but users that are tempted to use the CodecPipeline directly need to set it to get the speed boost of not validating the chunks.
When they do set a parameter called `_unsafe_...`, it's on them to use it correctly.
But I don't think anyone should offer a regular Python API that can cause segfaults when used wrong.
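A rough sketch of how such an opt-out flag could gate validation on the Rust side. Everything here is hypothetical and for illustration only: `Subset` is a plain stand-in rather than the zarrs `ArraySubset` type, and `check_subsets` and its `unsafe_skip_validation` argument are assumed names, not part of zarrs or zarrs-python.

```rust
/// Hypothetical stand-in for an array subset: a start offset and a shape per dimension.
#[derive(Debug, Clone, PartialEq)]
struct Subset {
    start: Vec<u64>,
    shape: Vec<u64>,
}

impl Subset {
    /// True if the subset has the same dimensionality as `array_shape`
    /// and lies entirely within it (overflow counts as out of bounds).
    fn inbounds_shape(&self, array_shape: &[u64]) -> bool {
        self.start.len() == array_shape.len()
            && self.shape.len() == array_shape.len()
            && self
                .start
                .iter()
                .zip(&self.shape)
                .zip(array_shape)
                .all(|((&start, &len), &dim)| {
                    start.checked_add(len).map_or(false, |end| end <= dim)
                })
    }
}

/// Validates every subset against `output_shape` unless the caller opts out.
/// (Hypothetical helper; the flag mirrors the `_unsafe_skip_validation` idea.)
fn check_subsets(
    subsets: &[Subset],
    output_shape: &[u64],
    unsafe_skip_validation: bool,
) -> Result<(), String> {
    if unsafe_skip_validation {
        // The caller vouches for the subsets; nothing is checked.
        return Ok(());
    }
    for subset in subsets {
        if !subset.inbounds_shape(output_shape) {
            return Err(format!(
                "subset {subset:?} is out of bounds for shape {output_shape:?}"
            ));
        }
    }
    Ok(())
}
```

With the flag left at `false`, an out-of-bounds subset produces an `Err` instead of reaching any `unsafe` code; with `true`, correctness is entirely the caller's responsibility.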
If I understand you right, I think we explicitly say not to instantiate your own pipeline class or use it as an object.
Looks like we can't rely on the subsets being disjoint. zarr-developers/zarr-python#2851 (comment). Based on that comment, I suppose we would just have to iterate over overlapping subsets sequentially to match `numpy`-like behaviour.
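A minimal sketch of how one might detect overlapping subsets before choosing between a parallel and a sequential path. The `Subset` struct and both functions are hypothetical stand-ins, not zarrs types or APIs, and the pairwise check is quadratic in the number of subsets.

```rust
/// Hypothetical stand-in for an array subset: a start offset and a shape per dimension.
/// Assumes `start` and `shape` have the same length.
#[derive(Debug, Clone)]
struct Subset {
    start: Vec<u64>,
    shape: Vec<u64>,
}

/// Two hyper-rectangles overlap iff their half-open ranges intersect in every dimension.
/// Subsets of different dimensionality are reported as non-overlapping here.
fn overlaps(a: &Subset, b: &Subset) -> bool {
    a.start.len() == b.start.len()
        && (0..a.start.len()).all(|d| {
            let a_end = a.start[d] + a.shape[d];
            let b_end = b.start[d] + b.shape[d];
            // The intersection [max(starts), min(ends)) must be non-empty.
            a.start[d].max(b.start[d]) < a_end.min(b_end)
        })
}

/// Pairwise check that no two subsets share an element.
fn all_disjoint(subsets: &[Subset]) -> bool {
    subsets
        .iter()
        .enumerate()
        .all(|(i, a)| subsets[i + 1..].iter().all(|b| !overlaps(a, b)))
}
```

If `all_disjoint` returns `false`, the subsets could be processed one after another in their original order so that later writes win, which is the sequential fallback suggested above.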
So `zarrs-python` is currently unsound, lovely! Good that we caught that. I added an issue: #89
Should we first merge this PR (which would make fixing the issue easier) or will it take time until zarrs 0.20 is released?
Why is this true? The previous unsafe comments seem to be different than what's being discussed here. Also as Lachlan said in the issue, it's possible that `zarr-python` will fix this issue so that the safety assumption here would be correct.
Hmm, I could get a release out soon, but maybe hold off on merging for now. Just in case we need any more hotfixes for `zarr-python` changes in the meantime.
unless I misunderstood @LDeakin, because
which I interpreted as “the chunks coming from zarr aren’t necessarily nonoverlapping”.
if that’s correct, our current behavior
Right, ok, understood. I remember our discussion in the kitchen a few weeks ago now. Got it.