Replies: 2 comments 2 replies
We have put a solid amount of work into a de-attrification effort in this PR: #1660. I think the basic design we are using is sound, but I would appreciate some feedback from other developers.
# Data modelling for `zarr-python`

The zarr specifications define the properties of zarr metadata documents. In order to faithfully implement the specifications, `zarr-python` needs to read and write correct metadata documents, and reject incorrect metadata documents. In API terms, I think this means we need a class per metadata document, where the properties of that class map on to attributes described in the metadata document, with simple serialization to / from the native format of the metadata document (JSON). I think it's OK if, where it's helpful, the attributes of the metadata class are not themselves JSON serializable -- e.g., I think the v2 `array.dtype` should be represented as a `np.dtype` object, even though it gets serialized to JSON as a string, but maybe this can be a discussion point.
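To make the dtype point concrete, here is a minimal sketch of a class-per-metadata-document with a non-JSON attribute. The class name and field selection are hypothetical and heavily simplified (this is not the actual `zarr-python` API, and a real v2 document has more keys); a dataclass is used only to keep the sketch short:

```python
from __future__ import annotations

import json
from dataclasses import dataclass

import numpy as np


@dataclass
class ArrayMetadataSketch:
    """Hypothetical, simplified stand-in for a v2 array metadata class."""

    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    dtype: np.dtype  # rich in-memory representation...

    def to_dict(self) -> dict[str, object]:
        # ...serialized the way the spec expects: dtype as a string.
        return {
            "shape": list(self.shape),
            "chunks": list(self.chunks),
            "dtype": self.dtype.str,
        }

    def to_json(self) -> str:
        return json.dumps(self.to_dict())


meta = ArrayMetadataSketch(shape=(100, 100), chunks=(10, 10), dtype=np.dtype("float64"))
print(meta.to_json())  # {"shape": [100, 100], "chunks": [10, 10], "dtype": "<f8"}
```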
Accordingly, `zarr-python` should not represent data structures (arrays or groups) if those data structures cannot be serialized to spec-compliant metadata documents. Simple enough, but the current version of `zarr-python` does not address this challenge head-on. For example, it was recently possible to create `zarr.Array` instances with irregular chunk sizes, even though this is not valid according to the zarr v2 spec. So we have room for improvement in this area. That's the topic of this discussion.

## type- and value-level modelling for array and group objects
Runtime validation is necessary to ensure correctness at runtime, but type annotations are also important for keeping development friction low. I think we need both. Fortunately, the specs aren't too complicated, and they don't change much, so writing validation code and annotating types won't be a big burden.
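As a small, hypothetical illustration of combining the two: a parsing helper can do the runtime checks while its signature carries the static type. `parse_shape` is a made-up name here, not an existing `zarr-python` function:

```python
# Hypothetical helper: runtime validation plus a precise static return type.
def parse_shape(data: object) -> tuple[int, ...]:
    if not isinstance(data, (list, tuple)):
        raise TypeError(f"Expected a list or tuple of ints, got {type(data)}")
    if not all(isinstance(x, int) and x >= 0 for x in data):
        raise ValueError(f"Expected non-negative ints, got {data!r}")
    return tuple(data)
```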
## python tools for data modelling
There are a variety of python libraries for data modelling: `dataclasses`, `attrs`, `marshmallow`, `pydantic`, etc. `zarrita` uses `attrs`, and I have worked extensively with `pydantic`
in personal projects. In short, these libraries make it very easy to:

- quickly define new data models with typed attributes
- validate attribute values at runtime
- serialize and deserialize model instances to and from dicts / JSON (a small `attrs` sketch follows below)
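For concreteness, a rough sketch of what this looks like with `attrs` (made-up class and validator, in the spirit of `zarrita` rather than its actual code):

```python
import attrs


def _check_dims(instance, attribute, value):
    # Illustrative validator: every entry must be a non-negative int.
    if not all(isinstance(x, int) and x >= 0 for x in value):
        raise ValueError(f"{attribute.name} must be non-negative ints, got {value!r}")


@attrs.define
class ArrayModel:
    shape: tuple[int, ...] = attrs.field(validator=_check_dims)
    chunks: tuple[int, ...] = attrs.field(validator=_check_dims)


meta = ArrayModel(shape=(100, 100), chunks=(10, 10))
attrs.asdict(meta)  # -> {'shape': [100, 100], 'chunks': [10, 10]}
```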
Should we add one of these libraries as a dependency for `zarr-python` in v3? Going with `attrs` would be straightforward, since we are bootstrapping our v3 efforts with @normanrz's work in `zarrita`, which uses `attrs`. Others have proposed depending on `pydantic`, and `dataclasses` looks good if we don't want external dependencies.

## my proposal: do it ourselves
I don't think we need a data modelling library for `zarr-python`. First, I don't think it's important for us to quickly create new data models, because the set of things we have to model in `zarr-python` is largely static (i.e., the contents of the zarr specifications). By being on the other side of that tradeoff, we get increased flexibility. For the rest of the bullet points listed above, we can do that with vanilla, undecorated classes. Here is how I am currently approaching this in my v3 WIP branch (note that as of this writing that branch only contains an outline of this strategy):

- Each modelled entity (e.g., `array`, `group`) has typed attributes that structurally match a metadata document described in a zarr specification.
- For `{name: <>, config: <>}`-style attributes in v3, or codecs in v2, each of these attributes is modeled the same way as the broader metadata document.
- Each class has `to_dict` / `from_dict` methods and `to_json` / `from_json` methods. Because of the constraint described above, nested `to_dict` calls work by the nesting class calling `to_dict` on its nested attributes. There are no other methods. These classes exist as data.
- Each class has a companion `TypedDict` class that defines the return type of `to_dict` and the accepted type of `from_dict`.
- Attribute mutation is guarded by defining `__setattr__` and `__delattr__` on these classes.
- Validation lives in stand-alone parsing functions that check metadata documents for correctness (e.g., that the `shape` and `chunks` attributes are consistent).
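To make the list above concrete, here is a rough, heavily simplified sketch of the strategy. All names are hypothetical -- this is not the code in the WIP branch -- and the `__setattr__` / `__delattr__` handling is omitted for brevity; the point is the shape of the API: typed attributes, `to_dict` / `from_dict` and `to_json` / `from_json`, a companion `TypedDict`, and stand-alone parsing functions:

```python
from __future__ import annotations

import json
from typing import TypedDict


class ArrayMetadataDict(TypedDict):
    # Companion TypedDict: the dict accepted by from_dict / returned by to_dict.
    shape: list[int]
    chunks: list[int]
    dtype: str


def parse_dimensions(data: object) -> tuple[int, ...]:
    # Stand-alone parsing routine: usable here, or imported by third parties.
    if not isinstance(data, (list, tuple)) or not all(isinstance(x, int) for x in data):
        raise TypeError(f"Expected a sequence of ints, got {data!r}")
    return tuple(data)


def check_shape_chunks_consistent(shape: tuple[int, ...], chunks: tuple[int, ...]) -> None:
    # Cross-attribute check: shape and chunks must have the same number of dimensions.
    if len(shape) != len(chunks):
        raise ValueError(f"shape {shape} and chunks {chunks} have different lengths")


class ArrayMetadata:
    """Vanilla, undecorated class with typed attributes mirroring a metadata document."""

    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    dtype: str

    def __init__(self, shape: tuple[int, ...], chunks: tuple[int, ...], dtype: str) -> None:
        shape = parse_dimensions(shape)
        chunks = parse_dimensions(chunks)
        check_shape_chunks_consistent(shape, chunks)
        self.shape = shape
        self.chunks = chunks
        self.dtype = dtype

    def to_dict(self) -> ArrayMetadataDict:
        return {"shape": list(self.shape), "chunks": list(self.chunks), "dtype": self.dtype}

    @classmethod
    def from_dict(cls, data: ArrayMetadataDict) -> ArrayMetadata:
        return cls(shape=tuple(data["shape"]), chunks=tuple(data["chunks"]), dtype=data["dtype"])

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, data: str) -> ArrayMetadata:
        return cls.from_dict(json.loads(data))
```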
I think if we use a strategy approximately like this, then we don't have to define our classes according to the rules of a particular data modelling library, but we expose an API that can be used as scaffolding for someone who does want to integrate zarr into a particular data modelling framework. For example, by making the parsing routines stand-alone functions, `pydantic` users can just import those functions to create `pydantic` models for zarr.
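A hypothetical illustration of that reuse (assuming pydantic v2 for the `field_validator` API, and re-stating the made-up `parse_dimensions` helper from the sketch above):

```python
from pydantic import BaseModel, field_validator


def parse_dimensions(data: object) -> tuple[int, ...]:
    # Same stand-alone helper as in the sketch above (hypothetical, not zarr-python API).
    if not isinstance(data, (list, tuple)) or not all(isinstance(x, int) for x in data):
        raise TypeError(f"Expected a sequence of ints, got {data!r}")
    return tuple(data)


class PydanticArrayMetadata(BaseModel):
    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    dtype: str

    # Reuse the stand-alone parsing routine as a pydantic validator.
    @field_validator("shape", "chunks", mode="before")
    @classmethod
    def _parse_dims(cls, value: object) -> tuple[int, ...]:
        return parse_dimensions(value)
```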
I would love to hear thoughts from others about this!