-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Functions or accessors for serialization API? #389
Comments
FYI they don't really add state in a reliable way anyway, see pydata/xarray#3268. Accessors really should only be thought of as a way to register a method on an xarray class behind a dedicated namespace.
These are annoying I agree, but the arguments in favour of using the accessor (as opposed to
I personally think that these advantages are more important than the lack of type-checking and auto-completion. Also note that I do kind of want to change the name of the accessor (#241). |
The lacking code completion (at least when that is not built on top of type hints) is a bug in I'm no expert on type hinting, but typing accessors does indeed look tricky, since these are dynamically attached to |
Yeah I was also wondering this, if we're able to provide dynamic auto-completion for the dataset key |
See pydata/xarray#9985: this is pretty easy to get to work |
I agree that the statefulness is unreliable, but it is still one of the main reasons developers register an accessor based on experience from PyGMT and watching the development of GeoXarray. It is even given as the second reason in the xarray docs - https://docs.xarray.dev/en/stable/internals/extending-xarray.html#writing-custom-accessors. Given the frustrations many projects have found with the unreliability of statefulness via accessors, it may be worth not including as a motivation or even explicitly discouraging in the xarray docs with an alternative recommendation for how to build stateful extensions. |
Totally fair - please raise an issue to suggest that on Xarray (or comment on the linked issue). Those criticisms of stateful accessors are unrelated to us using them in VirtualiZarr though - VirtualiZarr is a perfect use case for them in that we have no need to cache state. |
@TomNicholas, thanks for starting this discussion. Here's my 2 cents, but since I'm a total noob to this xarray, zarr, etc. world, please set me straight if you see places where I may be misunderstanding or missing knowledge or context about any of this.
Thanks for this link. That's a good discussion and seems to be a clear indicator that we should not be making stateful accessors, except for cases where we're simply aiding performance (and do not rely on an accessor being copied when its owning object is copied).
I certainly understand the general desire to make things "familiar," but I'm wondering if this goal is actually contributing to some of the confusion mentioned in the issue you linked. Currently, as I see it, this correlates to duck typing: If it looks like a duck, you expect it to act like a duck, but if it doesn't act like a duck, then you may be left wondering why it looks like one. Of course, this doesn't mean things need to look completely unrecognizable, but there should perhaps be enough difference in appearance to match the difference in behavior, to eliminate (or at least significantly reduce) the chances of confusion.
I agree that being able to tie into method chaining is a worthy goal, but I'd argue that it only really make sense for "non-terminal" operations. In other words, all of these Using your example, if I want to do something with the dataset (or simply return it out of a function) as well as write it to zarr, I'm not going to end my chain with [NOTE: I realize that I'm blurring the distinction between existing methods and accessor methods (with their backing functions), but I get more specific farther below]
In this case, we might as well simply have a Alternatively, to make such
Therefore, regarding the need for accessors, I would argue that this is a false sense of need when we're not keeping state, and in fact causes us to write and maintain code that we don't need to (i.e., the accessor code), particularly because the accessors simply pass through to backing functions. We can instead eliminate such accessors and use the backing functions directly via the The Continuing from the example above, we can do this, if we don't want to chain:
OR, if you really want to chain:
Further, if you modify
See above regarding the
Using our Here are additional points that I find problematic with custom accessors:
|
Whoa what a good write up @chuckwondo! The combination of arbitrary data variable access on xarray datasets and accessors in particular really strikes a cord with me. But the legacy of pandas is that people are really really used to methods. Functions feel harder to discover and use. The fact that the docs and code snippets so often use |
Thanks. I'm glad you appreciate it. I was a bit hesitant to get on my development "soap box" so early in my entry into this community. I don't want to immediately wear out the warm welcome I've received.
This is certainly a tricky balance. Generally speaking (not only for what we're doing here), on the one hand, we want to do what we can to develop code in a way that feels familiar to folks. This alone can be tricky because people come with widely varying backgrounds, so nothing is ever going to feel familiar to everybody, but hopefully we can make things familiar to a significant swath of folks. On the other hand, when we start going through all sorts of contortions to achieve an overly high level of familiarity, we can end up with an overly contorted result that is perhaps worse than what we're trying to avoid. Yes, many people may very well be used to methods (IMHO, classes in Python are overused, but that's a separate soap box), but functions are certainly not alien (they are also familiar to folks), so going to great lengths to avoid functions at seemingly all costs so that everything is a method can become counterproductive, IMHO. I feel that putting such great effort into custom accessors can be counterproductive based upon the details I gave above. I suspect most folks also know that something they pull out of a box is not going to define every possible desired functionality in the methods provided out of the box, and that there will be additional libraries required (or even just your own code, never mind an additional library) to provide additional functionality, thus requiring the use of functions. I mean, if xarray or zarr already had everything built in, this repo wouldn't need to exist, would it? The various Regarding the "discoverability" of functions (or lack thereof relative to methods), one way to help mitigate this issue is by encouraging the use of (dedicated) namespaces, so that you can discover functions through their namespaces (via code completion) similar to how you can do so for methods. As far as I understand it, this is precisely the motivation for accessor registration. Unfortunately, this approach is fraught with the problems I enumerated in my previous comment (and I suspect others that haven't even crossed my mind yet). Therefore, we may perhaps achieve something that is similar enough to such accessors, but without their pitfalls, by defining dedicated namespaces. For example, instead of registering an accessor named "virtualize" on xarray.Dataset, perhaps we expose a namespace named So instead of doing something like this (following on from the earlier example): # Huh? What do I need to do to get the "virtualize" accessor registered?
# There is no explicit connect between this import and the magical "virtualize" accessor
import virtualizarr
# 1. potential for unresolvable namespace collision of "virtualize"
# 2. code completion is currently "broken", so not only is "virtualize" itself
# undiscoverable, so are all of its methods
# 3. one might argue that chaining an attribute onto a method chain in order
# to get access to another method is no more familiar than using `pipe` (see below)
ds.sel(...).drop_vars(...).virtualize.to_zarr() we might do something like so: from virtualizarr import virtualize
# 1. not possible for attribute name collisions
# 2. namespace collisions can be avoided via import aliasing
# 3. automatic discoverability of functions, as discoverability does not require an xarray fix
ds.sel(...).drop_vars(...).pipe(virtualize.to_zarr) or, to address what was mentioned earlier about perhaps wanting a "virtualize" accessor for xr.Dataset, xr.DataArray, and xr.DataTree, we can create a corresponding namespace for each, along these lines perhaps:
Aside from being able to avoid name collisions (i.e., we can alias
@TomNicholas linked to this very relevant issue discussing this very idea: VirtualDataset class instead of "virtual xr.Dataset", which may have slipped your mind, as you commented there several months ago :-) Subclassing can bring its own set of gnarly problems. |
this is not an issue if you use dictionary getitem syntax ( With that, the only thing that can clash would be third-party accessors and built-in methods / properties / accessors.
The only way we can resolve this currently is by loose ecosystem coordination, which has worked pretty well so far (I have not seen any complaints about colliding accessor names, yet, even though they have been around for at least 6 years). However, maybe we can create a page in the xarray docs that describe which names are already taken?
Technically it is possible to create aliases: if you import the accessor class directly and call from pint_xarray import PintDatasetAccessor
import xarray as xr
import cf_xarray.units
xr.register_dataset_accessor("units")(PintDatasetAccessor)
ds = xr.tutorial.open_dataset("air_temperature").units.quantify() The only issue with that is that the accessor class itself is usually not considered public API, so if someone decided to rename it the example would break.
indeed, and I'm using |
@keewis, thanks for chiming in. I appreciate the additional insights. You have addressed some of my points, but I believe some of your responses may have confirmed one of my other points:
[The remainder of this reply is greatly abridged from the original. I was going too far down the rabbit hole.] I suppose my basic premise is this: I find that accessor registration introduces many potential problems and potential for confusion/accidents that are not worth the small modicum of syntax familiarity they attempt to achieve, and it all boils down to their being implicit rather than explicit. In other words, they are not "opt in," they are only "opt out," but not easily/conveniently/completely so, which leaves the door open for the proverbial Mr. Murphy to enter. Therefore, what I'd like to do is find a means for "cleanly" (or relatively so) supporting both the use of an accessor as well as the use of functions with the Here's what I believe this would mean:
For example, the removal of the internal decorator might be replaced with this top-level, public function: def register_dataset_accessor(name: str):
# We would remove the decorator from the VirtualiZarrDatasetAccessor declaration itself
xr.register_dataset_accessor(name)(VirtualiZarrDatasetAccessor) and could be used like so: import virtualizarr
# This assumes virtualizarr defines only 1 Dataset accessor class, so
# users need only specify what name they want to use.
virtualizarr.register_dataset_accessor("virtualize") For those that want to use Perhaps something like so: # This contains the functions that can be used directly with `pipe`,
# and *only* such functions to keep this namespace from being otherwise cluttered.
import virtualizarr.dataset as VDS
ds = <expression creating a Dataset>
ds.sel(...).drop_vars(...).pipe(VDS.to_zarr) |
@chuckwondo, all of these criticisms are valid (and I agree with a several of them), but they are criticisms of a design pattern that is well-established in xarray (and as @jsignell says also previously pandas, and before that arguably in numpy). Like it or not this is a downstream library, so unless you change this in xarray upstream, there is some tension between breaking an established (anti-)pattern and presenting an unfamiliar API. On specifics:
I agree that
There is a difference: the accessor method is protected behind a dedicated namespace. It's good that you have to go out of your way to call The vds.sel(...).drop_vars(...).pipe(virtualizarr.dataset_to_icechunk, icechunkstore) is significantly more convenient than this syntax: vds.sel(...).drop_vars(...).virtualize.to_icechunk(icechunkstore) Ultimately from the CPython compiler's perspective there is no difference between an accessor method and a function anyway. As "everything is a class" you can alias the methods as functions like this: from virtualizarr.accessor import VirtualiZarrDatasetAccessor
dataset_to_icechunk = VirtualiZarrDatasetAccessor.to_icechunk
dataset_to_icechunk(vds, icechunkstore) This also would be an escape hatch for the (exceedingly rare) case of a name collision.
Neither this nor your other suggestions solve the type dispatch problem, but accessor methods solve it fairly neatly.
This is a problem for some libraries (e.g. pint-xarray), but not for virtualizarr, because you'll never be able to get a virtual dataset object without first using importing
Given that xarray/numpy/python's type hinting is not ready (there's an xarray issue for this somewhere...) to even represent the distinction between But as I said in #171 a Overall I feel that what we have now is a reasonable compromise (especially if we can get accessor auto-completion to work via pydata/xarray#9985). |
Not at all, I'm pleased to see the thoughtful and constructive engagement! However, I must say I feel like this question in particular has become a mountain from a molehill. It really doesn't matter very much whether we use an accessor or a function: the differences are minimal, the preferred option debatable, the signatures equivalent, it doesn't enable new functionality, it doesn't present a significant backwards-compatibility risk (i.e. we can easily switch later), it has no complex interactions with other libraries, nor any bearing on later design decisions within this library. In contrast, we have a lot of urgent work to be doing around zarr-python v3 + kerchunk compatibility (#392), and after that we have a lot of finicky work to separate concerns with CF conventions (#157), and after that a lot of difficult spec work to generalize zarr's data model to support virtualizing a wider range of real-world datasets (see all these issues). With that context, I'm personally going to prioritise working on the other issues I just linked, but will leave this issue open so that anyone else who feels strongly about this can chime in to add weight behind either choice. |
Thanks so much @TomNicholas!
Indeed. The longer version of my reply (that I ended up significantly shortening) recognized this, and indicated that this would be a "problem" to be solved in those other libraries, if there were ever any desire to do so, and not something we would "solve" here. Apologies that I accidentally axed that completely in my shortened reply.
Just to be clear, I was indicating a possible name collision between the implicit accessor name ("virtualize" in our case) and other attributes, not between the methods of the accessor and other attributes. Of course, I don't suspect there's much chance that a data array would be named "virtualize," but just pointing out the possibility.
Just to clarify, I don't believe I was suggesting that the use of Rather, I was suggesting that the use of
Just for my own edification, can you clarify what you mean by this? I think what I suggested does solve this, but I suspect I'm on a different wavelength here, or I didn't explain myself well on this point.
Right, but either way, that's part of my concern. Since either way everything is implicit, it's hard to tell what actually triggers registration, so it's confusing to know how to either intentionally force registration or intentionally avoid registration. Again, I realize we can't solve that here and there's lots of precedent established in the other libraries you mentioned. I just find the implicit nature of accessors the crux of my frustration.
I absolutely agree with you here. I absolutely would not want to see a
Getting auto-completion to work would be a massive help. Would you be willing to make public the functions backing the accessor methods, so that those who wish to use Thanks so much for your thoughtful reply and your patience with my going down a rabbit hole. |
Originally posted by @chuckwondo in #341
The text was updated successfully, but these errors were encountered: