Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill gaps limited 7665 #9402

Open
wants to merge 37 commits into
base: main
Choose a base branch
from

Conversation

Ockenfuss
Copy link
Contributor

@Ockenfuss Ockenfuss commented Aug 23, 2024

Improve gap filling

This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function fill_gaps was introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.

TBD if accepted:

  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
  • Add Example in Documentation

Deprecations

  • Using limit in interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems extremely good!!

I left some refine-y comments, nothing preclusive.

What do others think? I don't use interpolate_na myself much so other may have views...

xarray/core/dataarray.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
| datetime.timedelta
) = None,
limit_direction: LimitDirectionOptions = "both",
limit_area: LimitAreaOptions | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refine-y but:

  • are interpolate and extrapolate clearer options?
  • is there a better name for the param? I'm not sure. "area" sounds 2D to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I adopted this nomenclature from pandas. What about limit_region, does this sound less 2D?
Personally, I like the 'inside'/'outside' options, 'interpolate' sounds close to linear interpolation to me...

xarray/core/dataset.py Outdated Show resolved Hide resolved
@@ -3766,6 +3769,160 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self:

return bfill(self, dim, limit=limit)

def fill_gaps(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts from others on the naming? Would .fill be insufficiently specific that it's filling na? Would fill_missing be clearer than fill_gaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to any of those. .fill sounds very concise, but maybe this is easily confused with .ffill

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think .fill could be quite nice, do others have a view?

Copy link
Contributor

@dcherian dcherian Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe gap_filler instead, since this method does not actually fill the gaps.

I'm also wondering if its better to have a method that constructs the appropriate mask that can be used later

mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)

Copy link
Contributor Author

@Ockenfuss Ockenfuss Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points! Just to explain:

gap_filler emphasizes the returned object type nicely! However, I choose fill_gaps because it fits the naming scheme of other object-returning functions better (e.g. rolling and coarsen are not called roller and coarser in xarray, even though the operation is not perfomed immediately and an object is returned).
Ultimately (I am a non-native english speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler, I will change accordingly.

The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~). If the mask is required, you can easily get it from the object:

mask=ds.gap_filler(...).mask

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Show resolved Hide resolved
@Ockenfuss
Copy link
Contributor Author

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...)

xarray/core/dataarray.py Outdated Show resolved Hide resolved
@Ockenfuss
Copy link
Contributor Author

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

@keewis keewis linked an issue Aug 26, 2024 that may be closed by this pull request
@headtr1ck
Copy link
Collaborator

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

I'm currently on holiday and it's difficult to debug these errors on the phone.
I can do it in 1-2 weeks or someone else can help out?

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Show resolved Hide resolved
@max-sixty
Copy link
Collaborator

max-sixty commented Sep 10, 2024

There are a couple of outstanding questions about the interfaces; in particular:

Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those)

What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...)

@Ockenfuss
Copy link
Contributor Author

Ockenfuss commented Sep 11, 2024

One more naming decision:

  • Fill gaps limited 7665 #9402 (comment)
    I am on conference this week, but mostly available the next two weeks, if anyone wants to discuss. Otherwise, I will join the next dev meeting on Sep. 25.

@max-sixty
Copy link
Collaborator

Did you manage to discuss this this morning?

@kmuehlbauer
Copy link
Contributor

Did you manage to discuss this this morning?

@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here.

@Ockenfuss
Copy link
Contributor Author

Hi @max-sixty, not too much community feedback yet :) What do you think, shall we have another look on the three open naming issues and then push this over the finish line?
I still think this is quite useful functionality for all experimentalists, where gaps of all sizes are common in measurement data.

@max-sixty
Copy link
Collaborator

What do you think, shall we have another look on the three open naming issues and then push this over the finish line?

Yes let's do it! I would bias for action; if folks disagree then dissent welcome.

What's remaining? Personally I think we should simplify this: #9402 (comment) + resolve any names. The external types should be correct; internally some ignores are OK...

@Ockenfuss Ockenfuss force-pushed the fill_gaps_limited_7665 branch from b5025ff to 72c76db Compare January 21, 2025 17:55
@Ockenfuss
Copy link
Contributor Author

Ok, I think from my side this is ready for merge. I stayed with the fill_gaps and limit_area names for now. Check out the doc build pages for fill_gaps and the GapMask object to get a feeling for it and let me know if it feels right :)

This PR leaves all existing fill function API untouched. If we want to bring some of the new limit arguments to the existing functions (like DataArray.fillna(limit=...)), this could be done later in separate PRs.

TBD:

  • Update Whatsnew

@max-sixty
Copy link
Collaborator

Looks great, I think this is indeed very close to landing.

The one thing I'm stuck on is the concept of fill_direction="backward" ... .ffill... — can we explain what the purpose of this is? My mental model is that we're filling missing data from an edge in a single direction; I don't understand how what's going on to fill backwards in a forward fill.

(Totally fine if others get this, I'm not the gatekeeper here, but do think it's important the API makes sense...)

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
@Ockenfuss
Copy link
Contributor Author

Ockenfuss commented Jan 23, 2025

See this comment for an explanation, or just try out what happens :) For example:

da=xr.DataArray([1, np.nan, np.nan,  2], dims=['x'])
da.fill_gaps('x', limit=1, limit_direction='backward').ffill()
<xarray.DataArray (x: 4)> Size: 32B
array([ 1., nan,  1.,  2.])
Dimensions without coordinates: x

I agree, this is maybe not obvious. Here, I explained why I left it as is. If you want, I can try to keep track of the limit_direction argument upon mask creation and raise an exception, if the user applies ffill later. But this requires some tweaks in the code.

Edit: I added a sentence to the fill_gaps doc page to explain the behaviour better.

@max-sixty
Copy link
Collaborator

Yes, I see what it does, but (very very respectfully, with much admiration for the overall PR!), it doesn't make semantic sense to me.

When I think "backward", I mean "look behind you, is there a gap? OK now fill it". I don't see how it can mean "start one step back, and then decide which direction to continue". Why only one step?

I think we should try hard to have xarray's interfaces be understandable without much technical knowledge. (Indeed, I think there's probably a more precise way of describing my objection around edges and continuous vs. discrete, but hopefully this suffices).

Would you object to removing the limit_direction param, and deferring that to the method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants