
[sdba] Optimizations and fixes #791

Merged
merged 22 commits into master on Aug 26, 2021

Conversation

@aulemahal (Collaborator) commented Aug 10, 2021

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • Tests for the changes have been added (for bug fixes / features)
  • Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)
  • bumpversion (major / minor / patch) has been called on this branch
  • Tags have been pushed (git push --tags)
  • The relevant author information has been added to .zenodo.json

What kind of change does this PR introduce?

  • Optimizes map_blocks when using coords backed by cftime.datetime objects. Dask needs some sort of hash for each input, and that tokenization is quite slow for cftime np.ndarrays because it goes through pickle.dumps (see Unidata/cftime#253). Also, CFTimeIndex objects are slow to create because of a check that cycles through all elements of the array. With many chunks and large time dimensions, this takes a significant portion of the call time (excluding the actual computation). Here we add an option to encode cftime coordinates into integers (+ attributes), as in a netCDF file, saving time on the cftime handling. The decoding is done within the task itself (thus in parallel). A sketch of the encode/decode idea follows this list.
  • Optimizes adapt_freq to skip computation on all-NaN blocks (the general pattern is sketched after this list).
  • Fixes nbu.vecquantile when rnk is NaN (a simplified analogue of the guard is shown after this list).
  • Merges all steps of DQM.adjust into a single map_blocks-wrapped function, once again to reduce the number of tasks in dask's graph. However, the adjust tasks are large and will trigger warnings about garbage collection taking a lot of CPU time. I haven't found any optimization for this part yet.
  • Better dtype conservation. Fixes in many places, including in map_blocks, which now sets the template variables to the promoted (largest) dtype of the inputs (see the dtype sketch after this list).
  • Modifies map_blocks and map_groups to consider the dimensions in Grouper.add_dims. By default, functions wrapped with map_groups assume that Grouper.DIM and Grouper.ADD_DIMS are all reduced, unless main_only=True is passed.
  • "Constant" extrapolation of adjustment factors in qm_adjust now uses values just over and under the max and min of the target data, instead of ±np.inf. This fixed a bug I hit while adjusting precipitation in ESPO-R, although I'm not completely sure why (an illustration of the new bounds is given after this list).

Does this PR introduce a breaking change?

I think not, but this PR was developed iteratively over a long time, so I'm not 100% sure.

Other information:

All these optimizations and fixes were made while processing "scénarios génériques v2", except for the add_dims change.


@aulemahal aulemahal marked this pull request as ready for review August 26, 2021 17:43
@aulemahal aulemahal requested a review from huard August 26, 2021 17:47
@aulemahal aulemahal added this to the v0.29 milestone Aug 26, 2021
@aulemahal aulemahal merged commit a6c82be into master Aug 26, 2021
@aulemahal aulemahal deleted the sdba-opt-and-fixes-sg2 branch August 26, 2021 19:00
Development

Successfully merging this pull request may close these issues.

Usage question: How to use multiple members as samples in sdba?
3 participants