fix the memory bloat of a saved emulator #25
I've been thinking about this today, and I'm adding an option to downsample the points in the ECDF. For the drought experiment runs we have something like 500 samples in each grid cell, and that's a lot more than we really need. I think we could easily reduce that by a factor of 10 and still get good fidelity. This can be implemented pretty easily by adding a decimation function right before we construct the interpolator. Option 1A (before we resort to option 2) would be to stop storing the closures produced by `approxfun` at all; the closures it returns carry their input data around in their environments, which is a big part of what makes the saved emulator so large.
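To make the decimation idea concrete, here is a minimal sketch (the helper name `decimate_ecdf` and its details are hypothetical, not the actual fldgen code): thin the sorted residuals to roughly the target number of points, then build the CDF and quantile interpolators from the thinned sample with `approxfun`.

```r
## Hypothetical sketch: thin the sorted sample, then build the interpolators.
decimate_ecdf <- function(x, n_keep = 50) {
    xs <- sort(x)
    n  <- length(xs)
    ## Evenly spaced indices into the sorted sample; endpoints always kept
    keep <- unique(round(seq(1, n, length.out = min(n_keep, n))))
    xk <- xs[keep]
    pk <- (keep - 0.5) / n            # plotting positions for the ECDF
    list(cdf   = approxfun(xk, pk, yleft = 0, yright = 1),
         quant = approxfun(pk, xk, rule = 2))
}

## Example: ~500 residuals in one grid cell reduced to ~50 interpolation knots
set.seed(867)
r    <- rnorm(500)
funs <- decimate_ecdf(r, n_keep = 50)
funs$cdf(0.5)     # P(residual <= 0.5) under the decimated ECDF
funs$quant(0.9)   # approximate 90th percentile
```

Keeping the endpoints preserves the support of the sample, so the behavior outside the interpolation range is unchanged by the decimation.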
So, as a quick experiment, I took a full grid of residuals and compared the decimated (by a factor of 19 -- leaving about 50 samples per grid cell) empirical CDF and quantile functions to the fully-sampled version. I computed the mean and maximum differences in both output probability and quantile values for each grid cell. (Quantile differences were normalized by the mean absolute value of the input values; CDF probabilities are already on a scale from 0-1.) Then I took the mean and the max of all four indicators over the entire grid. Here's what I came up with:
The maximum differences for the quantiles are a little larger than I'm happy with, so I've got a test running now to see how it looks when I downsample the CDF to 100 samples per grid cell. That's still a factor of 10 savings, which should help a lot.
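For concreteness, the per-cell comparison described above could be computed roughly like this. This is a sketch under assumptions: the probe grids, helper names, and indicator names other than `qmaxdiff` (which appears in the next comment) are illustrative, and `full`/`dec` stand for the fully-sampled and decimated interpolators for one grid cell.

```r
## Sketch of the per-grid-cell comparison: mean and max differences in CDF
## probabilities and in normalized quantiles, full vs. decimated.
compare_cell <- function(x, full, dec, nprobe = 1000) {
    xq <- seq(min(x), max(x), length.out = nprobe)   # probe values for the CDF
    pq <- seq(0.001, 0.999, length.out = nprobe)     # probe probabilities

    cdfdiff <- abs(full$cdf(xq) - dec$cdf(xq))
    qdiff   <- abs(full$quant(pq) - dec$quant(pq)) / mean(abs(x))

    c(pmeandiff = mean(cdfdiff), pmaxdiff = max(cdfdiff),
      qmeandiff = mean(qdiff),   qmaxdiff = max(qdiff))
}

## Example for one cell, using approxfun-based full and decimated versions
x    <- rnorm(500)
xs   <- sort(x)
p    <- (seq_along(xs) - 0.5) / length(xs)
keep <- round(seq(1, length(xs), length.out = 100))
full <- list(cdf   = approxfun(xs, p, yleft = 0, yright = 1),
             quant = approxfun(p, xs, rule = 2))
dec  <- list(cdf   = approxfun(xs[keep], p[keep], yleft = 0, yright = 1),
             quant = approxfun(p[keep], xs[keep], rule = 2))
compare_cell(x, full, dec)

## Grid-level summary: with cellstats an ncell x 4 matrix of these indicators,
## take the grid mean and grid max of each column, e.g.
## apply(cellstats, 2, mean);  apply(cellstats, 2, max)
```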
Here's what the results of the 100-sample (-ish) test look like:
All of the stats are improved, except for the grid max of qmaxdiff, which is identical (and, unsurprisingly, occurs in the same grid cell). I'm still not sure whether this is worth being concerned about. I'm going to implement this version and check how much difference it makes in the output residual fields.
Ok, I've spent way more time on this than it's probably worth, but here's what I found.

tl;dr: when applying fldgen to these inputs (ISIMIP half-degree grids), a large share of the saved emulator's size comes from the lists of per-grid-cell CDF and quantile closures, so decimating the ECDF helps, but the way the closures are stored matters at least as much.

The tables below give a detailed comparison of memory usage in the two cases. Total emulator size:
Here's the breakdown by component:
So, we could get a quick win out of eliminating that component. The other big memory users are the lists of empirical CDF and quantile functions. Each one of these is a list of 67420 closures, and each of those closures keeps its own copy of the interpolation data in its environment.
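For a sense of why those closure lists are so heavy, here is a small illustration (assuming `approxfun`-style interpolators, as in the rest of this thread): the closure's environment, including its copy of the data, gets written out whenever the closure is serialized, and the overhead is paid once per closure.

```r
## Illustration: a closure built by approxfun() captures its data in its
## enclosing environment, and that environment is serialized along with it.
x <- sort(rnorm(500))
p <- (seq_along(x) - 0.5) / length(x)

f_full <- approxfun(x, p)                 # built from all 500 points
keep   <- round(seq(1, 500, length.out = 50))
f_dec  <- approxfun(x[keep], p[keep])     # built from ~50 points

ls(environment(f_full))                   # the captured data lives here

## Serialized sizes in bytes: full-resolution closure, decimated closure,
## and the bare decimated knots
length(serialize(f_full, NULL))
length(serialize(f_dec, NULL))
length(serialize(list(x = x[keep], p = p[keep]), NULL))

## A saved emulator holds two such closures for each of 67420 grid cells,
## so the per-closure overhead adds up quickly.
```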
It's not clear to me whether the closure's environment and the data it references are being fully accounted for in these measurements. Note also that the function formals and body take up a little space of their own, even though they are identical for every closure in the list.

So, what have we learned here? Reducing the resolution of the CDF and quantile functions shrinks the data carried in each closure's environment, so the decimation option is worth having. Finally, storing the CDF and quantile functions as lists of closures adds per-closure overhead (formals, body, environment) that decimation alone can't get rid of.

I still need to do some more testing on the equivalence between the decimated and fully-sampled functions.
Really interesting--thanks for the detail.
Intermediate fix provided by the introduction of the ECDF decimation option described above.
Option 1:
Use something other than `approx_fun` to characterize the empirical CDF for each grid cell with fewer variables in `R/normalizeresiduals.R` (#14); see the sketch after this list.
Option 2:
If Option 1 doesn't bring down the size of a saved emulator by enough, cry and think of something else
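As a rough sketch of the direction Option 1 points in (all names here are hypothetical, not the fldgen API): store only the decimated ECDF knots for each grid cell as plain numeric vectors, and rebuild the interpolating functions when the emulator is loaded, so no closures ever end up in the saved object.

```r
## Hypothetical sketch of Option 1: persist plain knot vectors instead of
## closures, and reconstruct the interpolators after loading.
make_ecdf_store <- function(x, n_keep = 100) {
    xs   <- sort(x)
    keep <- unique(round(seq(1, length(xs), length.out = min(n_keep, length(xs)))))
    list(x = xs[keep], p = (keep - 0.5) / length(xs))
}

restore_ecdf <- function(store) {
    list(cdf   = approxfun(store$x, store$p, yleft = 0, yright = 1),
         quant = approxfun(store$p, store$x, rule = 2))
}

## The saved emulator then holds ~2 * n_keep numbers per grid cell; the
## functions are rebuilt (e.g. after readRDS) only when they are needed.
cell_store <- make_ecdf_store(rnorm(500), n_keep = 100)
cell_funs  <- restore_ecdf(cell_store)
cell_funs$quant(0.5)
```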