-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pint unit conversion loads entire file to memory #138
Comments
I seem to recall that it didn't work out of the box and I had to do a copy. I'll see how I can fix that. Urgent ? |
Not urgent really. I am doing conversion by hand before calling the classes in order to bypass the conversion step.... |
How would you test that that a fix actually works ? |
Not totally sure actually. @davidcaron do you think you can you pinpoint for us the bit of code where pint loads the values to memory during conversion? I think basically we want to make sure that the input datarray is not called or accessed with da.load(), da.values(), da.compute() or similar ensuring that it remains a delayed dask array? |
I know where it is, it's on line 406 of utils.py. There is a |
I think da.values (lines 408 and 410 ) will load the data as well (essentially accessing as a numpy array?) would this work? probably too good to be true: if context:
b = (da * fu).to(tu, context)
else:
b = (da * fu).to(tu) |
Yes, but will it load the entire array or just the chunk it needs to ? That's what MetPy seems to do. |
I think you're right, it will load the entire array in memory... |
Posted a question on the MetPy issue tracker : Unidata/MetPy#996 Another option is to convert the threshold to the units used by the |
Branch fix-138 has a test now to check whether the unit conversion is delayed or not, so we can at least get rapid feedback on whether something works or not. |
@tlogan2000 , yes A quick way to see if the data is copied is to change a value and check if it's changed in the other object. I wrote this quick script to explore scaling DataArrays without copying in memory. You would need to add import xarray as xr
import numpy as np
import pint
units = pint.UnitRegistry()
def convert_units(da):
"""Return DataArray converted to unit."""
fu = units.meter
tu = units.cm
if fu != tu:
da *= float(fu / tu)
return da
if __name__ == '__main__':
data_array = xr.DataArray(np.random.randn(2, 3))
print('\n\n before \n')
print(data_array)
converted = convert_units(data_array)
print('\n\n after \n')
print(data_array)
print(converted)
assert np.allclose(data_array, converted) Output:
I'm probably overlooking a lot of things, I will look further when I get some time. |
That won't work for Fahrenheit, which includes both a multiplication and an addition, but maybe we can just say we don't support F. |
The
So I think we can wrap that in a function and call that from an apply statement. |
Description
Running calculations via index classes (temperture.py and precip.py) @davidcaron noticed that pint library will load the entire file to memory doing the conversion. This is a big problem on large data sets (e.g. pcic data)
We will need to find a way to either keep the conversion via pint as a dask.delayed calculation on the chunked dataarray or code a simple version of 'common' conversion that we see the most often.
The text was updated successfully, but these errors were encountered: