
Pint unit conversion loads entire file to memory #138

Closed · tlogan2000 opened this issue Dec 21, 2018 · 13 comments · Fixed by #156

@tlogan2000 (Collaborator) commented Dec 21, 2018

Description

Running calculations via the index classes (temperature.py and precip.py), @davidcaron noticed that the pint library loads the entire file into memory when doing the conversion. This is a big problem on large datasets (e.g. PCIC data).

We will need to find a way to either keep the pint conversion as a dask.delayed calculation on the chunked DataArray, or code a simple version of the 'common' conversions that we see most often.
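As a minimal sketch of the failure mode (the file name and variable are illustrative, not from the codebase): any touch of .values during the conversion realizes the whole dask-backed array at once.

import xarray as xr
import pint

units = pint.UnitRegistry()

# Open lazily with dask chunks; nothing is read into memory yet.
ds = xr.open_dataset("tasmax_pcic.nc", chunks={"time": 365})
tas = ds["tasmax"]

# Building a pint Quantity from the array forces .values, which
# realizes every chunk at once:
fu = units.parse_units(tas.attrs["units"])
q = tas.values * fu  # <- the entire file is now in memory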

@huard (Collaborator) commented Dec 21, 2018

I seem to recall that it didn't work out of the box and I had to do a copy. I'll see how I can fix that. Urgent?

@tlogan2000 (Collaborator, Author):

Not urgent, really. I am doing the conversion by hand before calling the classes in order to bypass the conversion step....

@huard (Collaborator) commented Jan 21, 2019

How would you test that a fix actually works?

@tlogan2000 (Collaborator, Author):

Not totally sure, actually. @davidcaron, do you think you can pinpoint for us the bit of code where pint loads the values into memory during the conversion? Basically, I think we want to make sure the input DataArray is never accessed with da.load(), da.values, da.compute() or similar, ensuring that it remains a delayed dask array.
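A hedged sketch of what such a check could look like (check_conversion_is_lazy and the baseline callable are illustrative, not existing code): assert that the result is still backed by a dask array.

import dask.array
import numpy as np
import xarray as xr

def check_conversion_is_lazy(convert):
    """`convert` stands in for whatever conversion callable we end up with."""
    tas = xr.DataArray(
        np.random.rand(365), dims="time", attrs={"units": "degC"}
    ).chunk({"time": 30})
    out = convert(tas)
    # The essential assertion: the result is still a dask array,
    # i.e. nothing was computed or pulled into memory.
    assert isinstance(out.data, dask.array.Array)

# Trivially lazy baseline: dask defers the element-wise scaling.
check_conversion_is_lazy(lambda da: da * 100.0)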

@huard (Collaborator) commented Jan 21, 2019

I know where it is: line 406 of utils.py. There is a da.copy(). The fix would be to write da.copy(deep=False) instead, but I don't know how to test it except empirically.

@tlogan2000 (Collaborator, Author) commented Jan 21, 2019

I think da.values (lines 408 and 410) will load the data as well (essentially accessing it as a numpy array?).
Not sure about testing this other than manually timing it or checking memory usage.

Would this work? Probably too good to be true:

            if context:
                b = (da * fu).to(tu, context)
            else:
                b = (da * fu).to(tu)

@huard (Collaborator) commented Jan 21, 2019

Yes, but will it load the entire array or just the chunks it needs? That's what MetPy seems to do.

@huard (Collaborator) commented Jan 21, 2019

I think you're right, it will load the entire array into memory...

@huard (Collaborator) commented Jan 21, 2019

Posted a question on the MetPy issue tracker: Unidata/MetPy#996

Another option is to convert the threshold to the units used by the DataArray.
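A hedged sketch of that second option (assuming the array advertises its units in da.attrs["units"]; threshold_as is an illustrative name): only the scalar threshold is converted, so the dask graph of the data is never touched.

import pint

units = pint.UnitRegistry()

def threshold_as(da, threshold, threshold_units):
    """Express a scalar threshold in the units of `da`."""
    return units.convert(threshold, threshold_units, da.attrs["units"])

# e.g. a 25 degC threshold against data stored in kelvin:
# threshold_as(tas, 25.0, "degC")  -> 298.15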

@huard (Collaborator) commented Jan 21, 2019

Branch fix-138 has a test now to check whether the unit conversion is delayed or not, so we can at least get rapid feedback on whether something works or not.

@davidcaron (Contributor) commented Jan 21, 2019

@tlogan2000, yes, da.values copies the DataArray into a new numpy array, and da.copy() is deep by default. I'm not familiar with DataArrays, but there is probably a way to scale the values without copying them...

A quick way to see if the data is copied is to change a value and check if it's changed in the other object.
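For instance, a small sketch of that check with a shallow copy, which shares the underlying buffer:

import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros(3))
b = a.copy(deep=False)  # shallow: shares the underlying buffer
b[0] = 1.0
assert float(a[0]) == 1.0  # the change shows up in `a`, so no copy was made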

I wrote this quick script to explore scaling DataArrays without copying in memory. You would need to add pint's context, which I'm not familiar with, and keep in mind that this function overwrites the DataArray given as input.

import xarray as xr
import numpy as np
import pint


units = pint.UnitRegistry()


def convert_units(da):
    """Scale the DataArray in place from meters to centimeters (no copy)."""
    fu = units.meter
    tu = units.cm

    if fu != tu:
        # In-place multiplication: reuses the existing buffer instead
        # of allocating a converted copy.
        da *= float(fu / tu)

    return da


if __name__ == '__main__':
    data_array = xr.DataArray(np.random.randn(2, 3))
    print('\n\n before \n')
    print(data_array)

    converted = convert_units(data_array)

    print('\n\n after \n')
    print(data_array)
    print(converted)

    assert np.allclose(data_array, converted)

Output:

 before 

<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 1.056166,  0.86343 , -2.041906],
       [ 1.287696, -0.405048, -0.666463]])
Dimensions without coordinates: dim_0, dim_1


 after 

<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 105.616567,   86.343013, -204.1906  ],
       [ 128.769579,  -40.504759,  -66.646318]])
Dimensions without coordinates: dim_0, dim_1
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 105.616567,   86.343013, -204.1906  ],
       [ 128.769579,  -40.504759,  -66.646318]])
Dimensions without coordinates: dim_0, dim_1

I'm probably overlooking a lot of things; I will look further when I get some time.

@huard (Collaborator) commented Jan 21, 2019

That won't work for Fahrenheit, which includes both a multiplication and an addition, but maybe we can just say we don't support F.
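Then again, the degF conversion is still affine, so it could be expressed as one multiplication plus one addition, both of which stay lazy on dask arrays. A sketch of that idea (affine_factors is an illustrative helper; probing the slope and intercept with Registry.convert is an assumption about the approach):

import pint

units = pint.UnitRegistry()

def affine_factors(source, target):
    """Return (slope, intercept) of the source -> target conversion."""
    b = units.convert(0.0, source, target)
    m = units.convert(1.0, source, target) - b
    return m, b

m, b = affine_factors("degF", "kelvin")
# da_kelvin = da_fahrenheit * m + b  # element-wise, stays delayed
assert abs(m - 5.0 / 9.0) < 1e-12 and abs(b - 255.372) < 1e-3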

@huard (Collaborator) commented Feb 6, 2019

The to method of a pint quantity ultimately calls:

with self._REGISTRY.context(*contexts, **ctx_kwargs):
    return self._REGISTRY.convert(self._magnitude, self._units, other)

So I think we can wrap that in a function and call it from an apply statement.
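A hedged sketch of that wrapping (the use of xr.apply_ufunc with dask="parallelized" and all the names are assumptions, not the actual fix):

import pint
import xarray as xr

units = pint.UnitRegistry()

def convert_units_to(da, target, context=None):
    """Convert `da` to `target` units chunk by chunk, staying lazy."""
    source = da.attrs["units"]

    def _convert(block):
        # Registry.convert operates on plain numpy blocks, so dask can
        # map it over each chunk independently.
        if context:
            with units.context(context):
                return units.convert(block, source, target)
        return units.convert(block, source, target)

    out = xr.apply_ufunc(
        _convert, da, dask="parallelized", output_dtypes=[da.dtype]
    )
    out.attrs["units"] = target
    return out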

@Zeitsperre added the bug label on Feb 7, 2019
@Zeitsperre added this to the xclim v1.0 milestone on Feb 7, 2019