
Review how missing data is handled in inversions #97

brendan-m-murphy opened this issue Apr 9, 2024 · 7 comments

brendan-m-murphy commented Apr 9, 2024

If no footprint or obs data is found for a site, that site is dropped. This works well for monthly inversions, but less so for annual inversions.

@joe-pitt had some runs where NaNs due to missing footprint data were propagated to H and caused a confusing pymc error.

Possible improvements could include:

  • adding options to fill or drop NaN values
  • modifying the code that drops sites when data isn't found so that it works for annual runs (might be more difficult?)
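The first option could be exposed as a flag. A minimal sketch, using pandas as a stand-in for the xarray objects in the inversion (the names and the `missing_fp_action` flag are hypothetical, not existing openghg_inversions API):

```python
import numpy as np
import pandas as pd

# Hypothetical sensitivity time series with gaps (NaN) where footprint
# data is missing from the object store.
times = pd.date_range("2016-04-01", periods=6, freq="MS")
h_col = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan], index=times)

def handle_missing(h, missing_fp_action="drop"):
    """Drop or zero-fill times with no footprint, so pymc never sees NaNs."""
    if missing_fp_action == "drop":
        return h.dropna()
    if missing_fp_action == "fill":
        return h.fillna(0.0)
    raise ValueError(f"unknown option: {missing_fp_action!r}")

assert not handle_missing(h_col, "drop").isna().any()
assert not handle_missing(h_col, "fill").isna().any()
```

Either choice keeps NaNs out of H; dropping changes the time axis, filling keeps it but inserts zero sensitivities.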

Also, we should investigate what happens if some obs data is missing. Typically footprint data won't be missing, but we often have missing obs, and the runs work despite this.

@joe-pitt (Contributor)

Quick update on this: if a month of fp data in the middle of the dataset is missing from the object store (e.g. you have April 2016 and June 2016, but are missing May 2016), you get the confusing pymc error:
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!

If there is a month of fp data missing at the end of the dataset (e.g. MHD obs run to November 2023, but the MHD footprints only run to October 2023) you get a different error:

File "/user/home/zh21490/openghg_inversions/openghg_inversions/hbmcmc/inversionsetup.py", line 100, in addaveragingerror
    fp_all[site.upper()].mf_repeatability.values ** 2
ValueError: operands could not be broadcast together with shapes (870,) (1,943) 

In this case there are no NaNs in fp_all, but the timeseries is truncated to the end of the fp data, which causes these subsequent issues. I think the first step would be to raise a more informative error when either of these cases occurs, e.g. "Some observations could not be matched to footprints - check that all footprints are loaded in the object store". Even better if it can tell you which footprints are missing!
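That pre-flight check could look something like this (a hypothetical sketch, not current openghg_inversions code; the function name is illustrative):

```python
import pandas as pd

def missing_footprint_months(obs_months, fp_months):
    """Return months that have obs but no footprints (empty = full coverage)."""
    return obs_months.difference(fp_months)

# Reproduces the first scenario above: April and June footprints exist,
# May is missing.
obs = pd.period_range("2016-04", "2016-06", freq="M")
fps = pd.PeriodIndex(["2016-04", "2016-06"], freq="M")

missing = missing_footprint_months(obs, fps)
if len(missing):
    # In the inversion this would be raised as a clear error before pymc runs.
    print(
        "Some observations could not be matched to footprints - "
        f"missing footprint months: {list(missing.astype(str))}"
    )
```

Running the check before building H would replace both confusing failure modes with one message naming the missing footprints.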

@brendan-m-murphy (Contributor, Author)

If the error happens during "add averaging error", that might be because the obs data is loaded again from the object store for that step: obs were probably dropped when the ModelScenarios were made, and then we try to add averaging error based on the full obs time series.
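A minimal numpy reproduction of that broadcast failure (the shapes are taken from the traceback above; the arrays themselves are dummies):

```python
import numpy as np

# Obs truncated to the footprint period vs. repeatability reloaded over the
# full obs period: the two series no longer share a time axis.
scenario_obs = np.zeros(870)        # obs after times were dropped to match fps
full_repeatability = np.zeros(1943)  # obs reloaded over the full period

try:
    scenario_obs + full_repeatability ** 2
except ValueError as exc:
    print(exc)  # operands could not be broadcast together ...
```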

@joe-pitt (Contributor)

Yes, I think that is what is happening. So maybe it would run if averaging error was turned off, but by default I think we want it to fail, or at the very least give a strong warning that some of the obs were removed due to missing fps.

@brendan-m-murphy (Contributor, Author)

For now, you might just want to turn off "addaveragingerror": when it loads the obs, it specifies an averaging period, so the averaging error of that data (over the same period) is zero, and I don't think it does anything. I haven't got around to fixing that, but there is an open issue: #42
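An illustration of why that makes the step a no-op (hypothetical data): if the obs are reloaded already averaged to the inversion's period, each averaging window contains a single value, so the within-window variability is exactly zero.

```python
import pandas as pd

# Monthly-averaged obs: one value per averaging window.
obs = pd.Series(
    [1.8, 1.9, 2.0],
    index=pd.date_range("2016-04-01", periods=3, freq="MS"),
)

# Re-averaging over the same (monthly) period: each window holds one sample,
# so the population std dev within every window is zero.
within_window_std = obs.resample("MS").std(ddof=0)
assert (within_window_std == 0).all()
```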

@joe-pitt (Contributor)

OK, cool, thanks! Although in my case I found the missing fps, so I'm running fine now.


qq23840 commented Apr 19, 2024

I'm having this issue as well, @brendan-m-murphy: times with footprints but no obs are no problem, but with obs and no footprints I get the same error as Joe (pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!). Ideally we'd want a way of skipping the timestamps without fps in the inversion (presumably this is what happens when there are no obs?). Any idea where that would best go?

This looks related to #96, with the logp = -inf issue, as well.

@brendan-m-murphy (Contributor, Author)

It should be possible to fix this by calling dropna("time") on the model scenarios before they're converted to numpy arrays. I'll have a look on Monday.
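A sketch of that fix, with a pandas DataFrame standing in for the xarray Dataset (the xarray equivalent would be ds.dropna(dim="time"); the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for a model scenario aligned on "time"; H is NaN where the
# footprint month is missing.
times = pd.date_range("2016-04-01", periods=4, freq="MS")
scenario = pd.DataFrame(
    {"mf": [1.8, 1.9, 2.0, 2.1], "H": [0.1, np.nan, 0.3, 0.4]},
    index=times,
)

# Drop the unmatched times before handing the arrays to pymc.
clean = scenario.dropna()
assert len(clean) == 3
assert not clean.isna().any().any()
```

Dropping before the numpy conversion means pymc only ever sees finite values, avoiding the SamplingError at model initialisation.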
