Skip to content

Commit

Permalink
Merge pull request #229 from holukas/indev
Browse files Browse the repository at this point in the history
Indev
  • Loading branch information
holukas authored Oct 4, 2024
2 parents 42f3828 + bfe1d7a commit d6e0481
Show file tree
Hide file tree
Showing 20 changed files with 7,366 additions and 2,499 deletions.
76 changes: 76 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,76 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.83.0 | 4 Oct 2024

## MDS gap-filling

Finally it is possible to use the `MDS` (`marginal distribution sampling`) gap-filling method in `diive`. This method is
the current default and widely used gap-filling method for eddy covariance ecosystem fluxes. For a detailed description
of the method see Reichstein et al. (2005) and Pastorello et al. (2020; full references given below).

The implementation of `MDS` in `diive` (`FluxMDS`) follows the description in Reichstein et al. (2005) and should
therefore yield results similar to other implementations of this algorithm. `FluxMDS` can also easily output model
scores, such as r2 and error values.

At the moment it is not yet possible to use `FluxMDS` in the flux processing chain, but during the preparation of this
update the flux processing chain code was already refactored and prepared to include `FluxMDS` in one of the next
updates.

At the moment, `FluxMDS` is specifically tailored to gap-fill ecosystem fluxes, a more general implementation (e.g., to
gap-fill meteorological data) will follow.

## New features

- Added new gap-filling class `FluxMDS`:
- `MDS` stands for `marginal distribution sampling`. The method uses a time window to first identify meteorological
conditions (short-wave incoming radiation, air temperature and VPD) similar to those when the missing data
occurred. Gaps are then filled with the mean flux in the time window.
- `FluxMDS` cannot be used in the flux processing chain, but will be implemented soon.
- (`diive.pkgs.gapfilling.mds.FluxMDS`)

### Changes

- **Storage correction**: By default, values missing in the storage term are now filled with a rolling mean in an
expanding
time window. Testing showed that the (single point) storage term is missing for between 2-3% of the data, which I
think is reason enough to make filling these gaps the default option. Previously, it was optional to fill the gaps
using random forest, however, results were not great since only the timestamp info was used as model features. Plots
generated during Level-3.1 were also updated, now better showing the storage terms (gap-filled and non-gap-filled) and
the flag indicating filled values (
`diive.pkgs.fluxprocessingchain.level31_storagecorrection.FluxStorageCorrectionSinglePointEddyPro`)

### Notebooks

- Added notebook example for `FluxMDS` (`notebooks/GapFilling/FluxMDSGapFilling.ipynb`)

### Tests

- Added test case for `FluxMDS` (`tests.test_gapfilling.TestGapFilling.test_fluxmds`)
- 50/50 unittests ran successfully

### Bugfixes

- Fixed bug: overall quality flag `QCF` was not created correctly for the different USTAR scenarios (
`diive.core.base.identify.identify_flagcols`) (`diive.pkgs.qaqc.qcf.FlagQCF`)
- Fixed bug: calculation of `QCF` flag sums is now strictly done on flag columns. Before, sums were calculated across
all columns in the flags dataframe, which resulted in erroneous overall flags after USTAR filtering (
`diive.pkgs.qaqc.qcf.FlagQCF._calculate_flagsums`)

### Environment

- Added [polars](https://pola.rs/)

### References

- Pastorello, G. et al. (2020). The FLUXNET2015 dataset and the ONEFlux processing pipeline
for eddy covariance data. 27. https://doi.org/10.1038/s41597-020-0534-3
- Reichstein, M., Falge, E., Baldocchi, D., Papale, D., Aubinet, M., Berbigier, P., Bernhofer, C., Buchmann, N.,
Gilmanov, T., Granier, A., Grunwald, T., Havrankova, K., Ilvesniemi, H., Janous, D., Knohl, A., Laurila, T., Lohila,
A., Loustau, D., Matteucci, G., … Valentini, R. (2005). On the separation of net ecosystem exchange into assimilation
and ecosystem respiration: Review and improved algorithm. Global Change Biology, 11(9),
1424–1439. https://doi.org/10.1111/j.1365-2486.2005.001002.x

## v0.82.1 | 22 Sep 2024

## Notebooks
Expand Down Expand Up @@ -2253,3 +2323,9 @@ which allows the calculation of the flux detection limit following Langford et a
571–583. https://doi.org/10.5194/bg-3-571-2006
- Pastorello, G. et al. (2020). The FLUXNET2015 dataset and the ONEFlux processing pipeline
for eddy covariance data. 27. https://doi.org/10.1038/s41597-020-0534-3
- Reichstein, M., Falge, E., Baldocchi, D., Papale, D., Aubinet, M., Berbigier, P., Bernhofer, C., Buchmann, N.,
Gilmanov, T., Granier, A., Grunwald, T., Havrankova, K., Ilvesniemi, H., Janous, D., Knohl, A., Laurila, T., Lohila,
A., Loustau, D., Matteucci, G., … Valentini, R. (2005). On the separation of net ecosystem exchange into assimilation
and ecosystem respiration: Review and improved algorithm. Global Change Biology, 11(9),
1424–1439. https://doi.org/10.1111/j.1365-2486.2005.001002.x

1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,7 @@ _Fill gaps in time series with various methods._
- **Long-term gap-filling using RandomForestTS** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/LongTermRandomForestGapFilling.ipynb))
- **Linear interpolation** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/LinearInterpolation.ipynb))
- **Quick random forest gap-filling** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/QuickRandomForestGapFilling.ipynb))
- **MDS gap-filling of ecosystem fluxes** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/FluxMDSGapFilling.ipynb)), approach by [Reichstein et al., 2005](https://onlinelibrary.wiley.com/doi/10.1111/j.1365-2486.2005.001002.x)

### Outlier Detection

Expand Down
8 changes: 7 additions & 1 deletion diive/core/base/identify.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ def identify_relevants(seriescol: str) -> list:
return relevant


def identify_flagcols(df: DataFrame, seriescol: str) -> list:
def identify_flagcols(df: DataFrame, seriescol: str, exclude_ustar_ids: list = None) -> list:
"""Identify flag columns."""
flagcols = [c for c in df.columns
if str(c).startswith('FLAG_')
Expand All @@ -43,4 +43,10 @@ def identify_flagcols(df: DataFrame, seriescol: str) -> list:
relevant = identify_relevants(seriescol=seriescol)
flagcols = [f for f in flagcols if any(n in f for n in relevant)]

# Use USTAR ID strings to remove flags from other USTAR scenarios (Level-3.3).
# This keeps flags for the current USTAR scenario by excluding all other scenarios.
# Searching for the string of the current scenario would not work
if exclude_ustar_ids:
flagcols = [f for f in flagcols if not any(n in f for n in exclude_ustar_ids)]

return flagcols
30 changes: 4 additions & 26 deletions diive/core/ml/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,15 @@
from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import PredictionErrorDisplay, max_error, median_absolute_error, mean_absolute_error, \
mean_absolute_percentage_error, r2_score, mean_squared_error, root_mean_squared_error
from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from yellowbrick.regressor import PredictionError, ResidualsPlot

import diive.core.dfun.frames as fr
from diive.core.times.times import TimestampSanitizer
from diive.core.times.times import include_timestamp_as_cols
from diive.pkgs.gapfilling.scores import prediction_scores

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 12)
Expand Down Expand Up @@ -335,7 +335,7 @@ def trainmodel(self,

# Scores
print(f">>> Calculating prediction scores based on predicting unseen test data of {self.target_col} ...")
self._scores_traintest = prediction_scores_regr(predictions=pred_y_test, targets=y_test)
self._scores_traintest = prediction_scores(predictions=pred_y_test, targets=y_test)

if showplot_scores:
print(f">>> Plotting observed and predicted values ...")
Expand Down Expand Up @@ -820,7 +820,7 @@ def _fillgaps_fullmodel(self, showplot_scores, showplot_importance):

# Scores, using all targets
print(f">>> Calculating prediction scores based on all data predicting {self.target_col} ...")
self._scores = prediction_scores_regr(predictions=pred_y, targets=y)
self._scores = prediction_scores(predictions=pred_y, targets=y)

if showplot_scores:
print(f">>> Plotting observed and predicted values based on all data ...")
Expand Down Expand Up @@ -1105,28 +1105,6 @@ def _define_cols(self):
# return importances


def prediction_scores_regr(predictions: np.array,
targets: np.array) -> dict:
"""
Calculate prediction scores for regression estimator
See:
- https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
"""

# Calculate stats
scores = {
'mae': mean_absolute_error(targets, predictions),
'medae': median_absolute_error(targets, predictions),
'mse': mean_squared_error(targets, predictions),
'rmse': root_mean_squared_error(targets, predictions),
'mape': mean_absolute_percentage_error(targets, predictions),
'maxe': max_error(targets, predictions),
'r2': r2_score(targets, predictions)
}
return scores


def plot_feature_importance(feature_importances: pd.DataFrame, n_perm_repeats: int):
fig, axs = plt.subplots(ncols=1, figsize=(9, 16))
_fidf = feature_importances.copy().sort_values(by='PERM_IMPORTANCE', ascending=True)
Expand Down
4 changes: 2 additions & 2 deletions diive/core/plotting/heatmap_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -223,15 +223,15 @@ def set_cmap(cmap, color_bad, z):
return cmap, z

def format(self, ax_xlabel_txt, ax_ylabel_txt, plot):
title = self.title if self.title else f"{self.series.name} in {self.series.index.freqstr} time resolution"
title = self.title if self.title else f"{self.series.name} ({self.series.index.freqstr})"
self.ax.set_title(title, color='black')
# Colorbar
cb = plt.colorbar(plot, ax=self.ax, format=f"%.{int(self.cb_digits_after_comma)}f",
label=self.zlabel)
cb.set_label(label=self.zlabel, size=self.axlabels_fontsize)
cb.ax.tick_params(labelsize=self.cb_labelsize)
default_format(ax=self.ax, ax_xlabel_txt=ax_xlabel_txt, ax_ylabel_txt=ax_ylabel_txt,
ticks_direction='out', ticks_length=8, ticks_width=2,
ticks_direction='out', ticks_length=4, ticks_width=2,
ax_labels_fontsize=self.axlabels_fontsize,
ticks_labels_fontsize=self.ticks_labelsize)
format_spines(ax=self.ax, color='black', lw=2)
21 changes: 21 additions & 0 deletions diive/pkgs/createvar/events.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
from datetime import datetime
from pathlib import Path

from diive.core.io.files import load_parquet

SOURCEDIR = r"F:\Sync\luhk_work\20 - CODING\29 - WORKBENCH\dataset_cha_fp2024_2005-2023\40_FLUXES_L1_IRGA+QCL+LGR_mergeData"
FILENAME = r"41.1_CH-CHA_IRGA_LGR+QCL_Level-1_eddypro_fluxnet_2005-2023_meteo7.parquet"
FILEPATH = Path(SOURCEDIR) / FILENAME
maindf = load_parquet(filepath=FILEPATH)
locs = (maindf.index.year >= 2023) & (maindf.index.year <= 2023)
df = maindf.loc[locs, 'FC'].copy()
df['EVENT'] = 0 # 0 = no event

e_date = ['2023-06-14', '2023-06-19']

start = datetime.strptime(e_date[0], '%Y-%m-%d').date()
end = datetime.strptime(e_date[1], '%Y-%m-%d').date()

locs = (df.index.date >= start) & (df.index.date <= end)
df = df.loc[locs]
print(df)
Loading

0 comments on commit d6e0481

Please sign in to comment.