Releases: holukas/diive

v0.75.0

26 Apr 11:26
e648180

v0.75.0 | 26 Apr 2024

XGBoost gap-filling

XGBoost can now be used to fill gaps in time series data.
In diive, XGBoost is implemented in the class XGBoostTS, which adds options for easily including, e.g.,
lagged variants of feature variables, timestamp info (DOY, month, ...) and a continuous record number. It also allows
direct feature reduction by including a purely random feature (consisting of completely random numbers) and calculating
the 'permutation importance'. All features whose permutation importance is lower than that of the random feature can
then be removed from the dataset, i.e., from the list of features, before building the final model.
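
The feature reduction idea described above can be sketched with plain NumPy. This is only an illustration of the principle and not the code used in XGBoostTS: a simple least-squares model stands in for XGBoost, the importance is evaluated on the training data for brevity, and all names (x1, x2, .RANDOM) and values are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: one informative, one irrelevant, one purely random benchmark.
features = {
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    ".RANDOM": rng.normal(size=n),
}
y = 3.0 * features["x1"] + rng.normal(scale=0.5, size=n)

names = list(features)
A = np.column_stack([features[k] for k in names] + [np.ones(n)])  # last column: intercept

# Stand-in model (assumption): ordinary least squares instead of XGBoost.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def r2(design, target):
    """Coefficient of determination of the fitted model on the given data."""
    residuals = target - design @ coef
    return 1.0 - np.sum(residuals**2) / np.sum((target - target.mean()) ** 2)


baseline = r2(A, y)

# Permutation importance: mean drop in R2 after shuffling one feature column.
importance = {}
for j, name in enumerate(names):
    drops = []
    for _ in range(10):
        shuffled = A.copy()
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        drops.append(baseline - r2(shuffled, y))
    importance[name] = float(np.mean(drops))

# Keep only features that are more important than the random benchmark feature.
threshold = importance[".RANDOM"]
kept = [k for k in names if k != ".RANDOM" and importance[k] > threshold]
print(importance, kept)
```

Whether an irrelevant feature such as x2 ends up above or below the random feature depends on chance, which is why a final model would be built only after this comparison.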

XGBoostTS and RandomForestTS both use the same base class MlRegressorGapFillingBase. This base class will also
facilitate the implementation of other gap-filling algorithms in the future.

Another fun (for me) addition is the new class TimeSince. It calculates the time since the last occurrence of
specific conditions. One example where this class can be useful is the calculation of 'time since last precipitation',
expressed as number of records, which can be helpful in identifying dry conditions. More examples: 'time since freezing
conditions' based on air temperature; 'time since management' based on management info, e.g. fertilization events.
Please see the notebook for some illustrative examples.
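
As a rough illustration of the counting idea, here is a minimal pandas sketch. It does not use the TimeSince class, and how TimeSince treats the event record itself or the records before the first event is an assumption here. The precipitation values and the variable names are made up.

```python
import pandas as pd

# Hypothetical half-hourly precipitation record (mm).
precip = pd.Series(
    [0.0, 1.2, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0],
    index=pd.date_range("2024-04-01 00:30", periods=8, freq="30min"),
    name="PREC",
)

# Condition of interest: a record counts as precipitation if it is above 0 mm.
event = precip > 0

# Each event starts a new block; records before the first event form block 0.
block = event.cumsum()

# Within each block, count the non-event records seen so far:
# the event record gets 0, the first record after it gets 1, and so on.
timesince = (~event).astype(int).groupby(block).cumsum()

# Before the first event, the time since precipitation is unknown.
timesince = timesince.where(block > 0).rename("TIMESINCE_PREC")
print(timesince)
```

The counter is expressed in number of records, so multiplying by the time resolution (30 minutes in this sketch) would give the elapsed time.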

Please note that diive is still under development and bugs can be expected.

New features

  • Added gap-filling class XGBoostTS for time series data,
    using XGBoost (diive.pkgs.gapfilling.xgboost_ts.XGBoostTS)
  • Added new class TimeSince: counts number of records (incremental number / counter) since the last time a time
    series was inside a specified range, useful for e.g. counting the time since last precipitation, since last freezing
    temperature, etc. (diive.pkgs.createvar.timesince.TimeSince)

Additions

  • Added base class for machine learning regressors, which is basically the code shared between the different
    methods. At the moment used by RandomForestTS and XGBoostTS. (diive.core.ml.common.MlRegressorGapFillingBase)
  • Added option to change line color directly in TimeSeries plots (diive.core.plotting.timeseries.TimeSeries.plot)

Notebooks

  • Added new notebook for gap-filling using XGBoostTS with minimal settings (notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb)
  • Added new notebook for gap-filling using XGBoostTS with more extensive settings (notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb)
  • Added new notebook for creating TimeSince variables (notebooks/CalculateVariable/TimeSince.ipynb)

Tests

  • Added test case for XGBoost gap-filling (tests.test_gapfilling.TestGapFilling.test_gapfilling_xgboost)
  • Updated test case for random forest gap-filling (tests.test_gapfilling.TestGapFilling.test_gapfilling_randomforest)
  • Harmonized test case for XGBoostTS with test case of RandomForestTS
  • Added test case for TimeSince variable creation (tests.test_createvar.TestCreateVar.test_timesince)

What's Changed

Full Changelog: v0.74.1...v0.75.0

v0.74.1

22 Apr 22:54
b9c0129

v0.74.1 | 23 Apr 2024

This update adds the first notebooks (and tests) for outlier detection methods. Only two tests are included so far and
both are relatively simple, but both notebooks already show in principle how outlier removal is handled. An
important aspect is that the individual outlier methods in diive do not remove outliers by default; instead, they create
a flag that shows where the outliers are located. The flag can then be used to remove the data points.
This update also includes the addition of a small function that creates artificial spikes in time series data and is
therefore very useful for testing outlier detection methods.
More outlier removal notebooks will be added in the future, including a notebook that shows how to combine results from
multiple outlier tests into one single overall outlier flag.
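
The flag-first workflow can be sketched in plain pandas. This is not the diive API: the spike generation, the limits and the flag convention (0 = OK, 1 = outlier) are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range("2024-04-01", periods=48, freq="30min")

# Hypothetical air temperature series around 10 degrees C.
series = pd.Series(rng.normal(loc=10.0, scale=1.0, size=48), index=index, name="TA")

# Add artificial spikes at three randomly chosen positions.
spike_positions = rng.choice(len(series), size=3, replace=False)
spiked = series.copy()
spiked.iloc[spike_positions] += 50.0

# Absolute limits test: the flag is 1 where a value is outside [lower, upper], else 0.
lower, upper = 0.0, 20.0
flag = ((spiked < lower) | (spiked > upper)).astype(int)

# The test itself leaves the data untouched; removal is a separate, explicit step.
cleaned = spiked.where(flag == 0)
print(flag.sum(), cleaned.isna().sum())
```

Keeping the flag separate from the data makes it possible to inspect the flagged records before deciding to remove them.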

New features

  • Added: new function to add impulse noise to time series (diive.pkgs.createvar.noise.impulse)

Notebooks

  • Added: new notebook for outlier detection: absolute limits, separately for daytime and nighttime
    data (notebooks/OutlierDetection/AbsoluteLimitsDaytimeNighttime.ipynb)
  • Added: new notebook for outlier detection: absolute limits (notebooks/OutlierDetection/AbsoluteLimits.ipynb)

Tests

  • Added: test case for outlier detection: absolute limits, separately for daytime and
    nighttime data (tests.test_outlierdetection.TestOutlierDetection.test_absolute_limits)
  • Added: test case for outlier detection: absolute
    limits (tests.test_outlierdetection.TestOutlierDetection.test_absolute_limits)

What's Changed

Full Changelog: v0.74.0...v0.74.1

v0.74.0

21 Apr 12:29
6a4d7a2

v0.74.0 | 21 Apr 2024

Additions

  • Added: new function to remove rows that do not have timestamp
    info (NaT) (diive.core.times.times.remove_rows_nat and diive.core.times.times.TimestampSanitizer)
  • Added: new settings VARNAMES_ROW and VARUNITS_ROW in filetypes YAML files, allows better and more specific
    configuration when reading data files (diive/configs/filetypes)
  • Added: many (small) example data files for various filetypes, e.g. ETH-RECORD-TOA5-CSVGZ-20HZ
  • Added: new optional check in TimestampSanitizer that compares the detected time resolution of a time series with
    the nominal (expected) time resolution. Runs automatically when reading files with ReadFileType, in which case
    the FREQUENCY from the filetype configs is used as the nominal time
    resolution. (diive.core.times.times.TimestampSanitizer, diive.core.io.filereader.ReadFileType)
  • Added: application of TimestampSanitizer after inserting a timestamp and setting it as index with
    function insert_timestamp, this makes sure the freq/freqstr info is available for the new timestamp
    index (diive.core.times.times.insert_timestamp)
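
The two timestamp checks mentioned in this list (removing rows without timestamp info, and comparing the detected with the nominal time resolution) can be sketched in plain pandas as follows. This does not call TimestampSanitizer or remove_rows_nat; the data and the nominal resolution of 30min are made up for illustration.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Hypothetical raw index with one entry that is not a valid timestamp.
raw_index = [
    "2024-04-01 00:30",
    "2024-04-01 01:00",
    "not a timestamp",
    "2024-04-01 01:30",
    "2024-04-01 02:00",
]
df = pd.DataFrame({"TA": [5.1, 5.0, 4.9, 4.7, 4.6]}, index=raw_index)

# Entries that cannot be parsed become NaT instead of raising an error.
df.index = pd.to_datetime(df.index, errors="coerce")

# Remove rows that do not have timestamp info (NaT).
df = df[df.index.notna()]

# Compare the detected time resolution with the nominal (expected) one.
detected = pd.infer_freq(df.index)
nominal = "30min"
matches = to_offset(detected) == to_offset(nominal)
print(detected, matches)
```

Comparing offsets rather than raw strings avoids a mismatch when two strings describe the same resolution in different notations.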

Notebooks

  • General: Ran all notebook examples to make sure they work with this version of diive
  • Added: new notebook for reading EddyPro fluxnet output file with DataFileReader
    parameters (notebooks/ReadFiles/Read_single_EddyPro_fluxnet_output_file_with_DataFileReader.ipynb)
  • Added: new notebook for reading EddyPro fluxnet output file with ReadFileType and pre-defined
    filetype EDDYPRO-FLUXNET-CSV-30MIN (notebooks/ReadFiles/Read_single_EddyPro_fluxnet_output_file_with_ReadFileType.ipynb)
  • Added: new notebook for reading multiple EddyPro fluxnet output files with MultiDataFileReader and pre-defined
    filetype EDDYPRO-FLUXNET-CSV-30MIN (notebooks/ReadFiles/Read_multiple_EddyPro_fluxnet_output_files_with_MultiDataFileReader.ipynb)

Changes

  • Renamed: function get_len_header to parse_header (diive.core.dfun.frames.parse_header)
  • Renamed: exampledata files (diive/configs/exampledata)
  • Renamed: filetypes YAML files to always include the file extension in the file name (diive/configs/filetypes)
  • Reduced: file size for most example data files

Tests

  • Added: various test cases for loading filetypes (tests/test_loaddata.py)
  • Added: test case for loading and merging multiple
    files (tests.test_loaddata.TestLoadFiletypes.test_load_exampledata_multiple_EDDYPRO_FLUXNET_CSV_30MIN)
  • Added: test case for reading EddyPro fluxnet output file with DataFileReader
    parameters (tests.test_loaddata.TestLoadFiletypes.test_load_exampledata_EDDYPRO_FLUXNET_CSV_30MIN_datafilereader_parameters)
  • Added: test case for resampling series to 30MIN time
    resolution (tests.test_time.TestTime.test_resampling_to_30MIN)
  • Added: test case for inserting timestamp with a different convention (middle, start,
    end) (tests.test_time.TestTime.test_insert_timestamp)
  • Added: test case for inserting timestamp as index (tests.test_time.TestTime.test_insert_timestamp_as_index)

Bugfixes

  • Fixed: bug in class DetectFrequency when inferred frequency is None (diive.core.times.times.DetectFrequency)
  • Fixed: bug in class DetectFrequency where pd.Timedelta() would crash if the input frequency does not have a
    number. Timedelta does not accept e.g. the frequency string min for minutely time resolution, even though
    e.g. pd.infer_freq() outputs min for data in 1-minute time resolution. Timedelta requires a number, in this
    case 1min. Results from infer_freq() are now checked for a number and, if none is found, 1 is added at the
    beginning of the frequency string. (diive.core.times.times.DetectFrequency)
  • Fixed: bug in notebook WindDirectionOffset, related to frequency detection during heatmap plotting
  • Fixed: bug in TimestampSanitizer where the script would crash if the timestamp contained an element that could
    not be converted to datetime, e.g., when there is a string mixed in with the regular timestamps. Data rows with
    invalid timestamps are now parsed as NaT by using errors='coerce'
    in pd.to_datetime(data.index, errors='coerce'). (diive.core.times.times.convert_timestamp_to_datetime
    and diive.core.times.times.TimestampSanitizer)
  • Fixed: bug when plotting heatmap (diive.core.plotting.heatmap_datetime.HeatmapDateTime)
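
The frequency string fix described in this list can be illustrated with a small helper. The function name add_missing_number is hypothetical and this is only a sketch of the described check, not the code in DetectFrequency.

```python
import pandas as pd


def add_missing_number(freq: str) -> str:
    """Prepend '1' if the frequency string does not contain a number."""
    if any(char.isdigit() for char in freq):
        return freq
    return f"1{freq}"


# Data in 1-minute time resolution: the inferred frequency string has no number.
index = pd.date_range("2024-04-01", periods=5, freq="1min")
inferred = pd.infer_freq(index)

# With the number added, the string can be passed to pd.Timedelta.
delta = pd.Timedelta(add_missing_number(inferred))
print(inferred, delta)
```

Strings that already contain a number, such as 30min, pass through unchanged.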

What's Changed

Full Changelog: v0.73.0...v0.74.0

v0.73.0

17 Apr 20:59
b8a9369

v0.73.0 | 17 Apr 2024

New features

  • Added new function trim_frame that trims the start and end of a dataframe based on the available records of a
    variable (diive.core.dfun.frames.trim_frame)
  • Added new option to export borderless
    heatmaps (diive.core.plotting.heatmap_base.HeatmapBase.export_borderless_heatmap)
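
The trimming idea can be sketched in plain pandas. The helper trim_to_variable below is hypothetical and only illustrates the principle; the signature and exact behavior of trim_frame may differ.

```python
import numpy as np
import pandas as pd


def trim_to_variable(df: pd.DataFrame, var: str) -> pd.DataFrame:
    """Keep only the rows between the first and last available record of var."""
    first = df[var].first_valid_index()
    last = df[var].last_valid_index()
    if first is None:
        # The variable has no records at all: return an empty frame.
        return df.iloc[0:0]
    return df.loc[first:last]


# Hypothetical data: A has missing values at the start, at the end and in between.
df = pd.DataFrame(
    {"A": [np.nan, np.nan, 1.0, np.nan, 3.0, np.nan], "B": [0, 1, 2, 3, 4, 5]},
    index=pd.date_range("2024-04-01", periods=6, freq="30min"),
)
trimmed = trim_to_variable(df, "A")
print(trimmed)
```

Only the leading and trailing rows are removed in this sketch; the gap in the middle of A is kept.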

Additions

  • Added more info in comments of class WindRotation2D (diive.pkgs.echires.windrotation.WindRotation2D)
  • Added example data for EddyPro full_output
    files (diive.configs.exampledata.load_exampledata_eddypro_full_output_CSV_30MIN)
  • Added code in an attempt to harmonize frequency detection from data: in class DetectFrequency the detected
    frequency strings are now converted from Timedelta (pandas) to offset (pandas) to .freqstr. This yields
    the frequency string as seen by (the current version of) pandas. The idea is to harmonize between different
    representations, e.g. T or min for minutes. Currently it seems that pandas is not consistent with e.g. the
    representation of minutes, using T in .infer_freq() but min for Timedelta (see here).
    (diive.core.times.times.DetectFrequency)
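
The conversion chain from Timedelta to offset to .freqstr looks roughly like this in pandas. The example index is made up, and the resulting string depends on the installed pandas version, so no specific output is shown.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Hypothetical half-hourly timestamp index.
index = pd.date_range("2024-04-01", periods=6, freq="30min")

# Time difference between two consecutive records, as pandas Timedelta.
delta = index[1] - index[0]

# Timedelta -> offset -> frequency string as seen by the installed pandas version.
freqstr = to_offset(delta).freqstr
print(freqstr)
```

Because the string is generated by pandas itself, it follows whatever notation the installed version uses for minutes.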

Changes

  • Updated class DataFileReader to comply with new pandas kwargs when
    using .read_csv() (diive.core.io.filereader.DataFileReader._parse_file)
  • Environment: updated pandas to v2.2.2 and pyarrow to v15.0.2
  • Updated date offsets in config filetypes to be compliant with pandas version 2.2+ (see here and here),
    e.g., 30T was changed to 30min. This seems to work without raising a warning. However, if the frequency is
    inferred from available data, the resulting frequency string still shows e.g. 30T, i.e. T for minutes instead
    of min. (diive/configs/filetypes)
  • Changed variable names in WindRotation2D to be in line with the variable names given in the paper by Wilczak et
    al. (2001) https://doi.org/10.1023/A:1018966204465

Removals

  • Removed function timedelta_to_string because this can be done with pandas to_offset().freqstr
  • Removed function generate_freq_str (unused)

Tests

  • Added test case for reading EddyPro full_output
    files (tests.test_loaddata.TestLoadFiletypes.test_load_exampledata_eddypro_full_output_CSV_30MIN)
  • Updated test for frequency detection (tests.test_timestamps.TestTime.test_detect_freq)

What's Changed

Full Changelog: v0.72.1...v0.73.0

v0.72.1

26 Mar 21:15
c90732c

v0.72.1 | 26 Mar 2024

  • pyproject.toml now uses the inequality syntax >= instead of caret syntax ^ because the version capping is
    restrictive and prevents compatibility in conda installations. See #74
  • Added badges in README.md
  • Smaller diive logo in README.md

What's Changed

Full Changelog: v0.72.0...v0.72.1

v0.72.0

25 Mar 21:41
2b634b8

v0.72.0 | 25 Mar 2024

New feature

  • Added new heatmap plotting class HeatmapYearMonth that plots a variable in year/month
    classes (diive.core.plotting.heatmap_datetime.HeatmapYearMonth)

Changes

  • Refactored code for class HeatmapDateTime (diive.core.plotting.heatmap_datetime.HeatmapDateTime)
  • Added new base class HeatmapBase for heatmap plots. Currently used by HeatmapYearMonth
    and HeatmapDateTime (diive.core.plotting.heatmap_base.HeatmapBase)

Notebooks

  • Added new notebook for HeatmapDateTime (notebooks/Plotting/HeatmapDateTime.ipynb)
  • Added new notebook for HeatmapYearMonth (notebooks/Plotting/HeatmapYearMonth.ipynb)

Bugfixes

  • Fixed bug in HeatmapDateTime where the last record of each day was not shown

What's Changed

Full Changelog: v0.71.6...v0.72.0

v0.71.6

22 Mar 23:58

v0.71.6 | 23 Mar 2024

Notebooks

  • Added new notebook for Percentiles (notebooks/Analyses/Percentiles.ipynb)
  • Added new notebook for LinearInterpolation (notebooks/GapFilling/LinearInterpolation.ipynb)
  • Added new notebook for calculating z-aggregates in quantiles (classes) of x and
    y (notebooks/Analyses/CalculateZaggregatesInQuantileClassesOfXY.ipynb)
  • Updated notebook for DaytimeNighttimeFlag (notebooks/CalculateVariable/DaytimeNighttimeFlag.ipynb)

What's Changed

Full Changelog: v0.71.5...v0.71.6

v0.71.5

22 Mar 11:29

v0.71.5 | 22 Mar 2024

Changes

  • Updated notebook for SortingBinsMethod (diive.pkgs.analyses.decoupling.SortingBinsMethod)

Plot showing vapor pressure deficit (y) in 10 classes of short-wave incoming radiation (x), separate for 5 classes of
air temperature (z). All values shown are medians of the respective variable. The shaded errorbars refer to the
interquartile range for the respective class. Plot was generated using the class SortingBinsMethod.

v0.71.4

19 Mar 23:05
191d318

v0.71.4 | 20 Mar 2024

Changes

  • Refactored class LongtermAnomaliesYear (diive.core.plotting.bar.LongtermAnomaliesYear)

Notebooks

  • Added new notebook for LongtermAnomaliesYear (notebooks/Plotting/LongTermAnomalies.ipynb)

What's Changed

Full Changelog: v0.71.3...v0.71.4

v0.71.3

19 Mar 16:11
6725667

v0.71.3 | 19 Mar 2024

Changes

  • Refactored class SortingBinsMethod: it allows investigating binned aggregates of a variable z in binned classes of x
    and y (see plot below). All bins now show medians and interquartile
    ranges. (diive.pkgs.analyses.decoupling.SortingBinsMethod)
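
The binning idea can be sketched with pandas. This is not the SortingBinsMethod implementation: the synthetic data, the number of classes and the use of quantile-based classes via pd.qcut are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data: z depends on x and y, plus some noise.
df = pd.DataFrame({"x": rng.uniform(0, 800, n), "y": rng.uniform(-5, 30, n)})
df["z"] = 0.01 * df["x"] + 0.05 * df["y"] + rng.normal(scale=0.2, size=n)

# Assign each record to one of 10 classes of x and one of 5 classes of y.
df["x_class"] = pd.qcut(df["x"], q=10, labels=False)
df["y_class"] = pd.qcut(df["y"], q=5, labels=False)


def q25(values):
    """Lower quartile."""
    return values.quantile(0.25)


def q75(values):
    """Upper quartile."""
    return values.quantile(0.75)


# Median and quartiles of z in each combination of y class and x class.
binned = df.groupby(["y_class", "x_class"])["z"].agg(["median", q25, q75, "count"])
binned["iqr"] = binned["q75"] - binned["q25"]
print(binned.head())
```

Each row of the result describes one bin, which corresponds to one point with its interquartile range in a plot of this kind.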

Notebooks

  • Added new notebook for SortingBinsMethod

Bugfixes

  • Added absolute links to example notebooks in README.md

Other

  • From now on, diive is officially published on PyPI

What's Changed

Full Changelog: v0.71.2...v0.71.3