[Bug]: Debug zppy diffs for v3.0.0 #931

Draft
wants to merge 5 commits into
base: main

Conversation

tomvothecoder
Collaborator

@tomvothecoder tomvothecoder commented Feb 3, 2025

Description

Summary of Changes

TLDR: Conclusion

I'm confident that v3.0.0 is working correctly since the test (model) and reference (observation) files align. The differences in the "Model - Observation" plots are expected because:

  • Subtle Regridding Differences: xESMF and ESMF have minor differences that lead to slight variations in regridded data.
  • Floating-Point Sensitivity: Comparing floating-point differences can amplify small discrepancies.
  • Minor Data Variations: Even slight differences (e.g., mismatched NaNs) in the test or reference files can result in significant changes in diff.nc.
  • Difference Calculation: Since diff.nc is computed as test minus reference, any small discrepancy can produce large differences, which then appear in the plot.
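A minimal sketch of the floating-point sensitivity point, using made-up values for a single grid cell (none of these numbers come from the actual files). It only illustrates why inputs that agree to within `rtol=1e-5` can still produce differences that disagree by a large relative amount.

```python
def rel_diff(actual: float, desired: float) -> float:
    """Relative difference in the rtol sense: |actual - desired| / |desired|."""
    return abs(actual - desired) / abs(desired)


# Hypothetical values for one grid cell, as written by two code versions.
test_main, ref_main = 287.1234, 287.1200
test_dev, ref_dev = 287.1240, 287.1199

# "Model - Observation" for each version.
diff_main = test_main - ref_main  # about 0.0034
diff_dev = test_dev - ref_dev     # about 0.0041

print(rel_diff(test_dev, test_main))  # on the order of 1e-6, passes rtol=1e-5
print(rel_diff(ref_dev, ref_main))    # on the order of 1e-7, passes rtol=1e-5
print(rel_diff(diff_dev, diff_main))  # on the order of 1e-1, fails rtol=1e-5
```

The inputs are large compared with their difference, so a tiny relative change in either input becomes a large relative change in `test - ref`.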

Debugging Overview

Some "Model - Observation" subplots show large differences between E3SM Diags v2.12.1 and v3.0.0rc2. These plots use the diff.nc file (test - ref). We need to determine why the plot differences are so large in some cases.

1. First, let's compare the output files for test (test.nc) and reference (ref.nc) to see if these align between e3sm_diags versions.

Source: regression_nc.ipynb. With a relative tolerance of 1e-5 (rtol=1e-5):

  • 30/34 matching files
  • 3/34 mismatch errors due to ccb regional subsetting differences; these are expected
  • 1/34 not equal files -- the number of differing elements is very small and the stats (min, max, mean, sum) are similar (MERRA2-OMEGA below)
    Comparing:
        * /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5-OMEGA-850-ANN-global_test.nc
        * /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5-OMEGA-850-ANN-global_test.nc
        * var_key: OMEGA
    
    Not equal to tolerance rtol=0.0001, atol=0
    
    Mismatched elements: 17 / 64800 (0.0262%)
    Max absolute difference among violations: 2.15263471e-05
    Max relative difference among violations: 0.00060596
    ACTUAL: array([[      nan,       nan,       nan, ...,       nan,       nan,
                  nan],
          [      nan,       nan,       nan, ...,       nan,       nan,...
    DESIRED: array([[      nan,       nan,       nan, ...,       nan,       nan,
                  nan],
          [      nan,       nan,       nan, ...,       nan,       nan,...
    Checking variable OMEGA
    Dev Path: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_test.nc
    Main Path: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_test.nc
    -------------------------------------
    * Min - dev: -248.597013, main: -248.597015
    * Max - dev: 604.741699, main: 604.741211
    * Mean - dev: 0.446415, main: 0.446415
        * Absolute Mean Diff: 1.3810975812988957e-07
        * Relative Mean Diff: 0.000031%
    * Sum - dev: 26022.445543, main: 26022.439453
        * Absolute Sum Diff: 0.006090177481382852
        * Relative Sum Diff: 0.000023%
Great news! The test and reference files are within tolerance, so the first two of the three plots show no image diffs. Now we need to analyze diff.nc for differences in the third plot, "Model - Observation."
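The kind of tolerance check used in step 1 can be sketched in plain Python. This is not the notebook's code; `compare` is a hypothetical helper that follows the same convention as `numpy.testing.assert_allclose` (`|actual - desired| <= atol + rtol * |desired|`, with NaNs at the same position treated as equal). The sample values are the min, max, and mean reported above plus one NaN.

```python
import math


def compare(actual, desired, rtol=1e-5, atol=0.0):
    """Summarise element-wise mismatches between two equally long sequences."""
    if len(actual) != len(desired):
        raise ValueError("shape mismatch")

    mismatched, max_abs, max_rel = 0, 0.0, 0.0
    for a, d in zip(actual, desired):
        a_nan, d_nan = math.isnan(a), math.isnan(d)
        if a_nan and d_nan:
            continue  # NaN in the same position counts as equal
        if a_nan or d_nan:
            mismatched += 1  # NaN on one side only is a mismatch
            continue
        err = abs(a - d)
        if err > atol + rtol * abs(d):
            mismatched += 1
            max_abs = max(max_abs, err)
            if d != 0:
                max_rel = max(max_rel, err / abs(d))

    return {
        "mismatched": mismatched,
        "total": len(actual),
        "max_abs": max_abs,
        "max_rel": max_rel,
    }


main = [math.nan, 604.741211, -248.597015, 0.446415]
dev = [math.nan, 604.741699, -248.597013, 0.446415]

loose = compare(dev, main, rtol=1e-5)
strict = compare(dev, main, rtol=1e-7)
print(loose)
print(strict)
```

With `rtol=1e-5` nothing is flagged, while the tighter `rtol=1e-7` flags only the max value, which shows how the pass/fail count depends on the chosen tolerance.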

I asked ChatGPT two questions:

Can floating point comparison between the differences of two floating point arrays result in large differences?

Yes, floating-point comparison between the differences of two floating-point arrays can result in large differences due to the following reasons:

  1. Loss of Precision – Floating-point arithmetic is not exact, and small rounding errors can accumulate when computing differences.
  2. Subtraction Amplifies Error – If two nearly equal floating-point numbers are subtracted, the relative error in the result can be significantly larger than the original numbers.
  3. NaN and Masking Effects – Mismatched NaNs, infinities, or slight variations in masked values can cause major differences in computed differences.
  4. Chaining of Differences – If A - B and C - D have small floating-point errors individually, (A - B) - (C - D) can exaggerate those errors, leading to large variations.

Thus, comparing differences of floating-point arrays is unreliable, especially when precision errors and masking differences are involved.

I have a plot called "Model - Observation" which plots the difference of two floating point arrays (model and observation). If I compare the difference in the plots between two branches (main vs. development), should I expect large difference due to shifting nan positions from regridding even if nan counts are the same?

The presence of nan values themselves doesn’t directly cause large numerical differences in the subtraction (i.e., model - observation). However, if regridding causes the positions of these nan values to shift—even if the total count remains the same—it can lead to differences in the resulting "Model - Observation" plot. Here’s why:

  1. Local Data Differences:
    When nan values shift positions between branches, the set of valid (non-nan) data points being subtracted at each grid location can change. Even if the overall count of nan values remains the same, their spatial redistribution may result in different local comparisons. For instance, a cell that is valid in one branch might be nan in the other, and vice versa.

  2. Impact on Interpolation/Regridding:
    Regridding often involves interpolation or other spatial transformations. If the algorithm encounters nan values in different positions, the interpolation might use a different set of neighboring points, leading to subtle or even noticeable differences in the computed values at certain grid points.

  3. Visual Interpretation:
    While the underlying arithmetic operation (subtraction) is straightforward, the visualization can be sensitive to how these differences manifest spatially. Shifting nan positions can change the appearance of the plot, especially near boundaries or regions with steep gradients, even if the overall error statistics (like mean or total error) remain similar.

  4. Consistency Check:
    It’s a good practice to compare the actual arrays (both model and observation) from both branches to see if there are differences beyond just the location of nan values. If the valid numerical data are nearly identical but the nan positions differ, then the plot differences are likely due to how the data are being masked or interpolated during regridding rather than a flaw in the subtraction itself.

In summary:
You should not expect large differences solely due to floating point subtraction. However, if regridding shifts the positions of nan values between the branches, this can lead to changes in the computed differences at specific grid points and alter the appearance of the plot. The differences are a result of the spatial distribution of valid versus invalid data rather than the arithmetic operation itself.

These explanations reinforce that differences are expected in some "Model-Observation" plots.
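A small sketch of the nan-position point, with toy arrays rather than real data: two arrays can have the same nan count while disagreeing about where the nans are, and an element-wise comparison then reports mismatches at those cells.

```python
import math


def nan_positions(values):
    """Return the set of indices holding NaN."""
    return {i for i, v in enumerate(values) if math.isnan(v)}


nan = math.nan
main = [nan, 1.0, 2.0, nan, 4.0]  # toy "main" output
dev = [nan, 1.0, nan, 3.0, 4.0]   # toy "dev" output, one NaN has moved

same_count = len(nan_positions(main)) == len(nan_positions(dev))
only_main = sorted(nan_positions(main) - nan_positions(dev))
only_dev = sorted(nan_positions(dev) - nan_positions(main))

print(same_count)           # the counts match
print(only_main, only_dev)  # but the positions do not
```

Comparing nan counts alone would miss this, which is why checking the positions (or the masks themselves) is the more informative test.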

Comparing diff.nc between codebases is unreliable, but let's still group image diffs by suspected or known causes.

1. Expected due to bug found on old CDAT codebase (here)

Affected variables:

  • MISRCOSP-CLDLOW_TAU1.3_9.4_MISR-ANN-global.png
  • MISRCOSP-CLDLOW_TAU1.3_MISR-ANN-global.png
  • MISRCOSP-CLDLOW_TAU9.4_MISR-ANN-global.png
2. Expected due to RMSE/CORR positioning bug (fixed in this PR)

Affected variables:

  • MERRA2-PSL-ANN-global.png (diff) -- RMSE and CORR text diff
  • ceres_ebaf_surface_v4.1-ALBEDO_SRF-ANN-global.png (diff) -- RMSE/CORR
3. Expected due to nan location mismatches, caused by regridding differences between xCDAT/xESMF and CDAT/ESMF when downscaling variables.

Why are there subtle differences between xESMF and ESMF regridders when downscaling variables?

Subtle differences between xESMF and ESMF regridders when downscaling variables arise due to differences in how each is configured and called, and in precision. While both aim to regrid data, xESMF (a Python package that wraps ESMF through ESMPy, and which xCDAT calls for horizontal regridding) and ESMF (Earth System Modeling Framework) as used through CDAT may differ slightly in interpolation options, handling of missing values, or rounding. These discrepancies, though minor, can propagate and result in small but noticeable variations in the regridded data.

Affected variables:

  • HadISST_CL-SST-ANN-global.png (diff)
  • HadISST_PD-SST-ANN-global.png (diff)
  • HadISST_PI-SST-ANN-global.png (diff)
  • MACv2-AODVIS-ANN-global.png (diff)
  • MERRA2-OMEGA-850-ANN-global.png (diff)
  • MERRA2-T-850-ANN-global.png (diff)
  • MERRA2-TREFMXAV-ANN-global.png (diff)
  • MERRA2-TREFMNAV-ANN-global.png -- (diff) maybe also regional subsetting diff (I think due to "ccb" flag here)
4. Expected due to land-sea mask differences between xCDAT/xESMF and CDAT/ESMF. Details:

    def _apply_land_sea_mask(
        ds: xr.Dataset,
        ds_mask: xr.Dataset,
        var_key: str,
        region: Literal["land", "ocean"],
        regrid_tool: str,
        regrid_method: str,
    ) -> xr.Dataset:
        """Apply a land or sea mask based on the region ("land" or "ocean").

        Parameters
        ----------
        ds: xr.Dataset
            The dataset containing the variable.
        ds_mask : xr.Dataset
            The dataset containing the land sea region mask variable(s).
        var_key : str
            The key the variable
        region : Literal["land", "ocean"]
            The region to mask.
        regrid_tool : {"esmf", "xesmf", "regrid2"}
            The regridding tool to use. Note, "esmf" is accepted for backwards
            compatibility with e3sm_diags and is simply updated to "xesmf".
        regrid_method : str
            The regridding method to use. Refer to [1]_ for more information on
            these options.

            esmf/xesmf options:
              - "bilinear"
              - "conservative"
              - "conservative_normed" -- equivalent to "conservative" in cdms2 ESMF
              - "patch"
              - "nearest_s2d"
              - "nearest_d2s"

            regrid2 options:
              - "conservative"

        Returns
        -------
        xr.Dataset
            The Dataset with the land or sea mask applied to the variable.
        """
        # TODO: Remove this conditional once "esmf" references are updated to
        # "xesmf" throughout the codebase.
        if regrid_tool == "esmf":
            regrid_tool = "xesmf"

        # TODO: Remove this conditional once "conservative" references are updated
        # to "conservative_normed" throughout the codebase.
        # NOTE: this is equivalent to "conservative" in cdms2 ESMF. If
        # "conservative" is chosen, it is updated to "conservative_normed". This
        # logic can be removed once the CoreParameter.regrid_method default
        # value is updated to "conservative_normed" and all sets have been
        # refactored to use this function.
        if regrid_method == "conservative":
            regrid_method = "conservative_normed"

        # A dictionary storing the specifications for this region.
        specs = REGION_SPECS[region]

        # If the region is land or ocean, regrid the land sea mask to the same
        # shape (lat x lon) as the variable then apply the mask to the variable.
        # Land and ocean masks have a region value which is used as the upper limit
        # for masking.
        ds_new = ds.copy()
        ds_new = _drop_unused_ilev_axis(ds)
        output_grid = ds_new.regridder.grid
        mask_var_key = _get_region_mask_var_key(ds_mask, region)

        ds_mask_new = _drop_unused_ilev_axis(ds_mask)
        ds_mask_regrid = ds_mask_new.regridder.horizontal(
            mask_var_key,
            output_grid,
            tool=regrid_tool,
            method=regrid_method,
        )

        # Update the mask variable with a lower limit. All values below the
        # lower limit will be masked.
        land_sea_mask = ds_mask_regrid[mask_var_key]
        lower_limit = specs["value"]  # type: ignore
        cond = land_sea_mask > lower_limit

        # Apply the mask with a condition (`cond`) using `.where()`. Note, the
        # condition matches values to keep, not values to mask out, `drop` is
        # set to False because we want to preserve the masked values (`np.nan`)
        # for plotting purposes.
        masked_var = ds_new[var_key].where(cond=cond, drop=False)
        ds_new[var_key] = masked_var

        return ds_new

There seems to be a minor difference in nan locations for the regridded land-sea mask produced using xCDAT + xESMF vs. CDAT + ESMF (method="bilinear"). xCDAT has a little bit more nan values than CDAT. This causes x and y nan location mismatch when comparing array values. I don't know the exact reason why there are differences, but they seem insignificant enough to accept.

The absolute sum, mean, and nan count values are similar across variables. The diff plots also show no visible differences. All of the scripts for debugging these variables can be found here.
-- #794
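One plausible mechanism for the nan location mismatch is sketched below. It is an assumption for illustration, not a confirmed root cause: the mask is applied by thresholding a regridded land/ocean fraction (`cond = land_sea_mask > lower_limit` in the function above), so two regridders that disagree by a tiny amount near the threshold will mask different cells. The threshold `0.65` and all array values here are hypothetical.

```python
import math


def apply_mask(values, mask, lower_limit):
    """Keep values where mask > lower_limit, else NaN (mimics `.where(cond)`)."""
    return [v if m > lower_limit else math.nan for v, m in zip(values, mask)]


values = [281.0, 283.5, 290.2, 295.1]

# Hypothetical regridded land fractions from two tools; they differ by 2e-4
# in the second cell only, which happens to straddle the threshold.
mask_a = [0.10, 0.6501, 0.90, 1.00]
mask_b = [0.10, 0.6499, 0.90, 1.00]
LOWER_LIMIT = 0.65  # assumed threshold, for illustration only

out_a = apply_mask(values, mask_a, LOWER_LIMIT)
out_b = apply_mask(values, mask_b, LOWER_LIMIT)

nan_count_a = sum(math.isnan(v) for v in out_a)
nan_count_b = sum(math.isnan(v) for v in out_b)
print(nan_count_a, nan_count_b)  # the second mask produces one more NaN
```

If this is what happens, the affected cells would be concentrated along coastlines, where the regridded fraction is closest to the threshold.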

Affected variables:

  • ERA5-TREFHT-ANN-land.png (diff)
  • MERRA2-TAUXY-ANN-ocean.png (diff)
  • MERRA2-TREFHT-ANN-land.png (diff)
5. Expected due to differences between xCDAT/xESMF vs. CDAT/ESMF when downscaling variables. These affected variables have large diffs, so I did further investigation.

Affected variables:

  • ERA5_ext-QREFHT-ANN-global.png (diff)
  • ERA5_ext-U10-ANN-global.png (diff)
  • GPCP_v3.2-PRECT-ANN-global.png (diff)

Further debugging to validate expected for affected variables:

  1. Compare the test and ref files within a relative tolerance of 1e-5
    Results: All pass
Relative tolerance: 1e-05, Absolute tolerance: 0
Comparing test.nc and ref.nc files
==================================================
QREFHT - ref.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_ref.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_ref.nc
Arrays are within relative tolerance.

QREFHT - test.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_test.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_test.nc
Arrays are within relative tolerance.

U10 - ref.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_ref.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_ref.nc
Arrays are within relative tolerance.

U10 - test.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_test.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_test.nc
Arrays are within relative tolerance.

PRECT - ref.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_ref.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_ref.nc
Arrays are within relative tolerance.

PRECT - test.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_test.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_test.nc
Arrays are within relative tolerance.
  2. Compare the diff.nc files used for the "Model - Observation" plot
    Result: Large differences
Comparing diff.nc files
==================================================
QREFHT - diff.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_diff.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_diff.nc

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 64795 / 64800 (100%)
Max absolute difference among violations: 2.54303768
Max relative difference among violations: 1123.35406349
 ACTUAL: array([[0.095432, 0.09547 , 0.095508, ..., 0.142239, 0.142297, 0.142357],
       [0.116816, 0.116988, 0.117155, ..., 0.164408, 0.164593, 0.164751],
       [0.130151, 0.130483, 0.130807, ..., 0.177853, 0.178253, 0.178665],...
 DESIRED: array([[0.095825, 0.095806, 0.095784, ..., 0.142794, 0.142797, 0.142803],
       [0.117481, 0.117646, 0.117798, ..., 0.165032, 0.165229, 0.165408],
       [0.130066, 0.130319, 0.130569, ..., 0.178232, 0.178459, 0.178696],...

U10 - diff.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_diff.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_diff.nc

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 64797 / 64800 (100%)
Max absolute difference among violations: 2.318882
Max relative difference among violations: 1363.4496
 ACTUAL: array([[-1.973981, -1.982043, -1.990005, ..., -1.485713, -1.496813,
        -1.507838],
       [-1.794035, -1.803559, -1.810186, ..., -1.385356, -1.391502,...
 DESIRED: array([[-1.766862, -1.777283, -1.787618, ..., -1.275891, -1.28755 ,
        -1.29917 ],
       [-1.732177, -1.732711, -1.732202, ..., -1.345124, -1.346935,...

PRECT - diff.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_diff.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_diff.nc

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 64796 / 64800 (100%)
Max absolute difference among violations: 4.72772718
Max relative difference among violations: 773616.0423584
 ACTUAL: array([[-0.14976 , -0.135835, -0.128921, ..., -0.128685, -0.118462,
        -0.120224],
       [-0.053909, -0.051243, -0.051175, ..., -0.03275 , -0.023618,...
 DESIRED: array([[-0.100752, -0.087916, -0.081792, ..., -0.078111, -0.072161,
        -0.072393],
       [-0.075459, -0.068423, -0.064549, ..., -0.056744, -0.045891,...
  3. Compare stats for diff.nc files
    Results -- All close enough to be comfortable (I think)
Comparing stats of diff.nc files
==================================================
QREFHT - diff.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_diff.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-QREFHT-ANN-global_diff.nc
           min       max      mean           sum  nan_count
main -6.794676  5.828569  0.248934  16130.897162          0
dev  -7.093775  5.758040  0.250263  16217.108398          0
U10 - diff.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_diff.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/ERA5/ERA5_ext-U10-ANN-global_diff.nc
           min       max      mean          sum  nan_count
main -5.607303  8.402979  0.035228  2282.743652          0
dev  -5.130340  7.829661  0.035636  2309.217285          0
PRECT - diff.nc
--------------------------------------------------
Main: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_diff.nc
Dev: /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/GPCP_v3.2/GPCP_v3.2-PRECT-ANN-global_diff.nc
           min       max      mean          sum  nan_count
main -7.050051  8.961453  0.080120  5191.758895          0
dev  -8.213846  8.937169  0.080524  5217.875488          0
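To put "close enough" into numbers, the sketch below computes the relative change in the mean between main and dev, using only the means reported in the tables above. `summarize` is a hypothetical helper showing how such a stats row could be produced while ignoring nans; it is not the notebook's code.

```python
import math


def summarize(values):
    """Return min, max, mean, sum and NaN count, ignoring NaNs in the stats."""
    valid = [v for v in values if not math.isnan(v)]
    total = math.fsum(valid)
    return {
        "min": min(valid),
        "max": max(valid),
        "mean": total / len(valid),
        "sum": total,
        "nan_count": len(values) - len(valid),
    }


def relative_change(main, dev):
    """Relative change of dev with respect to main."""
    return abs(dev - main) / abs(main)


# Means of diff.nc as reported above, stored as (main, dev).
reported_means = {
    "QREFHT": (0.248934, 0.250263),
    "U10": (0.035228, 0.035636),
    "PRECT": (0.080120, 0.080524),
}

changes = {name: relative_change(m, d) for name, (m, d) in reported_means.items()}
for name, change in changes.items():
    print(f"{name}: mean differs by {change:.2%}")

# Toy input showing the shape of one stats row.
toy = summarize([1.0, math.nan, 3.0])
print(toy)
```

The means differ by roughly 0.5% to 1.2%, far smaller than the element-wise relative differences reported in step 2.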

Step-by-step for Debugging

  1. Create a run script with a .cfg that isolates the variables below
  2. Run the run_script using v2.12.1 and v3.0.0rc2
  3. Compare the .nc differences for test.nc and ref.nc files (regression_nc.ipynb)
  4. Compare .nc differences for diff.nc files (regression_nc_diffs.ipynb)
  5. Debug differences

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added the bug Bug fix (will increment patch version) label Feb 3, 2025
@tomvothecoder tomvothecoder self-assigned this Feb 3, 2025
@@ -743,7 +743,7 @@ def _add_rmse_corr_text(
     fontdict = {"fontsize": fontsize}
 
     if left_text_pos is None:
-        left_text_pos = (0.6335, -0.0105)
+        left_text_pos = (0.6635, -0.0105)
Collaborator Author

Aligns with

Contributor

I suppose this will also fix the other three-panel plots that use lat_lon_plot.py? I see the same problem in other sets, where the RMSE and CORR strings are misplaced and some overlap with the figures.

Contributor

Okay, I checked, lat_lon, zonal_mean_2d/stratosphere, meridional_mean_2d, polar will all be fixed.

@chengzhuzhang chengzhuzhang self-requested a review February 6, 2025 18:37
@tomvothecoder
Collaborator Author

tomvothecoder commented Feb 6, 2025

@chengzhuzhang For the OMI-MLS-TCO-ANN-60S60N variable, we are getting mismatching shapes for the diff.nc file.
Any suggestions on what we should do here? Related to #739.

Root cause

To determine the lower resolution for regridding, regrid_to_lower_res() in v2.12.1 compares the "X" axes, while align_grids_to_lower_res() in v3.0.0 compares the "Y" axes.

Debugging

Test var has a shape of (120, 360) while ref var has a shape of (120, 288).

  • v2.12.1 -- diff.nc shape of (120, 288)
    • Compares the "X" axes to determine the lower resolution (<=)
    • Ref var is considered low resolution due to "X" of 288.
    • Test var is regridded from (120, 360) to (120, 288)
  • v3.0.0rc1/rc2 -- diff.nc shape of (120, 360)
    • Compares the "Y" axes to determine the lower resolution (<=)
    • Test and ref vars have equal "Y" of 120.
    • Ref var is regridded from (120, 288) to (120, 360).
Comparing:
    * /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-branch-930-zppy-diffs/lat_lon/OMI-MLS/OMI-MLS-TCO-ANN-60S60N_diff.nc
    * /lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-04-main-zppy-diffs/lat_lon/OMI-MLS/OMI-MLS-TCO-ANN-60S60N_diff.nc
    * var_key: TCO
    
Not equal to tolerance rtol=0.0001, atol=0

(shapes (120, 360), (120, 288) mismatch)
 ACTUAL: array([[4.909916, 5.552422, 6.897028, ..., 7.011181, 8.307238, 6.755878],
       [7.468704, 7.0319  , 7.010069, ..., 4.298189, 5.202745, 6.788332],
       [5.496756, 6.867319, 7.179531, ..., 5.23127 , 5.508911, 6.141758],...
 DESIRED: array([[2.213444, 1.642891, 2.503925, ..., 5.836899, 5.549273, 6.050253],
       [1.589104, 2.051823, 1.479935, ..., 5.849903, 5.876003, 5.855825],
       [1.357033, 1.710564, 1.29594 , ..., 5.891077, 6.22814 , 5.934219],...
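The shape mismatch follows directly from which axis is compared. The sketch below is a simplified stand-in for the two functions, not their real implementation: `lower_res_by_axis` is a hypothetical helper, and the tie-breaking rule (`<=` favours the test variable) is inferred from the behaviour described above.

```python
def lower_res_by_axis(test_shape, ref_shape, axis):
    """Return which variable ("test" or "ref") is treated as lower resolution.

    Shapes are (lat, lon). `axis` is "Y" (lat) or "X" (lon). The comparison
    uses <=, so a tie is resolved in favour of the test variable.
    """
    idx = {"Y": 0, "X": 1}[axis]
    return "test" if test_shape[idx] <= ref_shape[idx] else "ref"


test_shape = (120, 360)
ref_shape = (120, 288)
shapes = {"test": test_shape, "ref": ref_shape}

# v2.12.1 behaviour as described: compare the "X" axes.
by_x = lower_res_by_axis(test_shape, ref_shape, "X")
# v3.0.0 behaviour as described: compare the "Y" axes.
by_y = lower_res_by_axis(test_shape, ref_shape, "Y")

print(by_x, shapes[by_x])  # target grid when comparing "X"
print(by_y, shapes[by_y])  # target grid when comparing "Y"
```

Comparing "X" selects the ref grid, so diff.nc ends up as (120, 288); comparing "Y" ties at 120 and selects the test grid, so diff.nc ends up as (120, 360).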

@chengzhuzhang
Contributor

@tomvothecoder Thank you for the nice work so far. I walked through the other sets and noticed problems that we should investigate and try to fix.

viewer/: For the Taylor diagram, the TREFHT land results are off; this should be related to different masking.
polar/: The difference plots have problems at the edge.
meridional_mean_2d/: Max of the middle panel = that of the upper panel, and RMSE is very different compared to the original version.
zonal_mean_2d/ and zonal_mean_2d_stratosphere/: Metrics problem: max of the middle panel = that of the upper panel, RMSE is always 0, and mean is nan.

Cosmetic:
aerosol_aeronet/: Lat and lon labels differ.
zonal_mean_xy/: The y-axis name uses the variable name instead of the long name, compared to the old version, for ERA5-Z3 and T.

@chengzhuzhang
Contributor

> @chengzhuzhang For the OMI-MLS-TCO-ANN-60S60N variable, we are getting mismatching shapes for the diff.nc file.
> Any suggestions on what we should do here? Related to #739.

I checked the diff figures. Despite the resolution difference, the metrics and results are very close. I think we can address #739 at a later time.

Labels
bug Bug fix (will increment patch version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Debug zppy diffs for v3.0.0
2 participants