Reproducibility fails with 0.4.0 build and the latest configuration #266
Comments
You may find the 3rd line in each |
Do you have |
Good question. No, they are not the same for some parameters. Some parameters that were default false in 0.3.0 are now default true! I’ll check them in detail, ensure they are the same and run another test to see if they are consistent. |
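A check like this can be sketched in a few lines of Python. This is only a minimal illustration, assuming the parameters are available as plain text in a `NAME = value ! comment` layout; the parameter names and values below are made up and are not taken from either build.

```python
import re

# Matches "NAME = value" lines and ignores anything after a "!" comment marker.
_PARAM_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_%]*)\s*=\s*([^!]*)")


def parse_params(text):
    """Return {name: value} for every 'NAME = value' line in text."""
    params = {}
    for line in text.splitlines():
        match = _PARAM_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


def diff_params(old, new):
    """Return {name: (old_value, new_value)} for parameters that differ."""
    names = sorted(set(old) | set(new))
    return {
        name: (old.get(name), new.get(name))
        for name in names
        if old.get(name) != new.get(name)
    }


# Hypothetical parameter names and values, for illustration only.
old_text = """
EXAMPLE_FLAG = False   !   [Boolean] default = False
EXAMPLE_DT = 1350.0    !   [s]
"""
new_text = """
EXAMPLE_FLAG = True    !   [Boolean] default = True
EXAMPLE_DT = 1350.0    !   [s]
"""

changes = diff_params(parse_params(old_text), parse_params(new_text))
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

A parameter present in only one file shows up with `None` on the other side, so newly added parameters are flagged along with changed defaults.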
The parameters below have been modified to ensure consistency with the MOM parameters, but the bfb repro still fails. Here's where the differences are: https://github.com/minghangli-uni/access-om3-doc/blob/compare_0.3.0_0.4.0/tables/MOM_parameter_all_0.3.0-0.4.0/MOM6-CICE6-0.3.0-0.4.0_nml_diff.md
|
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/cosima-twg-announce/401/56 |
If we stored up-to-date |
Noting for documentation purposes that this list is not correct.
There is actually only one change to the MOM6 parameters needed to reproduce MOM6 answers (see #270 (comment)):
However, even with this parameter set, ACCESS-OM3 is not reproducible across 0.3.1 and 0.4.0 due to reproducibility-breaking changes in CICE, CMEPS and CDEPS - see ACCESS-NRI/access-om3-configs#173 |
@minghangli-uni - did you look at reproducibility over restarts at all? I tried running the model-config-tests restart reproducibility test and it also failed:
Key points: I did 2x1 day runs and one 2 day run. Results at the end of the first day seem the same. Some fields at the end of the second day are different. The error fields go back to 0 at the start of the second day in the 2x1 day run. Is that expected? It still appears not to be very good at counting 24 hours! |
Are these from 0.3.0 or 0.4.0? Restart repro works in 0.3.0 right? I haven’t checked 0.4.0 yet.
Yes, the error terms (Frac Mass Err, Salin Err, Temp Err) are recalculated independently for each run. They don't persist across runs because they are used to track errors within a single run, not across multiple runs. We should focus on the terms beforehand. There should be a separate restart repro check for each component. I don't think it comes from MOM6, but it might come from other components. |
Rats. I spy another rabbit hole |
I confirmed the result using the CI: see ACCESS-NRI/access-om3-configs#177 The failure is comparing 2x1 day consecutive runs to 1x2 day run There are three issues at a first glance:
|
I think the implementation of the test should handle these fine since the test looks for the long-run checksums in the short-run checksums (not to say that things couldn't be improved for clarity)
Yeah, this is a problem... |
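The idea of looking for the long-run checksums in the short-run checksums can be sketched as below. This is not the model-config-tests implementation; the function name, the time labels and the checksum strings are all hypothetical and only illustrate the comparison.

```python
def missing_checksums(long_run, short_runs):
    """Return long-run entries that are absent from, or differ in, short_runs.

    Both arguments map a model time label to a checksum string. The result
    maps each offending label to (long_run_checksum, short_run_checksum),
    where the second item is None if the label is missing from short_runs.
    """
    return {
        time: (checksum, short_runs.get(time))
        for time, checksum in long_run.items()
        if short_runs.get(time) != checksum
    }


# Hypothetical checksums keyed by a hypothetical model-day label.
two_by_one_day = {"day0": "aaa", "day1": "bbb", "day2": "ccc"}
one_by_two_day = {"day0": "aaa", "day2": "ccd"}

mismatches = missing_checksums(one_by_two_day, two_by_one_day)
print(mismatches)
```

Only the labels present in the long run are checked, so extra entries in the short runs (such as the intermediate restart time) do not cause a failure by themselves.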
Why not compare the hashes of the restarts in the manifests? That should detect any failure of reproducibility, making tests of ocean.stats redundant (except that ocean.stats indicates repro breakdown with finer temporal resolution within a run rather than just detecting it at the end). One wrinkle is that restart manifests track the initial rather than final state, so would need to update the manifests before comparing (does |
It seems like an overly stringent criteria to ? e.g. there's no particular reason for metadata in restart files to be the same for different runs |
But in practice (with om2 at least), I could compare restart md5 hashes to check reproducibility COSIMA/access-om2#266 so I guess there are no timestamps or whatnot in the restarts. Is that still the case with OM3 restarts? (IIRC binhash can't be used because it includes the modification date https://payu.readthedocs.io/en/latest/manifests.html#manifest-contents, but md5 was ok) |
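A rough sketch of that kind of comparison, hashing the restart files directly instead of reading them from the manifests, might look like this. The directory layout and function names are assumptions, and the approach only works if the restart files carry no run-specific metadata such as timestamps.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_restarts(restart_dir):
    """Map each file's path relative to restart_dir to its md5 digest."""
    root = Path(restart_dir)
    return {
        str(path.relative_to(root)): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_restarts(dir_a, dir_b):
    """Return sorted relative paths whose digests differ or exist on one side only."""
    hashes_a, hashes_b = hash_restarts(dir_a), hash_restarts(dir_b)
    return sorted(
        name
        for name in set(hashes_a) | set(hashes_b)
        if hashes_a.get(name) != hashes_b.get(name)
    )
```

Calling `differing_restarts` on the final restart directories of two runs (paths depend on the experiment layout) returns an empty list when every restart file is byte-identical.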
As @anton-seaice pointed out to me at lunch, I'm of course wrong about the first point not being an issue. I'll update the tests to get this sorted. |
I think MOM6 restarts can be timestamped, but with the |
I also think we should be including experiment name, experiment id, author and contact in ALL output, which would also break comparing md5 hashes of restart files. Also, there were issues reported with netcdf4 files produced by the ParallelIO library with identical contents having different checksums. I don't think they have been resolved (but nor are there many details around). |
Or is breaking md5 a reason for not including all of this? (in restarts I mean, not outputs)
Thanks, I hadn't been aware of that. Is it due to unpredictable chunk ordering or something? |
Below are the 5-day ocean.stats results highlighting the repro issue (an issue is raised here: #265) when comparing the 0.3.0 and 0.4.0 builds with the 0.25 ryf configuration.
Using this existing configuration ACCESS-NRI/access-om3-configs@5842715 and the 0.3.0 build, ocean.stats shows:
Using this updated configuration ACCESS-NRI/access-om3-configs#169 with the 0.4.0 build, reproducibility fails despite having passed the existing repro test. For this specific case, 33 truncations occur on the 2nd day. But when using the latest topography and grid, the problem worsens, with over 20,000 truncation errors popping up ... (not shown here)
I suspected the issue might be related to the CICE grid, so I reverted to the old ice grid while keeping the 0.4.0 build. No truncations were then observed, but reproducibility failures persist...
So does this mean the issue is likely not related to the CICE grid alone, but also to the latest 0.4.0 build? Or is this what 0.4.0 intentionally does? ping @anton-seaice @dougiesquire @chrisb13