
Commit

Merge pull request #2 from nicholasbalasus/dev
merge dev into main
nicholasbalasus authored Jul 11, 2023
2 parents 78583e5 + ac2cff3 commit 6a70d95
Showing 28 changed files with 2,671 additions and 3,221 deletions.
41 changes: 15 additions & 26 deletions README.md
@@ -1,26 +1,15 @@
# Blended TROPOMI GOSAT Methane Product
The entire project can be run by running `run.sh` (e.g., `sbatch -J bash -p seas_compute -t 3-00:00 --mem 1000 --wrap "bash run.sh" --output run.out`). Which scripts run, where they run, where the outputs are saved, and everything else are controlled by `config.yml`. All of the code that I have written is in `src/`, while code from others is in `tools/`.

The project is broken into modules for downloading data, processing data, and writing data. Each of these modules is broken down below with approximations for their run times (run on the `serial_requeue` partition of Harvard's Cannon cluster), number of cores requested, and total amount of memory requested. Because I was using `serial_requeue`, the resources requested are large. These can be reduced in exchange for longer run times. At the end of the project, the storage directory specified in `config.yml` will be ~1.3 TB. After all of the modules have been run, `notebooks/paper.ipynb` can be run to make the figures.

* Module 1: Download data
* **Download_GOSAT**: download GOSAT level 2 data from UoL for 2018-2021 (~5 minutes, 1 core, 4 GB).
* **Download_TROPOMI**: download TROPOMI level 2 data for 2018-2021 from the SRON ftp (~200 minutes, 8 cores, 32 GB).
* **Download_TCCON**: download TCCON data from tccondata.org (~1 minute, 1 core, 4 GB).

* Module 2: Process data
* **Process_GOSAT**: process all daily netCDF GOSAT data to one dataframe (~210 minutes, 1 core, 8 GB).
* **Process_TROPOMI**: process each netCDF TROPOMI file to a pickled dataframe (~100 minutes, 1024 cores, 3072 GB).
  * **Pair_TROPOMI_GOSAT**: pair TROPOMI and GOSAT measurements with time and distance thresholds specified in `config.yml` (~1000 minutes, 1024 cores, 3072 GB).
* **Process_TROPOMI_GOSAT_Pairs**: concatenate all pairs and calculate delta(TROPOMI-GOSAT) (~10 minutes, 1 core, 160 GB).
* **Pair_GOSAT_TCCON**: make dataframes of GOSAT/TCCON pairs (with and without global GOSAT offset) for each TCCON site (~2 minutes, 25 cores, 160 GB).
* **Run_FLAML**: train models to predict delta(TROPOMI-GOSAT) (~90 minutes, 8 cores, 64 GB).
* **Predict_Delta_GOSAT_TROPOMI**: predict and remove delta(TROPOMI-GOSAT) from all TROPOMI data (~10 minutes, 1024 cores, 3072 GB).
* **Pair_TROPOMI_TCCON**: make dataframes of TROPOMI/TCCON pairs and Blended/TCCON pairs for each TCCON site (~100 minutes, 50 cores, 250 GB).
* **SHAP_Explainer**: make SHAP explainer and calculate shap values for train data (~320 minutes, 1 core, 64 GB).

* Module 3: Write data
* **Write_NetCDF**: write netCDF files that mimic the original TROPOMI data but add a variable for the blended product (~5 minutes, 512 cores, 1536 GB).
  * **Paired_Regrid**: regrid the TROPOMI and GOSAT pairs to a standard grid (~180 minutes, 1 core, 128 GB).
* **TROPOMI_Regrid**: regrid the TROPOMI data to a standard grid (~1100 minutes, 1 core, 200 GB).
* **Oversample_TROPOMI**: for specific regions, oversample TROPOMI data to 0.01 degrees (~600 minutes, 4 cores, 1200 GB).
# Blended TROPOMI+GOSAT Methane Product
The entire project can be run by running `run.sh` (e.g., `sbatch -J bash -p huce_intel -t 14-00:00 --mem 32000 --wrap "bash run.sh" --output run.out`). Which scripts run, where they run, where the outputs are saved, and everything else are controlled by `config.yml`. All of the code that I have written is in `src/`, while code from others is in `tools/`.

The project is broken into modules, each with approximations for their run times (run on the `huce_ice` partition of Harvard's Cannon cluster), number of cores requested, and total amount of memory requested. At the end of the project, the storage directory specified in `config.yml` will be ~1 TB. After all of the modules have been run, `notebooks/paper.ipynb` can be run to make the figures.

0. **Make_Conda_Env**: make or update the conda environment specified in `environment.yml`.
1. **Download_GOSAT**: download GOSAT data from UoL for 2018-2021 (~5 minutes, 1 core, 4 GB).
2. **Download_TROPOMI**: download operational TROPOMI data for 2018-2021 from the Copernicus hub (~1.5 days, 8 cores, 32 GB).
3. **Download_TCCON**: download TCCON data from tccondata.org (~1 minute, 1 core, 4 GB).
4. **Calculate_Delta_GOSAT_TCCON**: for each TCCON station, find GOSAT pairs for TCCON observations and calculate delta(GOSAT-TCCON) (~3.5 hours, 25 cores, 500 GB).
5. **Calculate_Delta_TROPOMI_GOSAT**: pair TROPOMI and GOSAT measurements and calculate delta(TROPOMI-GOSAT) (~4 days, 64 cores, 500 GB).
6. **Run_FLAML_SHAP**: train models to predict delta(TROPOMI-GOSAT) then run SHAP (~4.5 hours, 8 cores, 64 GB).
7. **Write_Blended_Files**: write netCDF files with an added variable for `methane_mixing_ratio_blended` (~1.5 hours, 16 jobs at 64 cores, 500 GB).
8. **Calculate_Delta_TROPOMI_TCCON**: for each TCCON station, find TROPOMI pairs for TCCON observations and calculate delta(TROPOMI-TCCON) (~10 hours, 25 jobs at 64 cores, 500 GB).
9. **Oversample_TROPOMI**: oversample the TROPOMI and Blended data to a 0.01 degree grid for 2021 (~8.5 hours, 2 cores, 500 GB).
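Each module above runs only when its boolean flag in `config.yml` is true. A minimal Python sketch of that dispatch logic (the module names come from this repo's config; the parsing helper itself is a hypothetical stand-in for whatever `run.sh` actually does):

```python
# Pipeline order matters: downloads feed the pairing steps,
# which feed the model training and the blended output files.
MODULES = [
    "Make_Conda_Env",
    "Download_GOSAT",
    "Download_TROPOMI",
    "Download_TCCON",
    "Calculate_Delta_GOSAT_TCCON",
    "Calculate_Delta_TROPOMI_GOSAT",
    "Run_FLAML_SHAP",
    "Write_Blended_Files",
    "Calculate_Delta_TROPOMI_TCCON",
    "Oversample_TROPOMI",
]

def enabled_modules(config_text):
    """Return the modules flagged true in config.yml text, in pipeline order."""
    flags = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # only lines of the form "Key: value"
            flags[key.strip()] = value.strip().lower() == "true"
    return [m for m in MODULES if flags.get(m)]
```

With every flag set to true (as in this commit's `config.yml`), all ten modules run in order; flipping individual flags to false lets you rerun a single stage.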
36 changes: 13 additions & 23 deletions config.yml
@@ -1,32 +1,22 @@
RunDir: "/n/home06/nbalasus/blended_tropomi_gosat_methane"
StorageDir: "/n/holylfs05/LABS/jacob_lab/nbalasus/blended_tropomi_gosat_methane"
Partition: serial_requeue
Cores: 16
CondaEnv: ch4_env
Partition: huce_ice
CondaEnv: ch4_env
Debug: true

TimeThreshold: 60
DistanceThreshold: 5
GlobalOffsetGOSAT: 8.9
TimeFLAML: 1800
GlobalOffsetGOSAT: 9.2
TimeFLAML: 3600
Model: lgbm
a: 1.26
b: 0.13
a: 1.18
b: -0.40

Make_Conda_Env: true
Download_GOSAT: true
Download_TROPOMI: true
Download_TCCON: true
Process_GOSAT: true
Process_TROPOMI: true
Pair_TROPOMI_GOSAT: true
Process_TROPOMI_GOSAT_Pairs: true
Pair_GOSAT_TCCON: true
Run_FLAML: true
Predict_Delta_GOSAT_TROPOMI: true
Pair_TROPOMI_TCCON: true
SHAP_Explainer: true
Write_NetCDF: true
Paired_Regrid: true
GOSAT_Regrid: true
TROPOMI_Regrid: true
Oversample_TROPOMI: true
Calculate_Delta_GOSAT_TCCON: true
Calculate_Delta_TROPOMI_GOSAT: true
Run_FLAML_SHAP: true
Write_Blended_Files: true
Calculate_Delta_TROPOMI_TCCON: true
Oversample_TROPOMI: true
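The `TimeThreshold` and `DistanceThreshold` values above gate the TROPOMI/GOSAT pairing. A hedged sketch of that check, assuming the thresholds are in minutes and kilometres (the repo pins `geopy` for distances; a plain haversine is used here to keep the sketch dependency-free):

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

TIME_THRESHOLD = timedelta(minutes=60)  # TimeThreshold: 60 (assumed minutes)
DISTANCE_THRESHOLD_KM = 5.0             # DistanceThreshold: 5 (assumed km)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def is_pair(tropomi_obs, gosat_obs):
    """True if two observations satisfy both config.yml thresholds."""
    close_in_time = abs(tropomi_obs["time"] - gosat_obs["time"]) <= TIME_THRESHOLD
    close_in_space = haversine_km(
        tropomi_obs["lat"], tropomi_obs["lon"],
        gosat_obs["lat"], gosat_obs["lon"],
    ) <= DISTANCE_THRESHOLD_KM
    return close_in_time and close_in_space
```

The observation dicts here are illustrative; the actual pairing in `src/` operates on dataframes, but the same two-threshold test applies per pair.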
29 changes: 15 additions & 14 deletions environment.yml
@@ -1,22 +1,23 @@
name: ch4_env
name: blnd_env
channels:
- conda-forge
- defaults
dependencies:
- python=3.9
- matplotlib=3.5.2
- numpy=1.21.5
- jupyterlab=3.4.0
- ipykernel=6.15.2
- matplotlib=3.7.1
- numpy=1.23.5
- jupyterlab=3.6.1
- ipykernel=6.19.2
- notebook=6.5.2
- jupyter=1.0.0
- netcdf4=1.5.8
- pandas=1.4.2
- cartopy=0.20.2
- pyproj=3.3.1
- scipy=1.7.3
- netcdf4=1.6.3
- pandas=1.5.3
- cartopy=0.21.1
- pyproj=3.4.1
- scipy=1.10.0
- shap=0.41.0
- geopy=2.2.0
- geopandas=0.10.2
- flaml=1.0.14
- pyyaml=6.0
- geopy=2.3.0
- geopandas=0.12.2
- flaml=1.1.3
- pyyaml=6.0
- scikit-learn=1.2.0
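After **Make_Conda_Env** builds this environment, a quick sanity check is to compare installed versions against the pins. A minimal sketch (the naive line parser and the assumption that conda package names match their installed distribution names are both simplifications):

```python
from importlib.metadata import version, PackageNotFoundError

def check_pins(env_yaml_text):
    """Return {name: (pinned, installed)} for every '- name=ver' pin that mismatches.

    Naive line-based parse; assumes the flat dependency list shown above.
    """
    mismatches = {}
    for line in env_yaml_text.splitlines():
        line = line.strip()
        if not line.startswith("- ") or "=" not in line:
            continue  # skip channels, the name field, and unpinned entries
        name, _, pinned = line[2:].partition("=")
        if name == "python":
            continue  # the interpreter itself is not an installed distribution
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

An empty return value means every pinned package is present at the expected version.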
