Assorted updates for GEOS-Chem 1-year benchmarking -- closes #163 #164

Merged
merged 69 commits into from
Oct 5, 2022

@lizziel lizziel commented Sep 22, 2022

This update includes the implementation of diff-of-diff plots for 1-year GCHP benchmarks, among other things. Here is a list of the updates:

  1. Use lat/lon 1x1.25 for GCHP vs GCHP comparisons in the 1-year benchmark. This avoids an issue with cubed-sphere to cubed-sphere regridding that is encountered in the 14.0 1-yr GCHP benchmark because its grid resolution differs from that of the last 1-yr GCHP benchmark (c24 vs c48).
  2. Regrid GCHP ref and GCHP dev to C48 using the sparselt package prior to taking the difference for 1-yr benchmark diff-of-diffs plots. This is a simpler way to regrid than using the esmf python tools implemented in GCPy. Previously no regridding was done prior to computing differences for diff-of-diffs since GCHP ref and dev had the same grid resolution. This is not the case for the GCHP 1-yr benchmark of 14.0. It will be the case again moving forward, however, at which point this code will no longer be needed.
  3. Fix bugs in 1-yr GCHP benchmark code that have to do with restart file paths
  4. Set a single benchmark results directory for 1-yr benchmarks and put all benchmark artifacts in that folder, including both GCHP comparisons and GCC vs GCC comparisons. Previously the GCC vs GCC comparisons were written to the GCC dev folder and all GCHP comparisons were written to the GCHP dev folder. This was problematic when running the benchmark on other people's data (permission issues when writing to their rundir) or when using data not stored in a run directory.
  5. Append load() to the calls that open data, to avoid dask arrays when reading via open_mfdataset. Dask arrays cause a problem in sparselt.
  6. Comment out the use of a fake dimension for GCHP vs GCC diff-of-diffs plots. This code was added somewhat recently for the case where ref and dev do not have the same time. However, it causes problems when there is more than one time in the file. The quick fix is to disable this feature.
  7. Add the implementation of diff-of-diffs to 1-year benchmark code. Previously it was only implemented for 1-month benchmarks. The 1-yr benchmarks are different in that they have 12 months of data (time series length 12), rather than a single time in the files.
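
As a rough illustration of items 2 and 7, here is a minimal sketch of a diff-of-diffs over a 12-month time series. It assumes the quantity is formed as (GCHP dev - GCHP ref) - (GCC dev - GCC ref); plain nested lists stand in for the gridded xarray data, and the function names and values are made up for illustration, not taken from GCPy.

```python
# Sketch only: nested lists stand in for gridded data on a common grid.
# The definition (GCHP dev - GCHP ref) - (GCC dev - GCC ref) is an
# assumption, and all names and values here are hypothetical.

def diff(dev, ref):
    """Element-wise dev - ref for data laid out as [time][cell]."""
    return [[d - r for d, r in zip(dev_t, ref_t)]
            for dev_t, ref_t in zip(dev, ref)]


def diff_of_diffs(gchp_dev, gchp_ref, gcc_dev, gcc_ref):
    """Difference of the two dev - ref differences, per time and cell.

    All four inputs must already share one grid, which is why ref and
    dev are regridded to a common resolution before differencing.
    """
    return diff(diff(gchp_dev, gchp_ref), diff(gcc_dev, gcc_ref))


# 12 monthly time slices with 2 grid cells each (made-up values).
n_months = 12
gcc_ref = [[1.0, 2.0] for _ in range(n_months)]
gcc_dev = [[1.5, 2.0] for _ in range(n_months)]
gchp_ref = [[1.0, 2.0] for _ in range(n_months)]
gchp_dev = [[2.0, 2.5] for _ in range(n_months)]

dod = diff_of_diffs(gchp_dev, gchp_ref, gcc_dev, gcc_ref)
print(len(dod), dod[0])
```

The time dimension is carried through unchanged, so the result has one slice per month rather than the single time found in 1-month benchmark files.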

…ries

The new code makes it easier to edit such that all benchmark results
(GCC vs GCC, GCHP vs GCC, GCHP vs GCHP, and GCHP vs GCC diff-of-diffs)
are in a single results directory rather than spread out between
the GCC and GCHP dev directories.

Signed-off-by: Lizzie Lundgren <[email protected]>
This code needs more work since it causes the plotting to crash in
some instances, such as when a file contains multiple times.

Signed-off-by: Lizzie Lundgren <[email protected]>
This is a requirement for using the resultant xarray dataset as an
argument in sparselt for regridding if the reader uses open_mfdataset to
open the file. Using open_mfdataset results in dask arrays, which are
not compatible with sparselt.


This update also includes minor updates to doc code in benchmark.py as
well as a commented-out block to not make benchmark concentration
plots in parallel.
Signed-off-by: Lizzie Lundgren <[email protected]>
…diffs

This update uses the sparselt package to perform regridding. Regridding
weights must be computed in advance. Currently the regrid file is
assumed to be in weightsdir and has a hard-coded filename. Using this code
should only be necessary when comparing two ref datasets or two dev
datasets that have different resolutions, which is not typical in
benchmarking.

Signed-off-by: Lizzie Lundgren <[email protected]>
Having separate directories for GCC vs GCC and the GCHP comparisons is
still possible by editing run_1yr_fullchem_benchmark.py variables
base_gcc_resultsdir and base_gchp_resultsdir.

Signed-off-by: Lizzie Lundgren <[email protected]>
# Conflicts:
#	benchmark/modules/run_1yr_fullchem_benchmark.py

Signed-off-by: Lizzie Lundgren <[email protected]>
This update fixes problems generating the mass tables in the 1-year
benchmark simulations. However, it causes a problem with the Ox budget
table generation, which needs to be looked at later. The Ox budget code
assumes the restart path is the full path rather than retrieving it from
get_filepath(s).

Signed-off-by: Lizzie Lundgren <[email protected]>
@lizziel lizziel added the topic: Benchmark Plots and Tables Issues pertaining to generating plots/tables from benchmark output label Sep 22, 2022
@lizziel lizziel self-assigned this Sep 22, 2022
lizziel and others added 5 commits September 23, 2022 11:05
Benchmarks include computing OH metrics for both GCHP and GC-Classic.

Signed-off-by: Lizzie Lundgren <[email protected]>
…e 1x1.25

This is a temporary work-around for cubed-sphere to cubed-sphere
regridding not currently working in the benchmark plotting code. It is
only applicable to GCHP vs GCHP 1-yr full chemistry benchmarks because
the upcoming benchmark will compare C24 with C48. It is also only
applicable to level plots (surface and 500 hPa).


Signed-off-by: Lizzie Lundgren <[email protected]>
…functions

gcpy/util.py
- Remove hardwired restart folder file paths in get_filepath() and get_filepaths()

benchmark/1mo_benchmark.yml
benchmark/1yr_fullchem_benchmark.yml
benchmark/1yr_tt_benchmark.yml
- Rename "subdir" tag to "outputs_subdir"
- Added "restarts_subdir" tags

benchmark/modules/run_1yr_fullchem_benchmark.py
benchmark/modules/run_1yr_tt_benchmark.py
benchmark/run_benchmark.py
- Now use "outputs_subdir" to construct paths to the GEOS-Chem OutputDir folders
- Now use "restarts_subdir" to construct paths to the GEOS-Chem restart file folders

Signed-off-by: Bob Yantosca <[email protected]>
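
A minimal sketch of how the two tags might be used to build directory paths, assuming the YAML entry has already been parsed into a dict. Only the tag names "outputs_subdir" and "restarts_subdir" come from the commit message above; the "dir" key, the helper name build_dirs, and all values are hypothetical.

```python
import os

# Hypothetical stand-in for one parsed YAML entry. Only the tag names
# "outputs_subdir" and "restarts_subdir" come from the commit message;
# the "dir" key and every value are illustrative.
gcc_dev = {
    "dir": "GCC_dev",
    "outputs_subdir": "OutputDir",
    "restarts_subdir": "Restarts",
}


def build_dirs(main_dir, entry):
    """Return (outputs_dir, restarts_dir) for one run directory."""
    rundir = os.path.join(main_dir, entry["dir"])
    outputs_dir = os.path.join(rundir, entry["outputs_subdir"])
    restarts_dir = os.path.join(rundir, entry["restarts_subdir"])
    return outputs_dir, restarts_dir


outputs_dir, restarts_dir = build_dirs("/path/to/benchmarks", gcc_dev)
print(outputs_dir)
print(restarts_dir)
```

Reading the folder names from the configuration rather than hardwiring them means data laid out differently only needs a YAML edit.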
gcpy/benchmark.py
- In routine make_column_aod_plots, we set a variable quiet = not verbose,
  but the verbose argument is not used.  This causes verbose output to
  be printed to the screen.  To fix this, we now set verbose=False
  in the argument list.

Signed-off-by: Bob Yantosca <[email protected]>
gcpy/benchmark.py
- Remove duplicate definition of verbose=False
- Make sure that compare_varnames uses quiet=(not verbose) in all
  routines where verbose is passed
- Added verbose=False keyword in the make_benchmark_operations_budget

Signed-off-by: Bob Yantosca <[email protected]>
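
The verbose/quiet pattern described in the two commits above can be sketched as follows. The function names here are hypothetical stand-ins, not the actual GCPy routines; the point is only that the caller takes verbose=False as a keyword and passes quiet=(not verbose) down.

```python
def list_common_names(ref_names, dev_names, quiet=False):
    """Hypothetical helper that prints its result unless quiet is True."""
    common = sorted(set(ref_names) & set(dev_names))
    if not quiet:
        print(f"Common variables: {common}")
    return common


def make_plots_sketch(ref_names, dev_names, verbose=False):
    """Hypothetical plotting routine that accepts verbose as a keyword.

    With verbose in the argument list and defaulting to False, the
    helper stays silent unless the caller asks for output.
    """
    return list_common_names(ref_names, dev_names, quiet=(not verbose))


# Silent by default; prints only when verbose=True is passed.
make_plots_sketch(["O3", "CO"], ["CO", "NO"])
make_plots_sketch(["O3", "CO"], ["CO", "NO"], verbose=True)
```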
@yantosca yantosca changed the title Assorted updates for GEOS-Chem 1-year benchmarking Assorted updates for GEOS-Chem 1-year benchmarking -- closes #163 Sep 26, 2022
These updates were rendered obsolete, as we can now specify the
outputs_subdir and restarts_subdir for gcc/gchp in the YAML files.

benchmark/1mo_benchmark.yml
- Remove "is_pre_14.0" tag from ref:gcc and dev:gcc entries

benchmark/run_benchmark.py:
- Remove gcc_is_pre_14.0 from calls to get_filepath

gcpy/util.py
- Remove gcc_is_pre_14_0 argument

Signed-off-by: Bob Yantosca <[email protected]>
lizziel commented Sep 27, 2022

I am marking this PR for review. @yantosca is concurrently working on benchmark code in GCPy for the upcoming GEOS-Chem 14.0 benchmark, and may still push to this branch. I am marking him as reviewer so that this does not get merged until he gives the ok.

@lizziel lizziel marked this pull request as ready for review September 27, 2022 14:30
gcpy/benchmark.py
- Add new function get_species_database_dir, which takes in a config
  object and returns the path to the directory where the species database
  file (species_database.yml) is located.

benchmark/modules/run_1yr_fullchem_benchmark.py
benchmark/modules/run_1yr_tt_benchmark.py
benchmark/run_benchmark.py
- Now call get_species_database_dir to get the spcdb_dir variable
- Replace "gchp_metname" with "StateMet"

Signed-off-by: Bob Yantosca <[email protected]>
gcpy/benchmark.py
- If we are successful at locating the species database file, then
  print a message.  (We had composed the message as an f-string but
  had never printed it).

Signed-off-by: Bob Yantosca <[email protected]>
benchmark/1mo_benchmark.yml
benchmark/1yr_fullchem_benchmark.yml
benchmark/1yr_tt_benchmark.yml
- Change "spcdb_dir: None" to "spcdb_dir: default".  None is an allowable
  Python keyword but not an allowable YAML keyword.
- Also added &--- YAML headers

Signed-off-by: Bob Yantosca <[email protected]>
yantosca commented Oct 3, 2022

Also added:
15. Allow specification of species database directory

gcpy/benchmark.py
- databaase -> database
- Removed exclamation point at end of successful print message

Signed-off-by: Bob Yantosca <[email protected]>
benchmark/1mo_benchmark.yml
benchmark/1yr_fullchem_benchmark.yml
benchmark/1yr_tt_benchmark.yml
- Update comments for the paths & data sections for consistency

Signed-off-by: Bob Yantosca <[email protected]>
gcc_vs_gcc:run should be True not False

Signed-off-by: Bob Yantosca <[email protected]>
@yantosca yantosca left a comment

Some more updates are needed.

@@ -1,12 +1,13 @@
List of GCPy developers (30 Oct 2020)
List of GCPy developers (29 Sep 2022)

This just updates the list of GCPy authors in advance of version 1.3.0.

@@ -2,7 +2,7 @@ License Agreement for GCPy and related developments
(The MIT "Expat" License, http://opensource.org/licenses/MIT)
==============================================================================

Copyright (c) 2017-2020 GCPy Developers
Copyright (c) 2017-2022 GCPy Developers

Updated the end date to 2022

promote products or services of Licensee, or any third party.

License agreement for matplotlib versions 1.3.0 and later
=========================================================

Added a couple more license agreements for 3rd-party packages.

@@ -10,49 +10,73 @@
# to gcc_dev (not gcc_ref!). This ensures consistency in version names
# when doing GCHP vs GCC diff-of-diffs (mps, 6/27/19)
# =====================================================================
# configuration for 1 month benchmark
#

Grouped comments together for easier readability

# main_dir: High-level directory containing subdirectories with dat
# results_dir: Directory where plots/tables will be created
# weights_dir: Path to regridding weights
# spcdb_dir: Path to species_database.yml. If equal to None, will

You can now specify the path to the species_database.yml file, or use "default" to tell the benchmark scripts to look in one of the Dev folders.
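
One way the "default" sentinel could be handled is sketched below. This is an assumption about the logic, not the actual get_species_database_dir implementation; the helper name, its arguments, and the choice of fallback folder are hypothetical. A string sentinel is used because a plain None in a YAML file is read as the string "None" rather than as Python's None.

```python
def resolve_spcdb_dir(spcdb_dir, dev_rundir):
    """Hypothetical resolver for the species database directory.

    spcdb_dir:  value of the spcdb_dir tag, either a path or "default".
    dev_rundir: a Dev run directory to fall back on (assumed here to be
                where species_database.yml lives in the default case).
    """
    if spcdb_dir == "default":
        return dev_rundir
    return spcdb_dir


# The sentinel falls back to the Dev folder; anything else is used as given.
print(resolve_spcdb_dir("default", "/path/to/GCC_dev"))
print(resolve_spcdb_dir("/my/spcdb", "/path/to/GCC_dev"))
```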

@@ -1646,10 +1627,9 @@ def get_filepath(
col,
date,
is_gchp=False,
gchp_res="00",
gchp_res="c00",

Now include the "c" in GCHP resolution string default values.

@@ -1669,8 +1649,10 @@ def get_filepath(
Set this switch to True to obtain file pathnames to
GCHP diagnostic data files. If False, assumes GEOS-Chem "Classic"

gchp_res: int
Cubed-sphere resolution of GCHP data grid. Only needed for restart files.
gchp_res: str

Updated comment to denote gchp_res is now of type str.

data_list.add(trimmed_path)

# Read next line
# Open file

Pylint identified that we should use "with" when opening a file name for reading.
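
A minimal sketch of the "with" pattern, assuming a text file with one path per line. The function name read_paths is hypothetical; data_list and trimmed_path echo the names visible in the diff context above.

```python
def read_paths(filename):
    """Collect the unique, non-blank lines of a text file into a set."""
    data_list = set()
    # The context manager closes the file even if an exception is
    # raised while reading, so no explicit close() call is needed.
    with open(filename, "r", encoding="utf-8") as ifile:
        for line in ifile:
            trimmed_path = line.strip()
            if trimmed_path:
                data_list.add(trimmed_path)
    return data_list
```

Iterating over the file object also replaces a manual "read next line" loop.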

setup.py Outdated
@@ -92,8 +92,9 @@ def _write_version_file():
packages = find_packages(),
include_package_data=True,
install_requires=["xesmf>=0.2.1", "scipy>=1.3.1", "Cartopy>=0.17.0", "pandas>=0.25.1",
"matplotlib>=3.1.1", "tabulate>=0.8.3", "joblib>=0.17.0", "xbpch>=0.3.5",
"numpy>=1.19.1", "PyPDF2>=1.26.0", "sphinx", "sphinx_rtd_theme", "sphinx-autoapi"],
"matplotlib>=3.1.1", "tabulate>=0.8.3", "joblib>=0.17.0", "xbpch>=0.3.5",

Note, needs to be updated w/ new package versions

self.devrstdir,
"GEOSChem.Restart.{}*.nc4".format(self.y1_str)
)
# Initial restart file

Now use get_filepath to get restart file paths

benchmark/1mo_benchmark.yml
- Change GCC restarts_subdir tags to "Restarts", as this is the new
  name of the restarts subdirectory going forward (not "restarts")

benchmark/1yr_fullchem_benchmark.yml
benchmark/1yr_tt_benchmark.yml
- Add settings for current gcpy_test_data (with comments that these
  can be edited if needed)

Signed-off-by: Bob Yantosca <[email protected]>
The ReadTheDocs build should pick up the proper packages from the
docs/source/requirements.yml file.

Signed-off-by: Bob Yantosca <[email protected]>
setup.py
- Now use the same package versions as in environment.yml.  These now
  are specific package versions which should prevent incompatibility
  errors.

Signed-off-by: Bob Yantosca <[email protected]>
benchmark/1mo_benchmark.yml
benchmark/1yr_fullchem_benchmark.yml
benchmark/1yr_tt_benchmark.yml
- Remove gcc:is_pre_14.0.  This is made obsolete by the fix to
  the get_filepath and get_filepaths routines.

Signed-off-by: Bob Yantosca <[email protected]>
yantosca commented Oct 4, 2022

Also note the changelog updates:

Unreleased

Added

  • New features in benchmarking scripts (@lizziel, @yantosca)
    • Extra print statements (@lizziel)
    • Diff-of-diffs plots for 1-year benchmarks (@lizziel)
    • sparselt is now a GCPy requirement (@lizziel)
    • Add switch for
  • Removed obsolete environment.yml files (@yantosca)
  • Added requirements.yml to docs folder for Sphinx/RTD documentation (@yantosca)
  • New regridding script regrid_restart_file.py (@LiamBindle)

Changed

  • Fixed several issues in benchmarking scripts (@laestrada, @lizziel, @yantosca)
    • Add OMP_NUM_THREADS and OMP_STACKSIZE in plot_driver.sh (@yantosca)
    • Increase requested memory to 50MB in plot_driver.sh (@yantosca)
    • Benchmark scripts print a message upon completion (@yantosca)
    • Linted several benchmarking routines with Pylint (@yantosca)
    • Rewrote algorithm of add_lumped_species_to_dataset for speed (@yantosca)
    • Can now specify the path to species_database.yml for 1yr benchmarks (@yantosca)
    • 1-yr benchmarks now save output in subdirs of the same path (@lizziel)
    • Avoid hardwiring restart file paths in benchmark scripts (@yantosca)
    • Now use outputs_subdir tag from YAML file for paths to diagnostic files (@yantosca)
    • Now use restarts_subdir tag from YAML file for paths to restart files (@yantosca)
    • GCPy now uses proper year for dev in 1-yr benchmarks (@laestrada)
    • Fixed date string issue in benchmarking scripts (@lizziel)
    • Updates for new GCHP restart file format (@lizziel)
  • Updated environment.yml with package versions that work together (@yantosca)
  • Updated the AUTHORS.txt and LICENSE.txt files (@yantosca)

gcpy/benchmark.py
gcpy/budget_ox.py
gcpy/budget_tt.py
gcpy/ste_flux.py
- At the end of each method, manually delete the larger objects
  (such as xarray Datasets) and call gc.collect() to force
  garbage collection. This should hopefully prevent problems with
  the benchmark scripts halting due to exceeding requested memory
  (which seems to happen if all 1-year benchmark artifacts are
  requested).

Signed-off-by: Bob Yantosca <[email protected]>
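
The cleanup pattern described in this commit can be sketched as below. A plain list stands in for a large xarray Dataset, and the function name is hypothetical; gc.collect() is the standard-library call named in the commit message.

```python
import gc


def summarize(n_values):
    """Build a large temporary object, reduce it, then free it explicitly."""
    big = [float(i) for i in range(n_values)]  # stand-in for a large Dataset
    total = sum(big)

    # Drop the reference and ask the collector to run now instead of
    # waiting for a later automatic pass.
    del big
    gc.collect()

    return total


print(summarize(1000))
```

Whether this lowers peak memory in practice depends on what else still holds references to the data, so it is best seen as a mitigation.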
CHANGELOG.md
- Add info about garbage collection

benchmark/plot_driver.sh
- Request 4 hours of run time via SLURM

Signed-off-by: Bob Yantosca <[email protected]>
@@ -6,7 +6,32 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## Unreleased
### Added

Updated the changelog for 1.3.0

@yantosca yantosca left a comment

I approve of these changes, and we can merge into Dev.

Note, there is still a memory issue (see #174) that I believe is related to joblib. The workaround is to submit gcc_vs_gcc, gchp_vs_gcc, gchp_vs_gchp, and gchp_vs_gcc_diff_of_diffs as separate jobs (at least for the 1-year benchmarks).
