
[BUG/ISSUE] 1-year benchmarking scripts eventually run out of memory? #174

Closed · yantosca opened this issue Oct 4, 2022 · 9 comments · Fixed by #164



yantosca commented Oct 4, 2022

It seems that the 1-year benchmarking scripts have a memory leak. If I run with gcc_vs_gcc, gchp_vs_gcc, gchp_vs_gchp, and gchp_vs_gcc_diff_of_diffs all turned on, the script exceeds 50 GB of memory and is killed by the SLURM scheduler.

This might be due to the joblib parallel package. I am currently using joblib==1.0.1.

I could request more memory, but I am wondering whether we should instead try to free memory explicitly. I can try forcing garbage collection to see if it makes a difference.

yantosca added the "category: Bug" label on Oct 4, 2022

yantosca commented Oct 4, 2022

yantosca added the "topic: Benchmark Plots and Tables" label on Oct 4, 2022

yantosca commented Oct 4, 2022

I've manually freed the larger objects (xarray Dataset objects) at the end of each benchmarking subroutine and called gc.collect() to force a garbage-collection pass. This might help. See commit 71cdba9.
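
For reference, the cleanup pattern at the end of each plotting routine looks roughly like this (a minimal sketch with placeholder names, not the actual gcpy code):

import gc
import xarray as xr

def make_benchmark_plots(ref_file, dev_file):
    """Hypothetical plotting routine illustrating the explicit cleanup."""
    refds = xr.open_dataset(ref_file)
    devds = xr.open_dataset(dev_file)

    # ... generate the comparison plots from refds and devds ...

    # Close and drop the references to the large Dataset objects, then
    # force a garbage-collection pass before returning to the driver.
    refds.close()
    devds.close()
    del refds, devds
    gc.collect()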


yantosca commented Oct 4, 2022

The garbage collection seems to allow the 1-year fullchem benchmark scripts to proceed past the GCHP-vs-GCC comparisons. I have a test running now; if it finishes I will close this issue.

This will be included in PR #164 as well.

yantosca self-assigned this on Oct 4, 2022

yantosca commented Oct 4, 2022

Alas, the code still runs out of memory:

Traceback (most recent call last):
  File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 1435, in <module>
    main()
  File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 1431, in main
    choose_benchmark_type(config)
  File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 92, in choose_benchmark_type
    run_1yr_benchmark(
  File "/n/home09/ryantosca/GC/python/gcpy/benchmark/modules/run_1yr_fullchem_benchmark.py", line 988, in run_benchmark
    ref,
  File "/n/home09/ryantosca/python/gcpy/gcpy/benchmark.py", line 1612, in make_benchmark_emis_plots
    results = Parallel(n_jobs=n_job)(
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=28567393.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I will try to increase the amount of memory requested in plot_driver.sh.


yantosca commented Oct 4, 2022

Until this situation is fixed, we recommend creating the GCC vs GCC, GCHP vs GCC, and GCHP vs GCHP comparison plots in separate jobs rather than all in one job.


yantosca commented Oct 5, 2022

This might be an issue in joblib. See https://stackoverflow.com/questions/67495271/joblib-parallel-doesnt-terminate-processes. @laestrada, any ideas on how we can free memory here?
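
One workaround mentioned in that StackOverflow thread is to explicitly shut down joblib's reusable loky worker pool between plotting stages, so that memory held by the worker processes is released rather than kept around for reuse. A minimal sketch (untested here; make_plot is a stand-in, and whether this actually lowers peak memory for our benchmarks is an assumption):

from joblib import Parallel, delayed
from joblib.externals.loky import get_reusable_executor

def make_plot(n):
    # Stand-in for one plotting task (hypothetical).
    return n * n

if __name__ == "__main__":
    # Run one batch of tasks in parallel worker processes (loky backend).
    results = Parallel(n_jobs=8)(delayed(make_plot)(n) for n in range(100))

    # Explicitly tear down joblib's reusable worker pool so the memory
    # held by the worker processes is freed before the next stage starts.
    get_reusable_executor().shutdown(wait=True)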


yantosca commented Oct 5, 2022

There is also this post on the web: https://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html. This might just be inherent to how Python manages memory. From the link:

Long running Python jobs that consume a lot of memory while running may not return that memory to the operating system until the process actually terminates, even if everything is garbage collected properly. That was news to me, but it’s true. What this means is that processes that do need to use a lot of memory will exhibit a “high water” behavior, where they remain forever at the level of memory usage that they required at their peak.

Note: this behavior may be Linux specific; there are anecdotal reports that Python on Windows does not have this problem.

This problem arises from the fact that the Python VM does its own internal memory management. It's commonly known as memory fragmentation. Unfortunately, there doesn't seem to be any fool-proof method of avoiding it.
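
To check whether we are seeing this high-water behavior rather than a true leak, we could log the peak resident memory of the driver process between benchmark stages. A minimal sketch using only the standard library (the stage names are placeholders, and this only covers the current process, not the joblib workers):

import resource

def log_peak_memory(stage):
    # On Linux, ru_maxrss reports the peak resident set size in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{stage}] peak RSS so far: {peak_kib / 1024**2:.2f} GiB")

log_peak_memory("after GCC vs GCC plots")
log_peak_memory("after GCHP vs GCC plots")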


stale bot commented Nov 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

stale bot added the "stale" label on Nov 9, 2022
yantosca added the "never stale" label and removed the "stale" label on Nov 9, 2022
yantosca commented

The Python environment for GCPy 1.4.0 now uses a newer version of joblib (1.3.2) that may be more memory-conserving. Also, requesting 100000 MB (~100 GB) of memory seems to suffice for most benchmark plots. Lastly, setting $OMP_NUM_THREADS to use fewer cores (e.g. 8 instead of 12) may also help to conserve memory.

I will close out this issue, as this appears to be a known limitation of memory management in Python under Linux: memory used by parallel workers is not always returned to the operating system.
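
As an illustration of the last point, the worker count could be capped from $OMP_NUM_THREADS; fewer concurrent workers means fewer copies of the large Dataset objects in memory at once. This is a sketch only; whether the benchmark scripts derive their joblib n_jobs setting from this variable is an assumption:

import os
from joblib import Parallel, delayed

def make_plot(n):
    # Stand-in for one plotting task (hypothetical).
    return n * n

# Cap the number of joblib workers at $OMP_NUM_THREADS (falling back to 8).
n_jobs = int(os.environ.get("OMP_NUM_THREADS", "8"))

if __name__ == "__main__":
    results = Parallel(n_jobs=n_jobs)(delayed(make_plot)(n) for n in range(24))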
