
[BUG/ISSUE] 1-year benchmarking scripts eventually run out of memory? #174

Closed · yantosca opened this issue Oct 4, 2022 · 9 comments · Fixed by #164



yantosca commented Oct 4, 2022

It seems that the 1-year benchmarking scripts have a memory leak. If I run with gcc_vs_gcc, gchp_vs_gcc, gchp_vs_gchp, and gchp_vs_gcc_diff_of_diffs all turned on, the script exceeds 50 GB of memory and is killed by the SLURM scheduler.

This might be due to the joblib parallel package. I am currently using joblib==1.0.1.

I could request more memory, but I am wondering whether we should instead try to free memory explicitly. I can try forcing garbage collection to see if it makes a difference.

yantosca added the "category: Bug" label on Oct 4, 2022

yantosca commented Oct 4, 2022

yantosca added the "topic: Benchmark Plots and Tables" label on Oct 4, 2022

yantosca commented Oct 4, 2022

I've manually freed the larger objects (xarray Dataset objects) at the end of each benchmarking subroutine and called gc.collect() to force a garbage-collection pass. This might help. See commit 71cdba9.
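
For reference, the cleanup pattern at the end of each plotting routine looks roughly like this (a minimal sketch with placeholder names, not the actual gcpy code):

import gc
import xarray as xr

def make_benchmark_plots(ref_file, dev_file):
    """Hypothetical plotting routine illustrating the explicit cleanup."""
    refds = xr.open_dataset(ref_file)
    devds = xr.open_dataset(dev_file)

    # ... generate the comparison plots from refds and devds ...

    # Close and drop the references to the large Dataset objects, then
    # force a garbage-collection pass before returning to the driver.
    refds.close()
    devds.close()
    del refds, devds
    gc.collect()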


yantosca commented Oct 4, 2022

The garbage collection seems to allow the 1-year fullchem benchmark scripts to proceed past the GCHP-vs-GCC comparisons. I have a test running now; if it finishes I will close this issue.

This will be included in PR #164 as well.

yantosca self-assigned this on Oct 4, 2022

yantosca commented Oct 4, 2022

Alas, the code still runs out of memory:

Traceback (most recent call last):
  File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 1435, in <module>
    main()
  File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 1431, in main
    choose_benchmark_type(config)
  File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 92, in choose_benchmark_type
    run_1yr_benchmark(
  File "/n/home09/ryantosca/GC/python/gcpy/benchmark/modules/run_1yr_fullchem_benchmark.py", line 988, in run_benchmark
    ref,
  File "/n/home09/ryantosca/python/gcpy/gcpy/benchmark.py", line 1612, in make_benchmark_emis_plots
    results = Parallel(n_jobs=n_job)(
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=28567393.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I will try to increase the amount of memory requested in plot_driver.sh.


yantosca commented Oct 4, 2022

Until this situation is fixed, we recommend creating the GCC vs GCC, GCHP vs GCC, and GCHP vs GCHP comparison plots in separate jobs rather than all in one job.


yantosca commented Oct 5, 2022

This might be an issue in joblib. See https://stackoverflow.com/questions/67495271/joblib-parallel-doesnt-terminate-processes. @laestrada, any ideas on how we can free memory here?
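
One workaround mentioned in that StackOverflow thread is to explicitly shut down joblib's reusable loky worker pool between plotting stages, so that memory held by the worker processes is released rather than kept around for reuse. A minimal sketch (untested here; make_plot is a stand-in, and whether this actually lowers peak memory for our benchmarks is an assumption):

from joblib import Parallel, delayed
from joblib.externals.loky import get_reusable_executor

def make_plot(n):
    # Stand-in for one plotting task (hypothetical).
    return n * n

if __name__ == "__main__":
    # Run one batch of tasks in parallel worker processes (loky backend).
    results = Parallel(n_jobs=8)(delayed(make_plot)(n) for n in range(100))

    # Explicitly tear down joblib's reusable worker pool so the memory
    # held by the worker processes is freed before the next stage starts.
    get_reusable_executor().shutdown(wait=True)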


yantosca commented Oct 5, 2022

There is also this post on the web: https://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html. This might just be inherent to how Python manages memory. From the link:

Long running Python jobs that consume a lot of memory while running may not return that memory to the operating system until the process actually terminates, even if everything is garbage collected properly. That was news to me, but it’s true. What this means is that processes that do need to use a lot of memory will exhibit a “high water” behavior, where they remain forever at the level of memory usage that they required at their peak.

Note: this behavior may be Linux specific; there are anecdotal reports that Python on Windows does not have this problem.

This problem arises from the fact that the Python VM does its own internal memory management. It's commonly known as memory fragmentation. Unfortunately, there doesn't seem to be any fool-proof method of avoiding it.
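
To check whether we are seeing this high-water behavior rather than a true leak, we could log the peak resident memory of the driver process between benchmark stages. A minimal sketch using only the standard library (the stage names are placeholders, and this only covers the current process, not the joblib workers):

import resource

def log_peak_memory(stage):
    # On Linux, ru_maxrss reports the peak resident set size in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{stage}] peak RSS so far: {peak_kib / 1024**2:.2f} GiB")

log_peak_memory("after GCC vs GCC plots")
log_peak_memory("after GCHP vs GCC plots")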


stale bot commented Nov 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

stale bot added the "stale" label on Nov 9, 2022
yantosca added the "never stale" label and removed the "stale" label on Nov 9, 2022
yantosca commented

The Python environment for GCPy 1.4.0 now uses a newer version of joblib (1.3.2) that may be more memory-conserving. Also, requesting 100000 MB (~100 GB) of memory seems to suffice for most benchmark plots. Lastly, setting $OMP_NUM_THREADS to use fewer cores (e.g. 8 instead of 12) may also help to conserve memory.

I will close out this issue, as this appears to be a known limitation of memory management in Python under Linux: memory used by parallel workers is not always returned to the operating system.
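
As an illustration of the last point, the worker count could be capped from $OMP_NUM_THREADS; fewer concurrent workers means fewer copies of the large Dataset objects in memory at once. This is a sketch only; whether the benchmark scripts derive their joblib n_jobs setting from this variable is an assumption:

import os
from joblib import Parallel, delayed

def make_plot(n):
    # Stand-in for one plotting task (hypothetical).
    return n * n

# Cap the number of joblib workers at $OMP_NUM_THREADS (falling back to 8).
n_jobs = int(os.environ.get("OMP_NUM_THREADS", "8"))

if __name__ == "__main__":
    results = Parallel(n_jobs=n_jobs)(delayed(make_plot)(n) for n in range(24))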
