[BUG/ISSUE] 1-year benchmarking scripts eventually run out of memory? #174
Comments
I've manually freed the larger objects (xarray Dataset objects) at the end of each benchmarking subroutine and forced a garbage-collection pass.

The garbage collection seems to allow the 1-year fullchem benchmark scripts to proceed past the GCHP-vs-GCC comparisons. I have a test running now; if it finishes, I will close this issue. This fix will be included in PR #164 as well.
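In code, the pattern is roughly this (a minimal sketch; the function and variable names are illustrative, not the actual GCPy routines):

```python
import gc
import xarray as xr

def make_benchmark_plots(ref_path, dev_path):
    """Illustrative benchmarking subroutine (not the actual GCPy code)."""
    refds = xr.open_dataset(ref_path)  # large xarray Dataset
    devds = xr.open_dataset(dev_path)  # large xarray Dataset

    # ... create the comparison plots from refds/devds here ...

    # Drop the references to the large Dataset objects and force a
    # garbage-collection pass, so that memory is released before the
    # next benchmarking subroutine runs.
    del refds, devds
    gc.collect()
```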
Alas, the code runs out of memory:

Traceback (most recent call last):
File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 1435, in <module>
main()
File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 1431, in main
choose_benchmark_type(config)
File "/n/holyscratch01/jacob_lab/ryantosca/bmk/./run_benchmark.py", line 92, in choose_benchmark_type
run_1yr_benchmark(
File "/n/home09/ryantosca/GC/python/gcpy/benchmark/modules/run_1yr_fullchem_benchmark.py", line 988, in run_benchmark
ref,
File "/n/home09/ryantosca/python/gcpy/gcpy/benchmark.py", line 1612, in make_benchmark_emis_plots
results = Parallel(n_jobs=n_job)(
File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/n/home09/ryantosca/miniconda3/envs/gcpy/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=28567393.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I will try to increase the amount of memory requested for the job.
Until this situation is fixed, we recommend running the GCC vs. GCC, GCHP vs. GCC, and GCHP vs. GCHP comparisons as separate jobs instead of as one single job.
This might be an issue in joblib. See https://stackoverflow.com/questions/67495271/joblib-parallel-doesnt-terminate-processes. @laestrada, any ideas on how we can free memory here?
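One pattern that can reduce the per-worker footprint (a sketch of general joblib usage, not the actual gcpy/benchmark.py code; the file and species names are made up): pass lightweight arguments such as file paths to the workers and let each worker open and release its own Dataset, with a smaller `n_jobs` bounding how many copies of the data exist at once.

```python
from joblib import Parallel, delayed
import xarray as xr

def plot_one_species(filename, species):
    # Each worker opens (and releases) its own Dataset, so large
    # objects are never shipped from the parent to the workers.
    with xr.open_dataset(filename) as dset:
        pass  # ... make the plot for this species from dset ...

# Fewer workers means fewer simultaneous copies of the data in memory.
results = Parallel(n_jobs=8)(
    delayed(plot_one_species)("emissions.nc4", spc)
    for spc in ["NO", "CO", "SO2"]
)
```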
There is also this post on the web: https://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html. This might just be how Python behaves.
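Along the lines of that post, the standard-library tracemalloc module is a quick way to see which allocation sites are holding memory between benchmark stages (a diagnostic sketch, not part of the benchmark scripts):

```python
import tracemalloc

tracemalloc.start()

# ... run one benchmarking stage here ...

# List the ten source lines holding the most allocated memory.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```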
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.
The Python environment for GCPy 1.4.0 now uses a newer version of joblib (1.3.2) that may be more memory-conserving. Also, requesting 100000 MB (about 100 GB) of memory seems to suffice for most benchmark plots. Lastly, setting $OMP_NUM_THREADS to use fewer cores (e.g., 8 instead of 12) may also help to conserve memory. I will close out this issue, as it is a known problem that memory management in Python under Linux can leak memory when using parallel threads.
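For reference, the thread-limiting idea looks like this (a sketch; $OMP_NUM_THREADS is normally exported in the SLURM run script before Python even starts, and the value 8 is just an example):

```python
import os

# Must be set before NumPy/xarray initialize their thread pools,
# i.e. before those modules are imported.
os.environ["OMP_NUM_THREADS"] = "8"

import numpy as np  # noqa: E402 (imported after the env var is set)
```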
It seems that the 1-year benchmarking scripts have a memory leak. If I run with gcc_vs_gcc, gchp_vs_gcc, gchp_vs_gchp, and gchp_vs_gcc_diff_of_diffs all turned on, the script exceeds 50 GB of memory and is killed by the SLURM scheduler.
This might be due to the joblib parallel package. I am currently using joblib==1.0.1.
I could request more memory, but I am wondering if we should instead try to find out how to free memory. I can try forcing garbage collection to see if it makes a difference.