MEMORY: Garbage collect future process after value has been collected #69

Closed
HenrikBengtsson opened this issue Apr 26, 2016 · 3 comments

Comments

@HenrikBengtsson
Collaborator

Issue

When using multiprocess futures that rely on PSOCK cluster nodes, background R sessions, or forked processes ("multicore"), large objects may be left behind in those processes after we've collected the value. The processes stay alive in the background. Thus, if we run, say, 20 processes and 19 of them finish early while one keeps processing for a long time thereafter, we occupy memory (RAM) unnecessarily because of those 19 processes.

Suggestion

After retrieving the value of a future:

  1. clean up the working environment, and
  2. launch the garbage collector explicitly

in the R process where the future was resolved.

We already do Step 1 before launching new futures for some of the multiprocess future types. We don't garbage collect explicitly anywhere. Also, Step 1 should not be done for "persistent" futures (persistent=TRUE).
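
As a rough illustration of the two steps, here is a minimal base-R sketch. The helper name `cleanup_worker()` and its `persistent` argument are hypothetical and not part of the future package; the sketch only assumes that the worker's working environment is available as an ordinary R environment.

```r
# Hypothetical helper (not part of the future package): the two steps
# applied to a given working environment.
cleanup_worker <- function(envir, persistent = FALSE) {
  # Step 1: remove all objects from the working environment,
  # skipped for "persistent" futures as noted above.
  if (!persistent) {
    rm(list = ls(envir = envir, all.names = TRUE), envir = envir)
  }
  # Step 2: explicitly run the garbage collector.
  invisible(gc())
}

# Usage: an environment holding a large leftover object.
env <- new.env()
assign("big", numeric(1e6), envir = env)
cleanup_worker(env)
remaining <- ls(envir = env, all.names = TRUE)

# With persistent = TRUE the objects are kept; only gc() is run.
env_persistent <- new.env()
assign("state", 1:10, envir = env_persistent)
cleanup_worker(env_persistent, persistent = TRUE)
```

Whether such a helper can actually be invoked on the worker after the value has been retrieved depends on the future type, which is the open question raised next.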

It is not clear to me if it is possible to run code after the value has been retrieved for all types of futures. This might be an issue.

@HenrikBengtsson
Collaborator Author

The simplest solution might be to wrap the future expression expr in something like:

expr_gc <- {
  value <- local(expr)
  gc()
  value
}

This can be used when local=TRUE and persistent=FALSE, which is the most common use case.
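
One way the wrapping could be done on an unevaluated expression is sketched below using base R's `bquote()`. The helper name `wrap_with_gc()` is hypothetical, and this is not necessarily how the package implements it; it only shows that the wrapped expression returns the same value while temporaries created inside `local()` do not leak into the evaluation environment.

```r
# Hypothetical helper: given an unevaluated expression, build a new
# expression that evaluates it locally, runs gc(), and returns the value.
wrap_with_gc <- function(expr) {
  bquote({
    value <- local(.(expr))
    gc()
    value
  })
}

# An expression that creates a large temporary object.
expr <- quote({
  x <- numeric(1e6)
  length(x)
})

expr_gc <- wrap_with_gc(expr)

# Evaluate in a fresh environment, standing in for the worker's environment.
env <- new.env()
result <- eval(expr_gc, envir = env)
```

Because `x` is created inside `local()`, it becomes unreachable once the value has been computed, so the subsequent `gc()` call is free to reclaim it.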

Should we add a gc=TRUE argument to all futures to be able to control this? For instance, futures evaluated in the current R process may have gc=FALSE by default, whereas those evaluated in external processes have gc=TRUE by default.

@HenrikBengtsson
Collaborator Author

Added gc=FALSE for all futures.

For now, we'll leave it up to the user to specify it, which is somewhat sub-optimal, but good enough until proven useful/needed. Also, it might be that some futures should be garbage collected whereas others should not. The developer can control this using:

x <- future({ expr }, gc=TRUE)
x %<-% { expr } %tweak% list(gc=TRUE)

@HenrikBengtsson
Collaborator Author

If/when we collect memory and timing stats per future (Issue #59), we could use these to decide whether running the garbage collector is necessary (e.g. enough memory was allocated) and/or would take only a small fraction of the future's total evaluation time (i.e. for long-running futures, the time that the garbage collector consumes will be relatively small).

For instance, we can have options controlling when the garbage collector should be run, e.g.

  • options(future.gc="auto") # Alternatives, FALSE and TRUE
  • options(future.gc.threshold.time=30) # >= 30 seconds
  • options(future.gc.threshold.memory=100) # >= 100 MiB more RAM allocated since start.
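
To make the proposal concrete, here is a minimal sketch of how such options might be consulted. The option names are the ones proposed above and may not exist in the package; the helper `should_gc()` is hypothetical, and treating the two thresholds as alternatives (either one triggers collection) is an assumption, since the proposal leaves "and/or" open.

```r
# Hypothetical decision helper based on the proposed options.
# elapsed_secs:  evaluation time of the future, in seconds
# allocated_mib: RAM allocated since the future started, in MiB
should_gc <- function(elapsed_secs, allocated_mib) {
  mode <- getOption("future.gc", FALSE)

  # TRUE/FALSE: always or never run the garbage collector.
  if (is.logical(mode)) return(isTRUE(mode))

  if (!identical(mode, "auto")) {
    stop("Unknown value of option 'future.gc': ", mode)
  }

  # "auto": collect if either threshold is reached (assumption).
  time_threshold <- getOption("future.gc.threshold.time", 30)
  memory_threshold <- getOption("future.gc.threshold.memory", 100)
  elapsed_secs >= time_threshold || allocated_mib >= memory_threshold
}

# Usage with the settings listed above.
old_opts <- options(
  future.gc = "auto",
  future.gc.threshold.time = 30,
  future.gc.threshold.memory = 100
)
long_running <- should_gc(elapsed_secs = 45, allocated_mib = 10)
memory_heavy <- should_gc(elapsed_secs = 1, allocated_mib = 250)
small_quick <- should_gc(elapsed_secs = 1, allocated_mib = 10)

options(future.gc = FALSE)
disabled <- should_gc(elapsed_secs = 45, allocated_mib = 250)
options(old_opts)
```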
