nested futures that doesn't terminate #348

EmilRehnberg opened this issue Nov 7, 2019 · 9 comments


I am having issues with nested future calls.

Here I'm using mlr3 and furrr, both of which use the future package. When I use them together, the parallelized calls fail to terminate.

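library(magrittr)      # provides %>%
library(mlr3learners)  # registers the xgboost learners with mlr3

dtas_lst <-
  list(
    dta1 = iris %>% dplyr::select(-Species) %>% utils::head(75),
    dta2 = iris %>% dplyr::select(-Species) %>% utils::tail(75)
  )

run_xgb <- function(dta, target_column = "Sepal.Length") {
  ttsk <- mlr3::TaskRegr$new("ttsk", backend = dta, target = target_column)
  learner <-
    mlr3::lrn(
      "regr.xgboost",  # assumed: the constructor line is missing from the post
      objective = "reg:linear",
      eval_metric = "rmse",
      nrounds = 100,
      verbose = 0
    )
  # The rest of the function body is missing from the post; presumably it
  # trained or resampled the learner on ttsk.
}

furrr::future_map(dtas_lst, run_xgb, .progress = TRUE) # works!
future::plan(list(future::multiprocess, future::sequential))
furrr::future_map(dtas_lst, run_xgb, .progress = TRUE) # doesn't terminate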

I've had cases where the parallel run works once during a session but then fails to terminate on a rerun.

Sometimes when closing down the session (after manually terminating the parallel run), I get the below message

# Error while shutting down parallel: unable to terminate some child processes

Upon googling this, I found that I could do the following check.

parallel::mcparallel(scan(n = 1, quiet = TRUE))
# $pid
# [1] 32763

# $fd
# [1] 21 24

# attr(,"class")
# [1] "parallelJob"  "childProcess" "process"

Does anyone have any solutions to this? Is this perhaps a bug? Related to future or perhaps parallel or other package? I was having the same issue when I was running the mlr package.

"#75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019"
Are you saying it hangs when you use:

future::plan(list(future::multiprocess, future::sequential))

but not

future::plan(future::multiprocess)

which effectively should be identical because when no more plans are specified, it'll fall back to using sequential?
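
For illustration, the equivalence can be checked by asking how many workers each nesting level sees. This is a minimal sketch, assuming the future package is installed. It uses multisession in place of multiprocess (which later releases of future deprecated), and workers_by_level() is a hypothetical helper, not part of any package.

```r
library(future)

# Hypothetical helper: number of workers available at the top level
# ("outer") and from inside a future ("inner").
workers_by_level <- function() {
  outer <- nbrOfWorkers()
  inner <- value(future(future::nbrOfWorkers()))
  c(outer = outer, inner = inner)
}

# Second level spelled out explicitly
plan(list(tweak(multisession, workers = 2L), sequential))
explicit <- workers_by_level()

# Second level left unspecified, so it falls back to sequential
plan(multisession, workers = 2L)
implicit <- workers_by_level()

plan(sequential)  # shut down the background workers

print(explicit)
print(implicit)
```

If both vectors agree, the two plans resolve nested futures the same way, so any difference in behaviour would have to come from somewhere else.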

Also, what's your sessionInfo()?

Using plan(sequential) works well.
Using any parallelization does not. The furrr progress bar goes to 100%, and then the call hangs and never seems to finish.

Thank you for the quick reply. Can you reproduce this?

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] mlr3learners_0.1.4 mlr3_0.1.4-9000    nvimcom_0.9-83     setwidth_1.0-4
[5] devtools_2.2.1     usethis_1.5.1      magrittr_1.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2          pillar_1.4.2        compiler_3.6.1
 [4] prettyunits_1.0.2   R.methodsS3_1.7.1   mlr3misc_0.1.5-9000
 [7] R.utils_2.9.0       remotes_2.1.0       tools_3.6.1
[10] testthat_2.2.1      digest_0.6.22       pkgbuild_1.0.6
[13] pkgload_1.0.2       uuid_0.1-2          evaluate_0.14
[16] tibble_2.1.3        memoise_1.1.0       checkmate_1.9.4
[19] pkgconfig_2.0.3     rlang_0.4.1         reprex_0.3.0.9000
[22] cli_1.1.0           parallel_3.6.1      xfun_0.10
[25] knitr_1.25          withr_2.1.2         dplyr_0.8.3
[28] globals_0.12.4      desc_1.2.0          fs_1.3.1
[31] tidyselect_0.2.5    rprojroot_1.3-2     glue_1.3.1
[34] data.table_1.12.6   listenv_0.7.0       R6_2.4.0
[37] processx_3.4.1      rmarkdown_1.16      sessioninfo_1.1.1
[40] clipr_0.7.0         whisker_0.4         purrr_0.3.3
[43] callr_3.3.2         lgr_0.3.3           htmltools_0.4.0
[46] codetools_0.2-16    backports_1.1.5     ps_1.3.0
[49] ellipsis_0.3.0      assertthat_0.2.1    future_1.14.0
[52] paradox_0.1.0-9000  crayon_1.3.4        R.oo_1.23.0

Also, I think I forgot to mention that if I do furrr and mlr3 parallel runs without nesting, everything works great.

I got a segfault when I ran the parallelized run on a Mac. (The sequential run was fine.)

Mac session info

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] purrr_0.3.2.9000    mlr3learners_0.1.4  mlr3_0.1.4          nvimcom_0.9-83      setwidth_1.0-4      devtools_2.0.1      usethis_1.4.0
[8] magrittr_1.5.0.9000

loaded via a namespace (and not attached):
 [1] progress_1.2.2.9000    tidyselect_0.2.5       remotes_2.0.2          listenv_0.7.0          lattice_0.20-38        vctrs_0.2.0.9002
 [7] testthat_2.0.1         paradox_0.1.0          rlang_0.4.0.9002       pkgbuild_1.0.3         R.oo_1.22.0            pillar_1.4.2.9001
[13] glue_1.3.1.9000        withr_2.1.2.9000       R.utils_2.8.0          xgboost_0.90.0.2       sessioninfo_1.1.1      uuid_0.1-2
[19] R.methodsS3_1.7.1-9000 future_1.15.0          codetools_0.2-15       memoise_1.1.0          callr_3.3.1.9000       ps_1.3.0
[25] parallel_3.6.1         fansi_0.4.0            furrr_0.1.0            Rcpp_1.0.2             backports_1.1.3        checkmate_1.9.4
[31] desc_1.2.0             pkgload_1.0.2          fs_1.3.1               hms_0.4.2              digest_0.6.20          stringi_1.4.3
[37] processx_3.4.1.9000    dplyr_0.8.3.9000       rprojroot_1.3-2        grid_3.6.1             cli_1.9.9.9000         tools_3.6.1
[43] tibble_2.99.99.9005    mlr3misc_0.1.5         crayon_1.3.4           pkgconfig_2.0.2        zeallot_0.1.0          Matrix_1.2-17
[49] data.table_1.12.3      prettyunits_1.0.2      assertthat_0.2.1       lgr_0.3.3              R6_2.4.0               globals_0.12.4
[55] compiler_3.6.1

Hi. I haven't had time to reproduce/troubleshoot your original problem, but I suspect you're hitting issues because forked processes are in play (= multiprocess -> multicore) and some of the packages are not fork-proof. On top of that, you might hit issues because there's multi-threading going on too, which is then forked (e.g. xgboost). I don't have a reference offhand, but I think some of the R core folks (Tomas Kalibera?) have strongly advised against using forking and multi-threading at the same time.

I would switch to using multisession. That will most likely reveal that there's code in there that can not be parallelized.

Just to clarify, this is not specific to the future framework. So, if the fork+threading is the cause of your problem, then there is no solution to it. Instead, you need to turn to workarounds, such as launching xgboost on each worker independently of the others.
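
To make the difference between forked and fresh workers concrete, here is a small sketch using only the base parallel package. It is an illustration, not a diagnosis of this issue: forked children start as copies of the parent process and inherit its state, whereas PSOCK workers (the kind of background R sessions that multisession relies on) start from a clean slate. The variable name parent_only_value is made up for the example.

```r
library(parallel)

parent_only_value <- "defined in the parent session"

# Forked children start as copies of the parent, so they can see
# parent_only_value without it being exported. Forking is not
# available on Windows, so this part is skipped there.
if (.Platform$OS.type == "unix") {
  forked <- mclapply(1:2, function(i) exists("parent_only_value"),
                     mc.cores = 2L)
  print(unlist(forked))
}

# PSOCK workers are separate, freshly started R processes, so nothing
# from the parent is there unless it is exported explicitly.
cl <- makeCluster(2L)
fresh <- clusterEvalQ(cl, exists("parent_only_value"))
stopCluster(cl)
print(unlist(fresh))
```

The same inheritance that makes forking convenient is what can cause trouble when the parent has already started threads or holds other process-specific state.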

Oh... and I forgot to say, set

options(future.globals.onReference = "error")

that will help identify cases where there're non-exportable objects in the parallelized code, cf.

Thank you so much for the feedback @HenrikBengtsson
So this is a well-known factor for you, then. Should I add a mention of xgboost to your docs?

I wouldn't call it a well-known factor for me. I've only heard anecdotal reports from various savvy R users, but I don't have a solid example yet (at least not one I recall).

I do not know for sure that xgboost is to blame here, so I don't really want to add it to the docs without further proof. It would be great to get a minimal reproducible (as far as possible) example showing how it can happen. If one then can force xgboost to run in single-threaded mode, and it does not fail then, then we have a likely candidate for this and for documenting it. To force single-threaded processing, it could be that some of the comments in #255 are helpful.

A lower hanging fruit might be:

options(future.globals.onReference = "error")

If that's the culprit and one can find a reasonable argument for that being the problem, then it could be added to the list of common examples. (Some onReference errors are false positives because the underlying package was designed to handle lost external pointers).
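
As a sketch of what this option catches, assuming the future package is installed: a file connection is a typical non-exportable object, because it wraps a reference that is only valid in the R session that created it. With the option set, using the connection inside a future should raise an error up front instead of misbehaving on the worker.

```r
library(future)

plan(multisession, workers = 2L)
old_opts <- options(future.globals.onReference = "error")

# A connection only makes sense in the session that opened it
log_file <- tempfile(fileext = ".log")
con <- file(log_file, open = "w")

# Capture the condition rather than letting it stop the script
res <- tryCatch(
  value(future({
    cat("hello\n", file = con)
    42
  })),
  error = function(e) e
)

# An error condition naming the offending global is expected here
print(res)

# Clean up
close(con)
unlink(log_file)
options(old_opts)
plan(sequential)
```

Whether such an error points at a real problem still has to be judged per package, given the false positives mentioned above.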

I can reproduce the "stall" on R 3.6.1 on Linux (Ubuntu 18.04):


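library(magrittr)      # provides %>%
library(mlr3learners)  # registers the xgboost learners with mlr3

run_xgb <- function(dta, target_column = "Sepal.Length") {
  ttsk <- mlr3::TaskRegr$new("ttsk", backend = dta, target = target_column)
  learner <-
    mlr3::lrn(
      "regr.xgboost",  # assumed: the constructor line is missing from the post
      objective = "reg:linear",
      eval_metric = "rmse",
      nrounds = 100,
      verbose = 0
    )
  # The rest of the function body is missing from the post.
}

dtas_lst <-
  list(
    dta1 = iris %>% dplyr::select(-Species) %>% utils::head(75),
    dta2 = iris %>% dplyr::select(-Species) %>% utils::tail(75)
  )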

future::plan("multicore", workers = 2L)

## assert all globals can be exported
options(future.globals.onReference = "error")  ## just in case

y <- furrr::future_map(dtas_lst, run_xgb, .progress = TRUE)

Disabling multi-threading for xgboost using:


seems to fix the problem.

However, for me it also works to set something less than the default eight (8) max threads. Even:


works for me. That confuses me.
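
For anyone wanting to try a single-threaded setup, here are two ways the thread count might be capped. Neither is confirmed to be the exact call used in this comment; both are assumptions for illustration. omp_set_num_threads() from the RhpcBLASctl package limits OpenMP threads for the whole R process, and nthread is xgboost's own thread-count parameter, passed here through the mlr3 learner.

```r
# Option 1 (assumes the RhpcBLASctl package is installed):
# cap the number of OpenMP threads for the whole R process.
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::omp_set_num_threads(1L)
}

# Option 2 (assumes mlr3, mlr3learners and xgboost are installed):
# ask xgboost itself to use a single thread via its 'nthread' parameter.
if (requireNamespace("mlr3learners", quietly = TRUE) &&
    requireNamespace("xgboost", quietly = TRUE)) {
  learner <- mlr3::lrn(
    "regr.xgboost",
    nrounds = 100,
    verbose = 0,
    nthread = 1L
  )
}
```

The second option only affects the learner it is set on, which makes it easier to compare different thread counts within one session.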

See also

There's a related xgboost issue over at topepo/caret#1106 (comment).

