Nested Futures Use More Memory Than They Should #709
Comments
A little more information. I don't know much about memory profiling, so apologies if this is not the best way to present the information, but in the hopes it might be helpful...
Update: I fixed my problem. Turns out I was not familiar with how future handles environments. See https://furrr.futureverse.org/articles/gotchas.html and https://furrr.futureverse.org/articles/carrier.html.
I think I ran into the same problem, or at least a very similar one. Apologies for the somewhat convoluted data reconstruction, but it's a simulation of the data that I used when I first encountered it. Here's my reprex:
library(future)
library(furrr)
library(purrr)
logistic_model <- function(feature, df_other_vars, formula) {
  df <- dplyr::bind_cols(df_other_vars, "x" = feature)
  m <- glm(formula(formula),
           data = df,
           family = binomial(logit))
  return(m)
}
nested_map <- function(imputed_versions_feature, ...) {
  models <- imputed_versions_feature |>
    purrr::map(\(imputed_version_feature)
               logistic_model(feature = imputed_version_feature, ...))
  return(models[1]) # originally mice::pool call, but not necessary for demonstration
}
gen_names <- function(n = 1) {
  mz <- runif(min = 10, max = 200, n = n) |> signif(7)
  rt <- runif(min = 0, max = 12, n = n) |> signif(7)
  string <- glue::glue("X{mz}_{rt}")
  return(string)
}
gen_x <- function(dummy, nr_imputations = 60, n = 1000) {
  x <- replicate(nr_imputations, rnorm(n)) |> tibble::as_tibble()
}
list_of_feature_dfs <- gen_names(1024) |>
  tibble::as_tibble() |>
  tidyr::pivot_wider(names_from = value) |>
  purrr::map(gen_x)
df <- tibble::tibble(y = rbinom(1000, 1, 0.5))
seed <- 1309
set.seed(seed)
furrr_options <- furrr::furrr_options(seed = seed)
future::plan(future::multisession, workers = 16)
# no problems
r <- list_of_feature_dfs |>
  furrr::future_map(\(feature) nested_map(imputed_versions_feature = feature,
                                          df_other_vars = df,
                                          formula = 'y ~ x'),
                    .progress = TRUE,
                    .options = furrr_options)
# same as above but via function call: CPUs never really get going, memory keeps increasing, doesn't finish
wrapper <- function(list_of_feature_dfs, formula, df_other_vars, furrr_options) {
  results <- list_of_feature_dfs %>%
    furrr::future_map(\(feature) nested_map(imputed_versions_feature = feature,
                                            df_other_vars = df_other_vars,
                                            formula = formula),
                      .progress = TRUE,
                      .options = furrr_options)
  return(results)
}
r_function <- wrapper(list_of_feature_dfs, 'y ~ x', df, furrr_options = furrr_options)
Session info:
I think the nested map is not the main culprit for me. It's when I put the future call inside a function that I really run into this issue: the CPUs never really get going, but memory usage keeps increasing. In fact, I run out of 32 GB of memory before the code is close to finishing. I have been able to reproduce this consistently across three different machines (Windows, Windows Server, Docker container running Ubuntu via WSL). Any ideas? Or anything I should look into? Thanks!
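For anyone hitting the same thing, the environment behaviour mentioned in the update at the top of this comment can be illustrated with base R alone. The following is only a minimal sketch, not the reprex above and not how furrr measures anything: `serialized_size`, `make_heavy`, and `make_lean` are hypothetical names made up for the illustration. It shows that a function created inside another function carries that function's call frame along when it is serialized, which is roughly what happens when a function is shipped to a worker process.

```r
# Hypothetical helper: number of bytes needed to serialize an object.
serialized_size <- function(x) length(serialize(x, NULL))

# The returned function never uses `big`, but its enclosing environment
# is the call frame of make_heavy(), which holds `big`.
make_heavy <- function(big) {
  force(big)  # evaluate the argument so the frame holds the actual data
  function(i) i + 1
}

# Same function, but its environment is reset to the global environment,
# so the call frame (and `big`) is no longer attached to it.
make_lean <- function(big) {
  force(big)
  f <- function(i) i + 1
  environment(f) <- globalenv()
  f
}

big <- rnorm(1e6)  # one million doubles, about 8 MB
heavy <- make_heavy(big)
lean <- make_lean(big)

serialized_size(heavy)  # includes the captured copy of `big`
serialized_size(lean)   # only the function itself
```

Resetting the environment by hand is just one way to drop the captured frame, and it only works when the function does not need anything from that frame. The carrier article linked above describes `carrier::crate()` as a more controlled way to state exactly which objects travel with the function.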
I've been running code with nested loops that keeps running into issues with memory usage and I have been trying to come up with a small example that potentially shows the problem. In the example I am just taking a random square matrix and creating a list of the columns. Obviously you wouldn't use a double loop to do this in R but it is hopefully a simple and clear example that shows when using purrr the double loop doesn't increase memory usage while with furrr and future.apply the memory usage explodes.
As you can see, using the double loop actually decreases memory usage for purrr, although it stays very similar, but causes memory usage to explode for furrr and future.apply. I ran this example on a 2023 MacBook, but the actual code that I am trying to fix has been running on a Linux cluster. I ran this example using furrr and future.apply because yesterday I logged a bug report about nested loops using future.callr and @HenrikBengtsson pointed out that it was only an issue with furrr.
Please let me know if there is any additional information I can provide or help I can give in solving this issue and thanks for the wonderful collection of packages!