Dynamic branching #1042

Merged
merged 169 commits into from
Nov 3, 2019

Conversation

wlandau
Member

@wlandau wlandau commented Nov 3, 2019

Summary

At long last, drake is capable of dynamic branching! You can now define new targets while make() is running.

Here is a demo that fits GDP per capita against year for each country in the Gapminder dataset. We let the data decide how many targets we have and what those targets mean.

library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(gapminder) # https://github.com/jennybc/gapminder

# We need a function to fit the model.
fit_model <- function(dataset, country) {
  dataset %>%
    filter(country == !!country) %>% # The !! is important.
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(country = !!country)
}

plan <- drake_plan(
  # This dataset can change, and we want
  # the downstream targets to update.
  dataset = gapminder,
  
  # We need a grouping variable for our dynamic
  # targets (e.g. models).
  country = unique(dataset$country),
  
  # Fit GDP vs year for each country.
  model = target(
    fit_model(dataset, country),
    dynamic = map(country)
  ),
  
  # Get the continent of each model's country.
  continent = dataset %>%
    distinct(continent, country) %>%
    pull(continent),
  
  # Aggregate the models by continent.
  results = target(
    bind_rows(model) %>%
      mutate(continent = continent),
    dynamic = combine(model, .by = continent)
  )
)

# The targets have cryptic names,
# but there are good reasons to use hashes as suffixes.
# I will discuss if someone wants to.
make(plan)
#> target dataset
#> target country
#> target continent
#> subtarget model_94c60fa4
#> subtarget model_d80d49db
#> subtarget model_1dc4c7bc
#> subtarget model_63e3125f
#> subtarget model_29837144
#> subtarget model_cdf5fd51
#> subtarget model_4c0e0b94
#> subtarget model_973db4f8
#> subtarget model_d61b934b
#### Lots more models, manually cropped.
#> subtarget results_d8941ed2
#> subtarget results_57a69204
#> subtarget results_100845a9
#> subtarget results_265c50a5
#> subtarget results_f4c57f21

# loadd() and readd() understand dynamic targets,
# so you do not need to deal with those funny names.
readd(results, subtargets = seq_len(2))
#> [[1]]
#> # A tibble: 66 x 7
#>    term          estimate std.error statistic  p.value country    continent
#>    <chr>            <dbl>     <dbl>     <dbl>    <dbl> <fct>      <fct>    
#>  1 (Intercep…    1675.      3747.       0.447  6.64e-1 Afghanist… Asia     
#>  2 year            -0.441      1.89    -0.233  8.21e-1 Afghanist… Asia     
#>  3 (Intercep… -535201.     68897.      -7.77   1.52e-5 Bahrain    Asia     
#>  4 year           280.        34.8      8.03   1.14e-5 Bahrain    Asia     
#>  5 (Intercep…  -19975.      4837.      -4.13   2.05e-3 Bangladesh Asia     
#>  6 year            10.5        2.44     4.30   1.57e-3 Bangladesh Asia     
#>  7 (Intercep…  -30194.      7822.      -3.86   3.16e-3 Cambodia   Asia     
#>  8 year            15.6        3.95     3.95   2.75e-3 Cambodia   Asia     
#>  9 (Intercep… -127509.     24515.      -5.20   4.01e-4 China      Asia     
#> 10 year            65.2       12.4      5.26   3.67e-4 China      Asia     
#> # … with 56 more rows
#> 
#> [[2]]
#> # A tibble: 60 x 7
#>    term      estimate std.error statistic  p.value country        continent
#>    <chr>        <dbl>     <dbl>     <dbl>    <dbl> <fct>          <fct>    
#>  1 (Interc…   -1.05e5   23496.      -4.45 1.24e- 3 Albania        Europe   
#>  2 year        5.45e1      11.9      4.59 9.98e- 4 Albania        Europe   
#>  3 (Interc…   -1.04e6   20543.     -50.5  2.23e-13 Austria        Europe   
#>  4 year        5.35e2      10.4     51.5  1.84e-13 Austria        Europe   
#>  5 (Interc…   -8.98e5   24093.     -37.3  4.59e-12 Belgium        Europe   
#>  6 year        4.64e2      12.2     38.1  3.69e-12 Belgium        Europe   
#>  7 (Interc…   -1.92e5   27479.      -7.00 3.73e- 5 Bosnia and He… Europe   
#>  8 year        9.89e1      13.9      7.12 3.20e- 5 Bosnia and He… Europe   
#>  9 (Interc…   -2.07e5   46116.      -4.48 1.17e- 3 Bulgaria       Europe   
#> 10 year        1.08e2      23.3      4.62 9.48e- 4 Bulgaria       Europe   
#> # … with 50 more rows

# The graphs for large workflows are much simpler to look at now.
config <- drake_config(plan)
vis_drake_graph(config)

# Side note: plans have a plot method now.
plot(plan)

Created on 2019-11-03 by the reprex package (v0.3.0)
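
For readers new to the map-then-combine shape of the plan above, here is a rough base R analogy. It is only a sketch of the data flow, not how drake implements dynamic branching, and it swaps Gapminder for the built-in ChickWeight data (chicks play the role of countries, diets the role of continents) so that it runs without extra packages.

```r
# Analogy only: base R, no drake. ChickWeight ships with R.
dataset <- as.data.frame(datasets::ChickWeight)
dataset$Chick <- as.character(dataset$Chick)
dataset$Diet <- as.character(dataset$Diet)

# Fit weight vs Time for one chick and return a small coefficient table.
fit_model <- function(dataset, chick) {
  fit <- lm(weight ~ Time, data = dataset[dataset$Chick == chick, ])
  data.frame(
    term = names(coef(fit)),
    estimate = unname(coef(fit)),
    chick = chick,
    stringsAsFactors = FALSE
  )
}

# "map": one sub-result per element of the grouping vector.
chicks <- unique(dataset$Chick)
models <- lapply(chicks, function(chick) fit_model(dataset, chick))

# One diet per chick, in the same order as `chicks`.
diets <- dataset$Diet[match(chicks, dataset$Chick)]

# "combine by": bind together the sub-results that share a diet.
results <- lapply(split(models, diets), function(group) do.call(rbind, group))

names(results)
```

The number of models and the number of aggregates both come from the data, which is the point of the demo above.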

Performance

Dynamic branching makes drake_config() and outdated() much faster, which makes make() much faster to initialize. However, when it comes to building targets, the per-target overhead for dynamic sub-targets can be twice that of static branching. Not such a big deal for a first implementation. After all, it took a long time to get static branching as fast as it is. But performance will always be a long-term project.

Related GitHub issues and pull requests

Checklist

@codecov-io

codecov-io commented Nov 3, 2019

Codecov Report

Merging #1042 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #1042    +/-   ##
=======================================
  Coverage     100%    100%            
=======================================
  Files          47      47            
  Lines        6737    7161   +424     
=======================================
+ Hits         6737    7161   +424
Impacted Files            Coverage Δ
R/transform_plan.R        100% <ø> (ø) ⬆️
R/manage_memory.R         100% <100%> (ø) ⬆️
R/drake_meta_.R           100% <100%> (ø) ⬆️
R/handle_triggers.R       100% <100%> (ø) ⬆️
R/hpc.R                   100% <100%> (ø) ⬆️
R/drake_plan_helpers.R    100% <100%> (ø) ⬆️
R/drake_config.R          100% <100%> (ø) ⬆️
R/local_build.R           100% <100%> (ø) ⬆️
R/dynamic.R               100% <100%> (ø) ⬆️
R/drake_graph_info.R      100% <100%> (ø) ⬆️
... and 16 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4aa702...c809e2b.

@wlandau wlandau mentioned this pull request Nov 3, 2019
@wlandau
Member Author

wlandau commented Nov 3, 2019

More details and discussion: #685 (comment)

@wlandau wlandau merged commit 5766e3a into master Nov 3, 2019
@wlandau wlandau deleted the 685 branch November 3, 2019 01:31
@wlandau
Member Author

wlandau commented Nov 3, 2019

Forgot to mention: for anything without a dim(), we map over the elements. For anything with a dim(), we map over the first margin. This makes it super easy to map over the rows.

library(drake)
library(tibble)
plan <- drake_plan(
  x = as_tibble(mtcars),
  y = target(x, dynamic = map(x))
)

make(plan, verbose = 0L)

readd(y, subtargets = 1:2)
#> [[1]]
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
#> 
#> [[2]]
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9  2.88  17.0     0     1     4     4

Created on 2019-11-02 by the reprex package (v0.3.0)
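
To make the slicing rule concrete, here is a small base R sketch. The helper `slice_first_margin()` is hypothetical and not part of drake; it only mimics the rule as stated above (elements when there is no dim(), first margin otherwise).

```r
# Hypothetical helper, not drake's internals: return the i-th piece
# under the rule described above.
slice_first_margin <- function(x, i) {
  if (is.null(dim(x))) {
    return(x[i]) # no dim(): take the i-th element
  }
  # dim(): index the first margin and keep every other margin whole.
  index <- c(list(i), rep(list(TRUE), length(dim(x)) - 1L))
  do.call(`[`, c(list(x), index, list(drop = FALSE)))
}

slice_first_margin(letters, 2)                             # a single letter
dim(slice_first_margin(mtcars, 1))                         # one row, all columns
dim(slice_first_margin(array(1:24, dim = c(2, 3, 4)), 2))  # one slice of a 3-d array
```

Using `drop = FALSE` keeps a one-row data frame as a data frame, which matches the one-row tibbles in the output above.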

@wlandau wlandau mentioned this pull request Nov 3, 2019
16 tasks
@kendonB
Contributor

kendonB commented Nov 3, 2019

for anything without a dim(), we map over the elements. For anything with a dim(), we map over the first margin. This makes it super easy to map over the rows.

I anticipate that this will cause confusion in the case of data.frames. It might be better to have explicit control over which way the iteration goes. This is especially true for map(), because the purrr function of the same name iterates with [[, i.e. over columns in the case of data.frames.

Consider making the map dimension variable with the default being what purrr would do?
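
A quick base R illustration of the difference in question (a sketch with a made-up data frame; `lapply()` walks a data frame the same way `purrr::map()` does, column by column):

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

# lapply(), like purrr::map(), extracts with [[ and so visits columns.
by_column <- lapply(df, class)

# Row-wise iteration has to be spelled out, e.g. by indexing the first margin.
by_row <- lapply(seq_len(nrow(df)), function(i) df[i, , drop = FALSE])

length(by_column) # one entry per column
length(by_row)    # one entry per row
```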

@kendonB
Contributor

kendonB commented Nov 3, 2019

This looks really cool!

How would I get the subtarget associated with a particular value of country here?

readd(results, subtargets = seq_len(2))

@wlandau
Member Author

wlandau commented Nov 3, 2019

Re #1042 (comment), I will think about it, but please do not get your hopes up. Row-oriented workflows come up a lot more than looping over columns, so I want the former to take priority. And other workarounds like map_rows() have even more explaining to do when we start thinking about mapping over a mix of data frames and non-data-frames.

Anyway, here is how to get the model for a country and the aggregate for a continent. There is no general way to assign suggestive names during make(), so drake does not even try. (Consistency in all cases is better than convenience in some.) When you do read the sub-target, please take a look at the country or continent column to make sure you read the right value.

library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(gapminder)

fit_model <- function(dataset, country) {
  dataset %>%
    filter(country == !!country) %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(country = !!country)
}

plan <- drake_plan(
  dataset = gapminder,
  country = unique(dataset$country),
  model = target(
    fit_model(dataset, country),
    dynamic = map(country)
  ),
  continent = dataset %>%
    distinct(continent, country) %>%
    pull(continent),
  results = target(
    bind_rows(model) %>%
      mutate(continent = continent),
    dynamic = combine(model, .by = continent)
  )
)

make(plan, verbose = 0L)

countries <- readd(country)
index <- which(countries == "Philippines")
readd(model, subtargets = index)[[1]]
#> # A tibble: 2 x 6
#>   term        estimate std.error statistic    p.value country    
#>   <chr>          <dbl>     <dbl>     <dbl>      <dbl> <fct>      
#> 1 (Intercept) -53730.    6773.       -7.93 0.0000127  Philippines
#> 2 year            28.2      3.42      8.25 0.00000895 Philippines

continents <- unique(readd(continent))
index <- which(continents == "Oceania")
readd(results, subtargets = index)
#> [[1]]
#> # A tibble: 4 x 7
#>   term       estimate std.error statistic      p.value country    continent
#>   <chr>         <dbl>     <dbl>     <dbl>        <dbl> <fct>      <fct>    
#> 1 (Intercep… -824973.   47418.      -17.4      8.34e-9 Australia  Oceania  
#> 2 year           427.      24.0      17.8      6.61e-9 Australia  Oceania  
#> 3 (Intercep… -459336.   26886.      -17.1      9.96e-9 New Zeala… Oceania  
#> 4 year           241.      13.6      17.7      6.96e-9 New Zeala… Oceania

Created on 2019-11-03 by the reprex package (v0.3.0)

@wlandau
Member Author

wlandau commented Nov 3, 2019

FYI, I decided to extend dynamic combine() to work with static targets. Could be useful in non-row-wise data splitting situations.

library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(gapminder)

fit_model <- function(dataset, country) {
  dataset %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(country = !!country)
}

plan <- drake_plan(
  dataset = gapminder,
  country = dataset$country,
  model = target(
    fit_model(dataset, country),
    dynamic = combine(dataset, .by = country)
  )
)

make(plan, verbose = 0L)

readd(model, subtargets = 1)
#> [[1]]
#> # A tibble: 2 x 6
#>   term        estimate std.error statistic p.value country    
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <fct>      
#> 1 (Intercept) 1675.      3747.       0.447   0.664 Afghanistan
#> 2 year          -0.441      1.89    -0.233   0.821 Afghanistan

Created on 2019-11-03 by the reprex package (v0.3.0)

@wlandau
Member Author

wlandau commented Nov 4, 2019

New chapter in the manual: https://ropenscilabs.github.io/drake-manual/dynamic.html

@mik3y64

mik3y64 commented Nov 4, 2019

Thanks wlandau, you are killing it. Dynamic branching is here! This is a big improvement for large plans.

Now we have two choices: static and dynamic branching. One thing confuses me. As the new chapter of the manual mentions, dynamic branching has many benefits, from faster drake_config() and outdated() to the flexibility of using targets that are not known in advance, and now dynamic branching can even be used with static targets?

# from drake manual
plan <- drake_plan(
  static_numbers = seq_len(2),
  static_letters = c("a", "b"),
  dynamic_lowercase = target(
    paste0(static_numbers, static_letters),
    dynamic = map(static_numbers, static_letters)
  ),
  dynamic_uppercase = target(
    toupper(dynamic_lowercase),
    dynamic = map(dynamic_lowercase)
  )
)

What are the pros and cons of static and dynamic branching and how do we choose between them? Can dynamic branching be a drop-in replacement for static branching in most cases?

@wlandau
Member Author

wlandau commented Nov 4, 2019

With dynamic branching, make() is faster to initialize, and you have far more flexibility. With static branching, you have meaningful target names, and it is easier to predict what the plan is going to do in advance. I think that is the crux of it. As you say, there is a ton of room for overlap and personal judgement.

In fact, you can even use static and dynamic branching together! If you do, I recommend static map() for small expansions and dynamic map() for large expansions. Sketch:

library(drake)

plan <- drake_plan(
  locations = get_location_db(), # many rows
  coordinates = target(
    my_geocode_fn(locations, method),
    dynamic = map(locations),
    transform = map(method = c("address", "postal", "boundary"))
  ),
  results = target(
    bind_rows(coordinates),
    dynamic = combine(coordinates),
    transform = combine(coordinates)
  )
)

plot(plan)

Created on 2019-11-04 by the reprex package (v0.3.0)

@wlandau
Member Author

wlandau commented Nov 4, 2019

And thanks for the question, @mik3y64. I think this is a good one for the manual.
