Dynamic branching #1042

wlandau · 2019-11-03T00:40:08Z

Summary

At long last, drake is capable of dynamic branching! You can now define new targets while make() is running.

Here is a demo that fits GDP per capita vs year for each country in the Gapminder dataset. We let the data decide how many targets we have and what those targets mean.

library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(gapminder) # https://github.com/jennybc/gapminder

# We need a function to fit the model.
fit_model <- function(dataset, country) {
  dataset %>%
    filter(country == !!country) %>% # The !! is important.
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(country = !!country)
}

plan <- drake_plan(
  # This dataset can change, and we want
  # the downstream targets to update.
  dataset = gapminder,
  
  # We need a grouping variable for our dynamic
  # targets (e.g. models).
  country = unique(dataset$country),
  
  # Fit GDP vs year for each country.
  model = target(
    fit_model(dataset, country),
    dynamic = map(country)
  ),
  
  # Get the continent of each model's country.
  continent = dataset %>%
    distinct(continent, country) %>%
    pull(continent),
  
  # Aggregate the models by continent.
  results = target(
    bind_rows(model) %>%
      mutate(continent = continent),
    dynamic = combine(model, .by = continent)
  )
)

# The targets have cryptic names,
# but there are good reasons to use hashes as suffixes.
# I will discuss if someone wants to.
make(plan)
#> target dataset
#> target country
#> target continent
#> subtarget model_94c60fa4
#> subtarget model_d80d49db
#> subtarget model_1dc4c7bc
#> subtarget model_63e3125f
#> subtarget model_29837144
#> subtarget model_cdf5fd51
#> subtarget model_4c0e0b94
#> subtarget model_973db4f8
#> subtarget model_d61b934b
#### Lots more models, manually cropped.
#> subtarget results_d8941ed2
#> subtarget results_57a69204
#> subtarget results_100845a9
#> subtarget results_265c50a5
#> subtarget results_f4c57f21

# loadd() and readd() understand dynamic targets,
# so you do not need to deal with those funny names.
readd(results, subtargets = seq_len(2))
#> [[1]]
#> # A tibble: 66 x 7
#>    term          estimate std.error statistic  p.value country    continent
#>    <chr>            <dbl>     <dbl>     <dbl>    <dbl> <fct>      <fct>    
#>  1 (Intercep…    1675.      3747.       0.447  6.64e-1 Afghanist… Asia     
#>  2 year            -0.441      1.89    -0.233  8.21e-1 Afghanist… Asia     
#>  3 (Intercep… -535201.     68897.      -7.77   1.52e-5 Bahrain    Asia     
#>  4 year           280.        34.8      8.03   1.14e-5 Bahrain    Asia     
#>  5 (Intercep…  -19975.      4837.      -4.13   2.05e-3 Bangladesh Asia     
#>  6 year            10.5        2.44     4.30   1.57e-3 Bangladesh Asia     
#>  7 (Intercep…  -30194.      7822.      -3.86   3.16e-3 Cambodia   Asia     
#>  8 year            15.6        3.95     3.95   2.75e-3 Cambodia   Asia     
#>  9 (Intercep… -127509.     24515.      -5.20   4.01e-4 China      Asia     
#> 10 year            65.2       12.4      5.26   3.67e-4 China      Asia     
#> # … with 56 more rows
#> 
#> [[2]]
#> # A tibble: 60 x 7
#>    term      estimate std.error statistic  p.value country        continent
#>    <chr>        <dbl>     <dbl>     <dbl>    <dbl> <fct>          <fct>    
#>  1 (Interc…   -1.05e5   23496.      -4.45 1.24e- 3 Albania        Europe   
#>  2 year        5.45e1      11.9      4.59 9.98e- 4 Albania        Europe   
#>  3 (Interc…   -1.04e6   20543.     -50.5  2.23e-13 Austria        Europe   
#>  4 year        5.35e2      10.4     51.5  1.84e-13 Austria        Europe   
#>  5 (Interc…   -8.98e5   24093.     -37.3  4.59e-12 Belgium        Europe   
#>  6 year        4.64e2      12.2     38.1  3.69e-12 Belgium        Europe   
#>  7 (Interc…   -1.92e5   27479.      -7.00 3.73e- 5 Bosnia and He… Europe   
#>  8 year        9.89e1      13.9      7.12 3.20e- 5 Bosnia and He… Europe   
#>  9 (Interc…   -2.07e5   46116.      -4.48 1.17e- 3 Bulgaria       Europe   
#> 10 year        1.08e2      23.3      4.62 9.48e- 4 Bulgaria       Europe   
#> # … with 50 more rows

# The graphs for large workflows are much simpler to look at now.
config <- drake_config(plan)
vis_drake_graph(config)

# Side note: plans have a plot method now.
plot(plan)

Created on 2019-11-03 by the reprex package (v0.3.0)

Performance

Dynamic branching makes drake_config() and outdated() much faster, which makes make() much faster to initialize. However, when it comes to building targets, the per-target overhead for dynamic sub-targets can be twice that of static branching. Not such a big deal for a first implementation. After all, it took a long time to get static branching as fast as it is. But performance will always be a long-term project.

Related GitHub issues and pull requests

Ref: Dynamic branching #685

Checklist

I understand and agree to drake's code of conduct.
I have listed any substantial changes in the development news.
I have added testthat unit tests to tests/testthat for any new functionality.
This pull request is not a draft.

codecov-io · 2019-11-03T01:01:22Z

Codecov Report

Merging #1042 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #1042    +/-   ##
=======================================
  Coverage     100%    100%            
=======================================
  Files          47      47            
  Lines        6737    7161   +424     
=======================================
+ Hits         6737    7161   +424

Impacted Files	Coverage Δ
R/transform_plan.R	`100% <ø> (ø)`	⬆️
R/manage_memory.R	`100% <100%> (ø)`	⬆️
R/drake_meta_.R	`100% <100%> (ø)`	⬆️
R/handle_triggers.R	`100% <100%> (ø)`	⬆️
R/hpc.R	`100% <100%> (ø)`	⬆️
R/drake_plan_helpers.R	`100% <100%> (ø)`	⬆️
R/drake_config.R	`100% <100%> (ø)`	⬆️
R/local_build.R	`100% <100%> (ø)`	⬆️
R/dynamic.R	`100% <100%> (ø)`	⬆️
R/drake_graph_info.R	`100% <100%> (ø)`	⬆️
... and 16 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4aa702...c809e2b.

wlandau · 2019-11-03T01:31:30Z

More details and discussion: #685 (comment)

wlandau · 2019-11-03T01:35:28Z

Forgot to mention: for anything without a dim(), we map over the elements. For anything with a dim(), we map over the first margin. This makes it super easy to map over the rows.

library(drake)
library(tibble)
plan <- drake_plan(
  x = as_tibble(mtcars),
  y = target(x, dynamic = map(x))
)

make(plan, verbose = 0L)

readd(y, subtargets = 1:2)
#> [[1]]
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
#> 
#> [[2]]
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9  2.88  17.0     0     1     4     4

Created on 2019-11-02 by the reprex package (v0.3.0)
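
For readers who want to see the slicing rule in isolation, here is a minimal base R sketch. It is an illustration only, not drake's internal implementation: `slice_like_dynamic_map()` is a hypothetical helper name, and it only handles vectors, lists, and two-dimensional objects.

```r
# Hypothetical helper, NOT part of drake. It mimics the rule described above:
# no dim()   -> one slice per element
# has dim()  -> one slice per index of the first margin (rows)
slice_like_dynamic_map <- function(x) {
  if (is.null(dim(x))) {
    lapply(seq_along(x), function(i) x[i])
  } else {
    # Assumes a two-dimensional object such as a matrix or data frame.
    lapply(seq_len(dim(x)[1]), function(i) x[i, , drop = FALSE])
  }
}

# A character vector has no dim(), so we get one slice per element.
vec_slices <- slice_like_dynamic_map(c("a", "b", "c"))
length(vec_slices)

# A data frame has a dim(), so we get one slice per row.
df_slices <- slice_like_dynamic_map(head(mtcars, 4))
length(df_slices)
dim(df_slices[[1]])
```

Each data frame slice keeps all of its columns because of `drop = FALSE`, which matches the one-row tibbles shown in the output above.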

kendonB · 2019-11-03T18:13:04Z

for anything without a dim(), we map over the elements. For anything with a dim(), we map over the first margin. This makes it super easy to map over the rows.

I anticipate that this will cause confusion in the case of data.frames. It might be better to have explicit control over which way the iteration goes. This is especially true for map(), since purrr's map() iterates with [[, i.e. over columns in the case of data.frames.

Consider making the map dimension configurable, with the default matching what purrr would do?
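
To make the concern concrete, here is a small base R sketch of the two iteration orders. It uses `lapply()`, which treats a data frame as a list of columns in the same way `purrr::map()` does; the toy data frame `df` is made up for illustration and nothing here calls drake.

```r
# Toy data frame, for illustration only.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# List-style iteration (lapply(), and likewise purrr::map()):
# a data frame is a list of columns, so there is one step per column.
by_column <- lapply(df, length)
names(by_column)

# Row-style iteration (the behavior described for dynamic map()):
# one step per row, each step receiving a one-row data frame.
by_row <- lapply(seq_len(nrow(df)), function(i) df[i, , drop = FALSE])
length(by_row)
```

The same object therefore yields two steps under list-style iteration and three under row-style iteration, which is the mismatch the comment above is worried about.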

kendonB · 2019-11-03T18:34:42Z

This looks really cool!

How would I get the subtarget associated with a particular value of country here?

readd(results, subtargets = seq_len(2))

wlandau · 2019-11-03T19:50:42Z

Re #1042 (comment), I will think about it, but please do not get your hopes up. Row-oriented workflows come up a lot more than looping over columns, so I want the former to take priority. And other workarounds like map_rows() have even more explaining to do when we start thinking about mapping over a mix of data frames and non-data-frames.

Anyway, here is how to get the model for a country and the aggregate for a continent. There is no general way to assign suggestive names during make(), so drake does not even try. (Consistency in all cases is better than convenience in some.) When you do read the sub-target, please take a look at the country or continent column to make sure you read the right value.

library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(gapminder)

fit_model <- function(dataset, country) {
  dataset %>%
    filter(country == !!country) %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(country = !!country)
}

plan <- drake_plan(
  dataset = gapminder,
  country = unique(dataset$country),
  model = target(
    fit_model(dataset, country),
    dynamic = map(country)
  ),
  continent = dataset %>%
    distinct(continent, country) %>%
    pull(continent),
  results = target(
    bind_rows(model) %>%
      mutate(continent = continent),
    dynamic = combine(model, .by = continent)
  )
)

make(plan, verbose = 0L)

countries <- readd(country)
index <- which(countries == "Philippines")
readd(model, subtargets = index)[[1]]
#> # A tibble: 2 x 6
#>   term        estimate std.error statistic    p.value country    
#>   <chr>          <dbl>     <dbl>     <dbl>      <dbl> <fct>      
#> 1 (Intercept) -53730.    6773.       -7.93 0.0000127  Philippines
#> 2 year            28.2      3.42      8.25 0.00000895 Philippines

continents <- unique(readd(continent))
index <- which(continents == "Oceania")
readd(results, subtargets = index)
#> [[1]]
#> # A tibble: 4 x 7
#>   term       estimate std.error statistic      p.value country    continent
#>   <chr>         <dbl>     <dbl>     <dbl>        <dbl> <fct>      <fct>    
#> 1 (Intercep… -824973.   47418.      -17.4      8.34e-9 Australia  Oceania  
#> 2 year           427.      24.0      17.8      6.61e-9 Australia  Oceania  
#> 3 (Intercep… -459336.   26886.      -17.1      9.96e-9 New Zeala… Oceania  
#> 4 year           241.      13.6      17.7      6.96e-9 New Zeala… Oceania

Created on 2019-11-03 by the reprex package (v0.3.0)

wlandau · 2019-11-03T20:54:09Z

FYI, I decided to extend dynamic combine() to work with static targets. Could be useful in non-row-wise data splitting situations.

library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(gapminder)

fit_model <- function(dataset, country) {
  dataset %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(country = !!country)
}

plan <- drake_plan(
  dataset = gapminder,
  country = dataset$country,
  model = target(
    fit_model(dataset, country),
    dynamic = combine(dataset, .by = country)
  )
)

make(plan, verbose = 0L)

readd(model, subtargets = 1)
#> [[1]]
#> # A tibble: 2 x 6
#>   term        estimate std.error statistic p.value country    
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <fct>      
#> 1 (Intercept) 1675.      3747.       0.447   0.664 Afghanistan
#> 2 year          -0.441      1.89    -0.233   0.821 Afghanistan

Created on 2019-11-03 by the reprex package (v0.3.0)
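
As a rough mental model, combining a static target by a grouping variable resembles the split-apply pattern in base R. The sketch below is an analogy under that assumption, not a description of how drake implements it. It uses the built-in `mtcars` data in place of gapminder, and the order of groups produced by `split()` may differ from the order of drake's sub-targets.

```r
# Analogy only: split a data frame by a grouping vector,
# then fit one linear model per group.
groups <- split(mtcars, mtcars$cyl)

# One set of coefficients (intercept and slope) per group.
fits <- lapply(groups, function(d) coef(lm(mpg ~ wt, data = d)))

# One entry per distinct value of the grouping variable.
names(fits)
```

The difference in drake is that each group becomes its own sub-target, so drake can build and cache the groups individually rather than in one in-memory loop.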

wlandau · 2019-11-04T05:37:13Z

New chapter in the manual: https://ropenscilabs.github.io/drake-manual/dynamic.html

mik3y64 · 2019-11-04T10:39:31Z

Thanks wlandau, you are killing it. Dynamic branching is here! This is a big improvement for large plans.

Now we have two choices: static and dynamic branching. One thing I am confused about: as the new chapter of the manual mentions, dynamic branching has many benefits, from faster drake_config() and outdated() to the flexibility of working with targets that are not known in advance, and now dynamic branching can even be used with static targets?

# from drake manual
plan <- drake_plan(
  static_numbers = seq_len(2),
  static_letters = c("a", "b"),
  dynamic_lowercase = target(
    paste0(static_numbers, static_letters),
    dynamic = map(static_numbers, static_letters)
  ),
  dynamic_uppercase = target(
    toupper(dynamic_lowercase),
    dynamic = map(dynamic_lowercase)
  )
)

What are the pros and cons of static and dynamic branching and how do we choose between them? Can dynamic branching be a drop-in replacement for static branching in most cases?

wlandau · 2019-11-04T14:12:29Z

With dynamic branching, make() is faster to initialize, and you have far more flexibility. With static branching, you have meaningful target names, and it is easier to predict what the plan is going to do in advance. I think that is the crux of it. As you say, there is a ton of room for overlap and personal judgement.

In fact, you can even use static and dynamic branching together! If you do, I recommend static map() for small expansions and dynamic map() for large expansions. Sketch:

library(drake)

plan <- drake_plan(
  locations = get_location_db(), # many rows
  coordinates = target(
    my_geocode_fn(locations, method),
    dynamic = map(locations),
    transform = map(method = c("address", "postal", "boundary"))
  ),
  results = target(
    bind_rows(coordinates),
    dynamic = combine(coordinates),
    transform = combine(coordinates)
  )
)

plot(plan)

Created on 2019-11-04 by the reprex package (v0.3.0)

wlandau · 2019-11-04T14:29:14Z

And thanks for the question, @mik3y64. I think this is a good one for the manual.

wlandau-lilly and others added 30 commits October 22, 2019 11:34

Start work on dynamic builds in local computing

7935f9b

Lints

1f0d872

Change implementation tactics

9658938

Change names

048652c

Rethink S3 backend for dynamic builds

d91974d

Define a generic method

f1ec043

Focus on the number of subtargets

c6b8ce1

Rethink dynamic S3 again

25cc76d

Stop here for now.

fa04281

Rm blank line

ab4e22a

Add comment

1de1ddb

Avoid enclos

a1fe8ae

Spaces

43d95db

Change envir used

758d908

More sketches

70844f2

Repair a test

53740cb

Change the names of drake_config() environments

bfc3578

Lints

bee1316

Write note to self

64265e4

Add to do's

e92a7c7

Take a different tactic with local dynamic targets

2eb6ad7

Minor adjustment

eed251e

Register dynamic targets

e55055e

Move garbage collection

b929570

Move gc

dbdbbd5

Delay

aee9927

Need to augment individual manage_deps methods

4503402

Preliminary work on memory strategies for #685

616556e

Ignore custom seeds for dynamic targets

20cadce

Set a reasonable seed for dynamic subtargets

613baa9

Based on the global seed, a user-supplied target seed, and the subtarget name.

wlandau-lilly added 5 commits November 2, 2019 17:16

Fix a tricky bug in updating the priority queue

3681ed9

Fix a bug in the hpc testing suite

f4ae0fc

Do not excessively invalidate sub-targets

66d03f0

This was a performance bottleneck

Enhance a test

e74ca9f

Reduce logging

c893160

wlandau self-assigned this Nov 3, 2019

wlandau added status: priority difficulty: advanced topic: api labels Nov 3, 2019

Bump version

c809e2b

wlandau mentioned this pull request Nov 3, 2019

Dynamic branching ropensci-books/drake#126

Closed

wlandau mentioned this pull request Nov 3, 2019

Dynamic branching #685

Closed

wlandau merged commit 5766e3a into master Nov 3, 2019

wlandau deleted the 685 branch November 3, 2019 01:31

wlandau mentioned this pull request Nov 3, 2019

drake friendly alternative to dplyr::do #77

Closed

16 tasks

wlandau mentioned this pull request Nov 9, 2019

Dynamic trace and helper functions #1055

Merged

4 tasks

wlandau mentioned this pull request Jan 10, 2020

Improved infrastructure for drake_envir() #1132

Closed

2 tasks
