Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutate() duplicate the SQL command when assigning back to the same column #605

Closed
nathaneastwood opened this issue Mar 2, 2021 · 4 comments

Comments

@nathaneastwood
Copy link

nathaneastwood commented Mar 2, 2021

See the following example:

For tbl_spark (and possibly tbl_lazy) objects, the function to be applied by across() and mutate_at() occurs twice:

sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- dplyr::copy_to(sc, mtcars, "mtcars")
dbplyr_res <- dplyr::mutate(
  .data = mtcars_spark,
  dplyr::across(.cols = "mpg", .fns = sum)
)
dbplyr_res
# # Source: spark<?> [?? x 11]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1 20573.     6  160    110  3.9   2.62  16.5     0     1     4     4
#  2 20573.     6  160    110  3.9   2.88  17.0     0     1     4     4
#  3 20573.     4  108     93  3.85  2.32  18.6     1     1     4     1
#  4 20573.     6  258    110  3.08  3.22  19.4     1     0     3     1
#  5 20573.     8  360    175  3.15  3.44  17.0     0     0     3     2
#  6 20573.     6  225    105  2.76  3.46  20.2     1     0     3     1
#  7 20573.     8  360    245  3.21  3.57  15.8     0     0     3     4
#  8 20573.     4  147.    62  3.69  3.19  20       1     0     4     2
#  9 20573.     4  141.    95  3.92  3.15  22.9     1     0     4     2
# 10 20573.     6  168.   123  3.92  3.44  18.3     1     0     4     4
# # … with more rows

dbplyr_res %>% dbplyr::sql_render()                                                                                                                                        
# <SQL> SELECT SUM(`mpg`) OVER () AS `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
# FROM (SELECT SUM(`mpg`) OVER () AS `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
# FROM `mtcars`) `q01`

For data.frames the summation is calculated correctly:

dplyr::mutate(
  .data = mtcars,
  dplyr::across(.cols = "mpg", .fns = sum)
)
#                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4           642.9   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag       642.9   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710          642.9   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive      642.9   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout   642.9   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Valiant             642.9   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Duster 360          642.9   8 360.0 245 3.21 3.570 15.84  0  0    3    4
# Merc 240D           642.9   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# Merc 230            642.9   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# ...

System info:

packageVersion("dbplyr")                                                                                                                                          
# [1] ‘2.1.0’
packageVersion("dplyr")                                                                                                                                           
# [1] ‘1.0.4’
@nathaneastwood
Copy link
Author

I did some more digging into this issue and it's a lot more serious than I first thought and I have some guidance on where to look for the solution. The problem seems to be strictly related to using rlang. Consider the following 2 examples (output shortened for brevity):

sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- dplyr::copy_to(sc, mtcars, "mtcars")

# This is fine
mtcars_spark %>% mutate(mpg = mpg * 2)
# # Source: spark<?> [?? x 11]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  42       6  160    110  3.9   2.62  16.5     0     1     4     4
#  2  42       6  160    110  3.9   2.88  17.0     0     1     4     4
# # ... with more rows

# This is also fine
mtcars_spark %>% mutate(mpg2 = mpg * 2)
# # Source: spark<?> [?? x 12]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  mpg2
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4  42  
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4  42
# # ... with more rows

The issue starts when we use rlang to reassign the column name back onto itself.

col <- "mpg"
mtcars_spark %>% mutate(!!col := !!rlang::sym(col) * 2)
# # Source: spark<?> [?? x 11]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  84       6  160    110  3.9   2.62  16.5     0     1     4     4
#  2  84       6  160    110  3.9   2.88  17.0     0     1     4     4
# # ... with more rows

As you can see, the mpg column has been summed twice. So let's take a look:

sparklyr:::mutate.tbl_spark <- function (.data, ...) {
  dots <- rlang::enquos(..., .named = TRUE) %>% fix_na_real_values()
  do.call(NextMethod, dots)
}

In the above example which uses rlang, the value of dots is

dots
# <list_of<quosure>>
# 
# $mpg
# <quosure>
# expr: ^mpg * 2
# env:  0x55d394b19460

So now we have dbplyr:::mutate.tbl_lazy() as the next method

dbplyr:::mutate.tbl_lazy <- function (.data, ...) {
  dots <- quos(..., .named = TRUE)
  dots <- partial_eval_dots(dots, vars = op_vars(.data))
  nest_vars(.data, dots, union(op_vars(.data), op_grps(.data)))
}

The value of dots here is:

dots                                                                                                                                                       
# $mpg
# <quosure>
# expr: ^mpg * 2
# env:  0x55d3934a4e00
# 
# $mpg
# <quosure>
# expr: ^mpg * 2
# env:  0x55d3934a4e00

And so the calculation is performed twice on the same column. We can see the from the rendered SQL code too:

mtcars_spark %>% mutate(!!col := !!rlang::sym(col) * 2) %>% dbplyr::sql_render()
# <SQL> SELECT `mpg` * 2.0 AS `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
# FROM (SELECT `mpg` * 2.0 AS `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
# FROM `mtcars`) `q01`

If we aren't assigning to the same column, however, this seems to "fix" the issue:

col <- "mpg2"
mtcars_spark %>% mutate(!!col := mpg * 2)
# # Source: spark<?> [?? x 12]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  mpg2
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4  42  
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4  42

The SQL which proves the column isn't being calculated twice:

mtcars_spark %>% mutate(!!col := mpg * 2) %>% dbplyr::sql_render()
# <SQL> SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `mpg` * 2.0 AS `mpg2`
# FROM `mtcars`

@nathaneastwood nathaneastwood changed the title across() and mutate_at() duplicate the SQL command mutate() duplicate the SQL command when assigning back to the same column Mar 4, 2021
@nathaneastwood
Copy link
Author

I did some more digging and it seems that this is a sparklyr issue, not a dbplyr issue. It was introduced here: sparklyr/sparklyr@2af35d0. FWIW, I can overload this method and the tbl_lazy() method works as expected

#' @export
mutate.tbl_spark <- function(.data, ...) {
  NextMethod("mutate", .data)
}

Tagging @yitao-li so he knows about this.

@nathaneastwood
Copy link
Author

I tried out the dev version of sparklyr and this issue was fixed in sparklyr/sparklyr@a84e00d however the problem still persists in transmute(). I issues a PR to temporarily fix it.

@yitao-li
Copy link
Contributor

yitao-li commented Mar 8, 2021

I can confirm this is a sparklyr issue as well.

EDIT: it has been fixed with sparklyr/sparklyr#2960

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants