mutate() duplicates the SQL command when assigning back to the same column #605
Comments
I did some more digging into this issue. It's a lot more serious than I first thought, and I have some guidance on where to look for the solution. The problem seems to be strictly related to using rlang. Consider the following 2 examples (output shortened for brevity):
library(dplyr)  # provides mutate() and %>%
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- dplyr::copy_to(sc, mtcars, "mtcars")
# This is fine
mtcars_spark %>% mutate(mpg = mpg * 2)
# # Source: spark<?> [?? x 11]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 42 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 42 6 160 110 3.9 2.88 17.0 0 1 4 4
# # ... with more rows
# This is also fine
mtcars_spark %>% mutate(mpg2 = mpg * 2)
# # Source: spark<?> [?? x 12]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 42
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 42
# # ... with more rows
The issue starts when we use rlang to reassign the column name back onto itself.
col <- "mpg"
mtcars_spark %>% mutate(!!col := !!rlang::sym(col) * 2)
# # Source: spark<?> [?? x 11]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 84 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 84 6 160 110 3.9 2.88 17.0 0 1 4 4
# # ... with more rows
As you can see, the mpg column has been doubled twice (21 became 84 rather than 42). So let's take a look:
sparklyr:::mutate.tbl_spark <- function (.data, ...) {
dots <- rlang::enquos(..., .named = TRUE) %>% fix_na_real_values()
do.call(NextMethod, dots)
}
In the above example, which uses rlang, the value of dots is:
# <list_of<quosure>>
#
# $mpg
# <quosure>
# expr: ^mpg * 2
# env: 0x55d394b19460
So now we have:
dbplyr:::mutate.tbl_lazy <- function (.data, ...) {
dots <- quos(..., .named = TRUE)
dots <- partial_eval_dots(dots, vars = op_vars(.data))
nest_vars(.data, dots, union(op_vars(.data), op_grps(.data)))
}
The value of dots is now:
# $mpg
# <quosure>
# expr: ^mpg * 2
# env: 0x55d3934a4e00
#
# $mpg
# <quosure>
# expr: ^mpg * 2
# env: 0x55d3934a4e00
And so the calculation is performed twice on the same column. We can see this from the rendered SQL code too:
mtcars_spark %>% mutate(!!col := !!rlang::sym(col) * 2) %>% dbplyr::sql_render()
# <SQL> SELECT `mpg` * 2.0 AS `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
# FROM (SELECT `mpg` * 2.0 AS `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
# FROM `mtcars`) `q01`
If we aren't assigning to the same column, however, this seems to "fix" the issue:
col <- "mpg2"
mtcars_spark %>% mutate(!!col := mpg * 2)
# # Source: spark<?> [?? x 12]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 42
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 42
The rendered SQL confirms that the column isn't being calculated twice:
mtcars_spark %>% mutate(!!col := mpg * 2) %>% dbplyr::sql_render()
# <SQL> SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `mpg` * 2.0 AS `mpg2`
# FROM `mtcars`
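For comparison, here is a minimal sketch of the same rlang pattern on a local data.frame, with no Spark involved. It assumes only that dplyr and rlang are installed; `local_result` is just an illustrative name, not something from sparklyr or dbplyr. On a plain data.frame the column should be doubled exactly once, which is the behaviour one would expect the Spark version to match.

```r
library(dplyr)

col <- "mpg"

# Same unquoting pattern as above, but on the in-memory mtcars data.frame
local_result <- mtcars %>% mutate(!!col := !!rlang::sym(col) * 2)

# Check that the column was doubled once, not twice
stopifnot(isTRUE(all.equal(local_result$mpg, mtcars$mpg * 2)))

head(local_result$mpg, 2)
```

If the same check against a collected `tbl_spark` fails, that points at the Spark code path rather than at the rlang expression itself.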
I did some more digging and it seems that this is a sparklyr issue, not a dbplyr issue. It was introduced here: sparklyr/sparklyr@2af35d0. FWIW, I can overload this method:
#' @export
mutate.tbl_spark <- function(.data, ...) {
NextMethod("mutate", .data)
}
Tagging @yitao-li so he knows about this.
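One rough way to check the "sparklyr, not dbplyr" claim without a Spark connection is to render the same mutate through dbplyr alone. This is only a sketch: it assumes dbplyr's `lazy_frame()` with its default simulated connection is a fair stand-in for a remote table, and `lf`, `sql` and `n_select` are illustrative names. If dbplyr by itself is not at fault, I would expect the rendered query to contain a single SELECT rather than a nested one.

```r
library(dplyr)

col <- "mpg"

# A lazy table on a simulated connection; no query is ever executed
lf <- dbplyr::lazy_frame(mpg = 21, cyl = 6)

# Same self-assignment via rlang as in the Spark example
sql <- lf %>%
  mutate(!!col := !!rlang::sym(col) * 2) %>%
  dbplyr::sql_render()

# Count how many SELECT clauses appear in the rendered SQL
sql_text <- as.character(sql)
n_select <- lengths(regmatches(sql_text, gregexpr("SELECT", sql_text, fixed = TRUE)))

print(sql)
print(n_select)
```

A count greater than one here would suggest the nesting comes from dbplyr itself; a count of one is consistent with the duplication being introduced by the sparklyr method.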
I tried out the dev version of sparklyr and this issue was fixed in sparklyr/sparklyr@a84e00d; however, the problem still persists in
I can confirm this is a sparklyr issue as well. EDIT: it has been fixed with sparklyr/sparklyr#2960
See the following example:
For tbl_spark (and possibly tbl_lazy) objects, the function to be applied by across() and mutate_at() occurs twice:
For data.frames the summation is calculated correctly:
System info: