Overwriting an imported file will segfault (or garble the file) #1273

Closed
knokknok opened this issue Aug 14, 2021 · 4 comments
Comments

@knokknok

The following crashes on my computer (or corrupts the file):

library(readr)

file <- file.path("/tmp/tmp.csv")
write.csv(cars, file)

data <- read_csv(file)
data$speed <- 2*data$speed
write.csv(data, file=file)

sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_2.0.1

loaded via a namespace (and not attached):
 [1] compiler_4.1.1  magrittr_2.0.1  ellipsis_0.3.2  R6_2.5.0
 [5] hms_1.1.0       pillar_1.6.2    tibble_3.1.3    crayon_1.4.1
 [9] utf8_1.2.2      fansi_0.5.0     vctrs_0.3.8     tzdb_0.1.2
[13] lifecycle_1.0.0 pkgconfig_2.0.3 rlang_0.4.11
@martin-rueegg

(For a solution/workaround, see the "Solution" section further down this comment.)

Not sure this is related, but when I try to write to a previously imported file, I get the following error:

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :

This happens when I use readr::read_csv to read and utils::write.csv to write.

The stack trace is:

11. file(file, ifelse(append, "a", "w"))
10. write.table(COMICS1.ALL.df, COMICS1_KO_and_tara.ALL.csv, row.names = FALSE,
        col.names = TRUE, sep = ",", dec = ".", qmethod = "double")
 9. eval(expr, p)
 8. eval(expr, p)
 7. eval.parent(Call)
 6. write.csv(COMICS1.ALL.df, COMICS1_KO_and_tara.ALL.csv, row.names = FALSE) at COMICS1_KO_and_tara.ALL.R#310
 5. load.COMICS1.fn(force = 0, update = TRUE) at COMICS1_KO_and_tara.ALL.R#371
 4. eval(ei, envir)
 3. eval(ei, envir)
 2. withVisible(eval(ei, envir))
 1. source(paste0(comics1.1.KO_Tax.path, "/All kos/COMICS1_KO_and_tara.ALL.R"))

What is interesting is that, at that very moment, the file is open by (and only by) rsession.exe itself. So it seems to be related to lazy loading!

When using readr::read_csv2, the problem shows up when merging the imported data frame with another one: the merge never happens; instead, rsession.exe crashes and restarts continuously.

Solution

When I use

lazy = FALSE

# as in ...
readr::read_csv(file, lazy = FALSE)

then the error does not happen.

So it would be good to add a warning to the documentation that one needs to disable lazy loading if one intends to update/overwrite the file later.
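
For completeness, a minimal sketch of the safe round-trip pattern with this workaround (the temporary-file path and row.names = FALSE are just illustrative choices):

library(readr)

path <- tempfile(fileext = ".csv")
write.csv(cars, path, row.names = FALSE)

# Reading eagerly means readr no longer holds the file once read_csv() returns,
# so writing back to the same path afterwards is safe.
data <- read_csv(path, lazy = FALSE)
data$speed <- 2 * data$speed
write.csv(data, file = path, row.names = FALSE)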

Info

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.0 ggplot2_3.3.5     digest_0.6.27     tibble_3.1.3      tidyr_1.1.3       dplyr_1.0.7      
[7] readr_2.0.1      

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13  magrittr_2.0.1   hms_1.1.0        bit_4.0.4        munsell_0.5.0    tidyselect_1.1.1 colorspace_2.0-2
 [8] R6_2.5.0         rlang_0.4.11     fansi_0.5.0      stringr_1.4.0    tools_4.1.0      parallel_4.1.0   grid_4.1.0      
[15] vroom_1.5.4      gtable_0.3.0     utf8_1.2.2       withr_2.4.2      askpass_1.1      ellipsis_0.3.2   bit64_4.0.5     
[22] openssl_1.4.4    lifecycle_1.0.0  crayon_1.4.1     purrr_0.3.4      tzdb_0.1.2       vctrs_0.3.8      glue_1.4.2      
[29] stringi_1.7.3    compiler_4.1.0   pillar_1.6.2     scales_1.1.1     generics_0.1.0   pkgconfig_2.0.3 

@martin-rueegg

On a further note:

Since the default for readr::read_delim seems to be lazy = TRUE, there are other undesired side effects:

Because readr::read_delim keeps the file locked, any move/rename or update from an external source fails (be it the user manually editing the CSV file or a server synchronizing it).

We had recurring issues with this, but only now do I realize that they relate to lazy loading!
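
A rough sketch of the locking side effect (assuming Windows file-locking semantics; on macOS/Linux the rename may still succeed while the file is mapped):

library(readr)

path <- tempfile(fileext = ".csv")
write.csv(cars, path, row.names = FALSE)

d <- read_csv(path)                       # lazy = TRUE is the current default
file.rename(path, paste0(path, ".bak"))   # typically FALSE on Windows: the file is still held open

rm(d); gc()                               # dropping the lazy columns and collecting garbage should release the handle
file.rename(path, paste0(path, ".bak"))   # now succeeds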

While I understand the choice to load lazily by default, I would not assume that this is what's happening. I would strongly suggest re-evaluating this decision and switching to non-lazy loading by default. Alternatively, an additional verb could be introduced, e.g. readr::open_delim (with its derivatives readr::open_csv, readr::open_tsv, etc.), that would default to lazy loading. Read, as opposed to open, does not suggest that the file stays open and locked in the background.

Another option would be to show a warning message when using the default setting lazy = NULL (a new default that behaves like TRUE, but emits the warning). Setting it explicitly to TRUE would disable the warning.

@martin-rueegg

Hi @jimhester

Thanks for updating the documentation. I had not seen that in the documentation before; apparently it was already there. The problem is that you encounter the error while writing, not while reading, the file. And one cannot always read the entire documentation to find such an important side effect.

My suggestion would be to either

  • change the default (probably not the best idea),
  • emit a warning when reading with no lazy parameter set explicitly (which could be disabled by passing lazy = TRUE; this would be my favorite), or
  • put the warning prominently at the top of the documentation page (which still does not solve the problem of discovering that information only while writing).

Would you accept a PR for implementing the warning message?
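
As a rough sketch, the warning could look something like this user-side wrapper (read_csv_checked is a hypothetical helper, not part of the readr API):

read_csv_checked <- function(file, ..., lazy = NULL) {
  if (is.null(lazy)) {
    warning(
      "read_csv() defaults to lazy = TRUE and keeps '", file, "' open; ",
      "pass lazy = FALSE if you plan to overwrite or move the file later.",
      call. = FALSE
    )
    lazy <- TRUE
  }
  readr::read_csv(file, ..., lazy = lazy)
}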

@Shians

Shians commented Aug 18, 2021

Agreed with the above; it's a major side effect that arguably breaks the drop-in quality of the read_* functions as replacements for the base and data.table equivalents. I think there's a case to be made for least surprise over maximum performance.
