problems reading from .zip but not from decompressed file #331

Closed
benzipperer opened this issue Apr 29, 2021 · 1 comment

@benzipperer

benzipperer commented Apr 29, 2021

Hi, thank you for the excellent package! I am seeing parsing problems (expected != actual column counts) after reading a .csv file. I've narrowed it down: the problems occur only when reading the original compressed .zip archive, not when reading the manually decompressed .csv file.

library(vroom)

# download zipped data
url <- "https://data.bls.gov/cew/data/files/2017/csv/2017_qtrly_singlefile.zip"
temp_dir <- tempdir()
data_zip <- tempfile(tmpdir = temp_dir, fileext = ".zip")
download.file(url, data_zip)

# decompress data
csv_basename <- unzip(data_zip, list=TRUE)$Name[1]
unzip(data_zip, files = csv_basename, exdir = temp_dir, overwrite=TRUE)
data_csv <- file.path(temp_dir, csv_basename)

# confirm problems with zip file only
vroom_from_zip <- vroom(data_zip)
problems(vroom_from_zip)
# A tibble: 1,058 x 5
     row   col expected   actual     file 
   <int> <int> <chr>      <chr>      <chr>
 1     2    33 42 columns 33 columns ""   
 2     2    33 42 columns 33 columns ""   
 3     2    33 42 columns 33 columns ""   
 4     2    33 42 columns 33 columns ""   
 5     2    33 42 columns 33 columns ""   
 6     2    33 42 columns 33 columns ""   
 7     2    33 42 columns 33 columns ""   
 8     2    33 42 columns 33 columns ""   
 9     2    33 42 columns 33 columns ""   
10     2    33 42 columns 33 columns ""   
# … with 1,048 more rows

vroom_from_csv <- vroom(data_csv)
problems(vroom_from_csv)
# A tibble: 0 x 5
# … with 5 variables: row <int>, col <int>, expected <chr>,
#   actual <chr>, file <chr>

Do you have any insights?

@jimhester
Collaborator

Thank you for opening the issue and for supplying a reproducible example; it is a big help!

This was a recent regression in reading files from connections. It occurred with Windows line endings (\r\n) when the two bytes of the line ending were split across two different connection buffers.

It should now be fixed.
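
To illustrate the class of bug described above, here is a minimal sketch in base R. It is not vroom's actual code: `count_crlf_naive` and `count_crlf_carry` are hypothetical helpers, and the 4-byte chunk size is chosen only so that a `\r\n` pair straddles a boundary. The first function scans each buffer independently and misses the split pair; the second carries a "pending `\r`" flag from one buffer to the next.

```r
# Hypothetical helpers for illustration only; this is not vroom's implementation.
CR <- as.raw(0x0d)
LF <- as.raw(0x0a)

# Count "\r\n" pairs that lie entirely inside one chunk.
count_pairs <- function(chunk) {
  if (length(chunk) < 2L) return(0L)
  i <- seq_len(length(chunk) - 1L)
  sum(chunk[i] == CR & chunk[i + 1L] == LF)
}

# Naive scan: each chunk is examined on its own, so a "\r\n" whose two
# bytes land in different chunks is never seen.
count_crlf_naive <- function(bytes, chunk_size) {
  if (length(bytes) == 0L) return(0L)
  n <- 0L
  for (s in seq(1L, length(bytes), by = chunk_size)) {
    chunk <- bytes[s:min(s + chunk_size - 1L, length(bytes))]
    n <- n + count_pairs(chunk)
  }
  n
}

# Boundary-aware scan: remember whether the previous chunk ended in "\r"
# and complete the pair if the next chunk starts with "\n".
count_crlf_carry <- function(bytes, chunk_size) {
  if (length(bytes) == 0L) return(0L)
  n <- 0L
  pending_cr <- FALSE
  for (s in seq(1L, length(bytes), by = chunk_size)) {
    chunk <- bytes[s:min(s + chunk_size - 1L, length(bytes))]
    if (pending_cr && chunk[1L] == LF) n <- n + 1L
    n <- n + count_pairs(chunk)
    pending_cr <- chunk[length(chunk)] == CR
  }
  n
}

# Two CRLF-terminated lines; with 4-byte chunks the first "\r\n" is split.
bytes <- charToRaw("a,b\r\n1,2\r\n")
count_crlf_naive(bytes, 4L)  # 1: the split line ending is missed
count_crlf_carry(bytes, 4L)  # 2: both line endings are found
```

A reader that misses a line ending in this way would see a wrong number of fields for the affected rows, which is consistent with the mismatched column counts reported in `problems()` above, although the exact mechanism inside vroom may differ.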
