problems reading from .zip but not from decompressed file #331

benzipperer · 2021-04-29T07:06:59Z

Hi, thank you for the excellent package! I am seeing parsing problems (expected vs. actual column counts) after reading a .csv file. I've narrowed it down: the problems only occur when reading the original compressed .zip archive, not when reading the manually decompressed .csv file.

library(vroom)

# download zipped data
url <- "https://data.bls.gov/cew/data/files/2017/csv/2017_qtrly_singlefile.zip"
temp_dir <- tempdir()
data_zip <- tempfile(tmpdir = temp_dir, fileext = ".zip")
download.file(url, data_zip)

# decompress data
csv_basename <- unzip(data_zip, list = TRUE)$Name[1]
unzip(data_zip, files = csv_basename, exdir = temp_dir, overwrite = TRUE)
data_csv <- file.path(temp_dir, csv_basename)

# confirm problems with zip file only
vroom_from_zip <- vroom(data_zip)
problems(vroom_from_zip)
# A tibble: 1,058 x 5
     row   col expected   actual     file 
   <int> <int> <chr>      <chr>      <chr>
 1     2    33 42 columns 33 columns ""   
 2     2    33 42 columns 33 columns ""   
 3     2    33 42 columns 33 columns ""   
 4     2    33 42 columns 33 columns ""   
 5     2    33 42 columns 33 columns ""   
 6     2    33 42 columns 33 columns ""   
 7     2    33 42 columns 33 columns ""   
 8     2    33 42 columns 33 columns ""   
 9     2    33 42 columns 33 columns ""   
10     2    33 42 columns 33 columns ""   
# … with 1,048 more rows

vroom_from_csv <- vroom(data_csv)
problems(vroom_from_csv)
# A tibble: 0 x 5
# … with 5 variables: row <int>, col <int>, expected <chr>,
#   actual <chr>, file <chr>

Do you have any insights?

jimhester · 2021-04-29T18:22:52Z

Thank you for opening the issue and for supplying a reproducible example; it is a big help!

This was a recent regression that occurred when reading files from connections with Windows line endings (\r\n), if the two bytes of the line ending were split across two different connection buffers.

It should now be fixed.
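To illustrate the kind of boundary bug described above, here is a minimal sketch in base R. It is not vroom's actual code: the helper names (`chunk_string`, `count_crlf_naive`, `count_crlf_carry`) and the 4-character chunk size are hypothetical, chosen only so that a `\r\n` pair lands across two chunks. It contrasts a scan that treats each buffer independently with one that remembers a trailing `\r` from the previous buffer.

```r
# Illustrative only: NOT vroom's implementation.
# Split a string into fixed-size chunks, mimicking a connection buffer.
chunk_string <- function(x, size) {
  starts <- seq(1, nchar(x), by = size)
  substring(x, starts, starts + size - 1)
}

# Naive scan: each chunk is handled on its own, so a "\r\n" that is
# cut in half by a chunk boundary is never seen as a line ending.
count_crlf_naive <- function(chunks) {
  sum(vapply(chunks, function(ch) {
    m <- gregexpr("\r\n", ch, fixed = TRUE)[[1]]
    sum(m > 0)  # gregexpr() returns -1 when there is no match
  }, integer(1)))
}

# Boundary-aware scan: remember whether the previous chunk ended in "\r"
# and complete the line ending if the next chunk starts with "\n".
count_crlf_carry <- function(chunks) {
  total <- 0L
  pending_cr <- FALSE
  for (ch in chunks) {
    if (pending_cr && startsWith(ch, "\n")) total <- total + 1L
    m <- gregexpr("\r\n", ch, fixed = TRUE)[[1]]
    total <- total + sum(m > 0)
    pending_cr <- endsWith(ch, "\r")
  }
  total
}

data <- "a,b\r\n1,2\r\n3,4\r\n"   # three CRLF-terminated lines
chunks <- chunk_string(data, 4)   # first chunk ends in "\r", second starts with "\n"

count_crlf_naive(chunks)  # 2: the split line ending is missed
count_crlf_carry(chunks)  # 3: all line endings are found
```

A parser that loses a line ending this way would merge two rows or miscount fields, which is one plausible way to end up with "expected N columns, actual M columns" reports like those shown in the original post.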

jimhester closed this as completed in 5fc54e6 Apr 29, 2021

jimhester mentioned this issue Apr 29, 2021

Memory usage of read_csv_chunked() in conjunction with a gzip compressed file tidyverse/readr#1200

Closed
