How to write a tibble for non-R software to use later? #202

Closed
lcolladotor opened this issue Dec 5, 2016 · 8 comments

Comments

@lcolladotor

Hi,

After @jennybc's talk about tibble today at JHU Biostats, I'm curious whether you have figured out a way to export a tibble (or tbl_df #82) to a file on disk that other languages can use. What do you typically do in these situations? Do you write it out to a JSON file? Ideally, that same file could then be read back into a tibble if someone wanted to.

My particular use case is recount::all_metadata('tcga') which loads the data.frame at https://github.com/leekgroup/recount-website/blob/master/metadata/metadata_clean_tcga.Rdata. That data.frame has 5 list columns and it's currently problematic for us to write it to disk in a tsv format using write.table(). That's why I deleted these 5 columns from the tsv file we have at http://duffel.rail.bio/recount/TCGA/TCGA.tsv.

Best,
Leo

Clean code

library('tibble')
## From ?as_tibble
l <- list(x = 1:500, y = runif(500), z = 500:1)
df <- as_tibble(l)

## Add a list column
df$l <- lapply(1:500, seq_len)

## Convert to a regular data.frame
df2 <- as.data.frame(df)
head(df2)

## Try writing
write.table(df, file = 'tibble.txt', sep = '\t', row.names = FALSE)
write.table(df2, file = 'tibble_data_frame.txt', sep = '\t', row.names = FALSE)

## Try with readr::write_csv
library('readr')
write_csv(df, 'tibble.csv')
write_csv(df2, 'tibble.csv')

## Try via a matrix as recommended at http://stackoverflow.com/questions/24829027/unimplemented-type-list-when-trying-to-write-table
df3 <- as.matrix(df2)
write.table(df3, file = 'tibble_matrix.txt', sep = '\t', row.names = FALSE)

## Not what I wanted since it summarized the list info, just like as.character() does
system('head tibble_matrix.txt')

## Session info
options(width = 120)
devtools::session_info()

Evaluated code

> library('tibble')
> ## From ?as_tibble
> l <- list(x = 1:500, y = runif(500), z = 500:1)
> df <- as_tibble(l)
> 
> ## Add a list column
> df$l <- lapply(1:500, seq_len)
> 
> ## Convert to a regular data.frame
> df2 <- as.data.frame(df)
> head(df2)
  x         y   z                l
1 1 0.9452182 500                1
2 2 0.8506119 499             1, 2
3 3 0.3301469 498          1, 2, 3
4 4 0.4891011 497       1, 2, 3, 4
5 5 0.2724660 496    1, 2, 3, 4, 5
6 6 0.8284255 495 1, 2, 3, 4, 5, 6
> 
> ## Try writing
> write.table(df, file = 'tibble.txt', sep = '\t', row.names = FALSE)
Error in write.table(df, file = "tibble.txt", sep = "\t", row.names = FALSE) : 
  unimplemented type 'list' in 'EncodeElement'
> write.table(df2, file = 'tibble_data_frame.txt', sep = '\t', row.names = FALSE)
Error in write.table(df2, file = "tibble_data_frame.txt", sep = "\t",  : 
  unimplemented type 'list' in 'EncodeElement'
> 
> ## Try with readr::write_csv
> library('readr')
> write_csv(df, 'tibble.csv')
Error in evalq(sys.calls(), <environment>) : 
  Don't know how to handle vector of type list.
> write_csv(df2, 'tibble.csv')
Error in evalq(sys.calls(), <environment>) : 
  Don't know how to handle vector of type list.
> 
> ## Try via a matrix as recommended at http://stackoverflow.com/questions/24829027/unimplemented-type-list-when-trying-to-write-table
> df3 <- as.matrix(df2)
> write.table(df3, file = 'tibble_matrix.txt', sep = '\t', row.names = FALSE)
> 
> ## Not what I wanted since it summarized the list info, just like as.character() does
> system('head tibble_matrix.txt')
"x"	"y"	"z"	"l"
1	0.94521824317053	500	1
2	0.850611877627671	499	1:2
3	0.330146930413321	498	1:3
4	0.489101073006168	497	1:4
5	0.272465996677056	496	1:5
6	0.82842547330074	495	1:6
7	0.596796503523365	494	1:7
8	0.285002102376893	493	1:8
9	0.387933327816427	492	1:9
> 
> ## Session info
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                             
 version  R Under development (unstable) (2016-10-26 r71594)
 system   x86_64, darwin13.4.0                              
 ui       AQUA                                              
 language (EN)                                              
 collate  en_US.UTF-8                                       
 tz       America/New_York                                  
 date     2016-12-05                                        

Packages ---------------------------------------------------------------------------------------------------------------
 package    * version date       source        
 assertthat   0.1     2013-12-06 CRAN (R 3.4.0)
 devtools     1.12.0  2016-06-24 CRAN (R 3.4.0)
 digest       0.6.10  2016-08-02 CRAN (R 3.4.0)
 memoise      1.0.0   2016-01-29 CRAN (R 3.4.0)
 Rcpp         0.12.8  2016-11-17 CRAN (R 3.4.0)
 readr      * 1.0.0   2016-08-03 CRAN (R 3.4.0)
 tibble     * 1.2     2016-08-26 CRAN (R 3.4.0)
 withr        1.0.2   2016-06-20 CRAN (R 3.4.0)
> 
@krlmlr
Member

krlmlr commented Dec 5, 2016

that can be used by other languages

This entirely depends on how this hypothetical "other language" processes nested data frames (or data frames with list columns in general). Which particular language do you have in mind?

For example, for a database that can usually store only atomic values in a column, you would use two tables to store this dataset, linked with a key column.
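The two-table layout described above can be sketched in base R. This is only an illustration under stated assumptions: the column names `id` and `value`, the stand-in data, and the temporary file names are hypothetical and do not come from the thread.

```r
## Small stand-in for the example data: atomic columns plus one list column
main <- data.frame(id = 1:4, y = c(0.25, 0.5, 0.75, 1))
l <- lapply(1:4, seq_len)

## Child table: one row per list element, linked to `main` by the key `id`
child <- data.frame(
    id = rep(main$id, lengths(l)),
    value = unlist(l)
)

## Both tables hold only atomic columns, so write.table() works
main_file <- tempfile(fileext = '.tsv')
child_file <- tempfile(fileext = '.tsv')
write.table(main, file = main_file, sep = '\t', row.names = FALSE)
write.table(child, file = child_file, sep = '\t', row.names = FALSE)

## Read back and rebuild the list column by splitting on the key
main2 <- read.table(main_file, header = TRUE, sep = '\t')
child2 <- read.table(child_file, header = TRUE, sep = '\t')
l2 <- unname(split(child2$value, child2$id))
```

Note that a row whose list element is empty contributes no rows to the child table, so this sketch would lose it on the way back unless the reader merges against the keys in the main table.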

This question is a bit broad, and the tibble issue tracker isn't the best forum. Would you mind asking at StackOverflow, with a bit more detail about your target language?

@krlmlr krlmlr closed this as completed Dec 5, 2016
@lcolladotor
Author

Well, we don't have a target language. That's why we have been using tab separated value text tables.

One way of re-framing my issue would be like this: have you thought of a way to write tibbles to tsv files with some extra markup for list columns, so that the tsv file can later be read back into a data frame with list columns (or however that's implemented outside R)?

For example, I know R sometimes writes list columns with c(...), which in theory would allow a user to parse that information into a list column. See https://support.bioconductor.org/p/83911/ for an example.
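One possible form of such markup, sketched in base R, is to collapse each list element into a delimited string before writing and to split it again after reading. The comma delimiter and the stand-in data are assumptions for illustration; the approach only works when every element is an atomic vector, no value contains the delimiter, and the reader already knows the element type.

```r
## Stand-in data with one list column of integer vectors
df <- data.frame(x = 1:3)
df$l <- lapply(1:3, seq_len)

## Encode each list element as a comma-separated string
flat <- df
flat$l <- vapply(df$l, paste, character(1), collapse = ',')

## The flattened table has only atomic columns, so write.table() works
tsv_file <- tempfile(fileext = '.tsv')
write.table(flat, file = tsv_file, sep = '\t', row.names = FALSE)

## Read back, then parse the strings into a list column again
back <- read.table(tsv_file, header = TRUE, sep = '\t',
    colClasses = c('integer', 'character'))
back$l <- lapply(strsplit(back$l, ',', fixed = TRUE), as.integer)
```

Other languages can read the same file as an ordinary tsv and split the string column on the delimiter, but nothing in the file itself records that the column is a list of integers.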

@krlmlr
Member

krlmlr commented Dec 5, 2016

I think tsv is a particularly poor format for this kind of data. If you're interested in portability and durability, use XML or json; I'm not aware of ready-made readers/writers, but I haven't really looked.

@jennybc
Member

jennybc commented Dec 5, 2016

I think JSON might be the best way to write this out. It is certainly language agnostic.

This is why I raised this issue a while back, but it was a nonstarter:

tidyverse/readr#303

@lcolladotor
Author

Hi again,

Ok, json it is then. You might want to add a quick section at https://github.com/tidyverse/tibble/blob/master/vignettes/tibble.Rmd on writing tibbles to disk. Here's my quick test with the first small example from earlier.

Best,
Leo

Clean code

library('tibble')
## From ?as_tibble
l <- list(x = 1:500, y = runif(500), z = 500:1)
df <- as_tibble(l)

## Add a list column
df$l <- lapply(1:500, seq_len)

## Convert to a json keeping as many digits as possible
library('jsonlite')
json <- toJSON(df, digits = NA)

## Check that it's ok
validate(json)

## Write to file, then read again
write(json, file = 'tibble_to_json.json')
disk <- fromJSON('tibble_to_json.json')

## Convert back to tibble
df2 <- as_tibble(disk)

## Are the tibbles the same?
identical(df, df2)
library('testthat')
expect_equivalent(df, df2)
expect_equal(df, df2)

## Session info
options(width = 120)
devtools::session_info()

Evaluated code

> library('tibble')
> ## From ?as_tibble
> l <- list(x = 1:500, y = runif(500), z = 500:1)
> df <- as_tibble(l)
> 
> ## Add a list column
> df$l <- lapply(1:500, seq_len)
> 
> ## Convert to a json keeping as many digits as possible
> library('jsonlite')
> json <- toJSON(df, digits = NA)
> 
> ## Check that it's ok
> validate(json)
[1] TRUE
> 
> ## Write to file, then read again
> write(json, file = 'tibble_to_json.json')
> disk <- fromJSON('tibble_to_json.json')
> 
> ## Convert back to tibble
> df2 <- as_tibble(disk)
> 
> ## Are the tibbles the same?
> identical(df, df2)
[1] FALSE
> library('testthat')
> expect_equivalent(df, df2)
> expect_equal(df, df2)
> 
> ## Session info
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                             
 version  R Under development (unstable) (2016-10-26 r71594)
 system   x86_64, darwin13.4.0                              
 ui       AQUA                                              
 language (EN)                                              
 collate  en_US.UTF-8                                       
 tz       America/New_York                                  
 date     2016-12-06                                        

Packages ---------------------------------------------------------------------------------------------------------------
 package    * version date       source        
 assertthat   0.1     2013-12-06 CRAN (R 3.4.0)
 crayon       1.3.2   2016-06-28 CRAN (R 3.4.0)
 devtools     1.12.0  2016-06-24 CRAN (R 3.4.0)
 digest       0.6.10  2016-08-02 CRAN (R 3.4.0)
 jsonlite   * 1.1     2016-09-14 CRAN (R 3.4.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.4.0)
 memoise      1.0.0   2016-01-29 CRAN (R 3.4.0)
 R6           2.2.0   2016-10-05 CRAN (R 3.4.0)
 Rcpp         0.12.8  2016-11-17 CRAN (R 3.4.0)
 testthat   * 1.0.2   2016-04-23 CRAN (R 3.4.0)
 tibble     * 1.2     2016-08-26 CRAN (R 3.4.0)
 withr        1.0.2   2016-06-20 CRAN (R 3.4.0)
> 

@krlmlr
Member

krlmlr commented Dec 6, 2016

Interesting. What are the differences after serialization (df and df2 aren't identical according to your code)?
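One way to investigate a question like this is to compare the two data frames column by column. The sketch below uses base R only; `compare_columns` is a hypothetical helper, and the stand-in data merely illustrates two possible kinds of difference (a tiny rounding error and a change of storage type). It is not a claim about what actually differed after the JSON round trip.

```r
## Hypothetical helper: report, per column, the storage type on each side
## and whether the two columns are exactly identical
compare_columns <- function(a, b) {
    stopifnot(identical(names(a), names(b)))
    data.frame(
        column = names(a),
        type_a = vapply(a, typeof, character(1)),
        type_b = vapply(b, typeof, character(1)),
        identical = mapply(identical, a, b),
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

## Stand-in data: `y` differs by a tiny rounding error, `z` by storage type
a <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3), z = 3:1)
b <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3 + 1e-12), z = c(3, 2, 1))
res <- compare_columns(a, b)
print(res)

## all.equal() compares numbers with a tolerance, so it ignores tiny differences
y_close <- all.equal(a$y, b$y)
```

This would be consistent with `identical()` returning FALSE while the tolerance-based `expect_equal()` passes, but the actual cause in the thread's example would need to be checked on the real `df` and `df2`.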

@lcolladotor
Author

lcolladotor commented Dec 6, 2016 via email

@github-actions
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2020