Multiformat driver #103
Yes, feel free to write a design document for this. Make sure that you consider what happens when a driver is reopened:

This is the biggest pain with supporting options with drivers at present. I think that your idea of storing the file with an extension indicating the format for deserialisation is nice. The get/set bits above are at odds with the bits of storr that won't be changed; you will mostly be concerned with the functions:
Awesome! FYI: I have not forgotten about this. I really do intend to write this spec and hopefully implement it. For the last month or so, however, I have been simultaneously trying to put out fires at my day job and prepare to give a
I ran some benchmarks on the RDS driver just for my own understanding. I can see why it makes sense to serialize up front in the RDS driver, but I am not sure this is the best approach for other file formats.

```r
library(digest)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
serialize_to_raw <- function(x) {
  serialize(x, NULL, ascii = FALSE, xdr = TRUE)
}
nrow <- 6e6
ncol <- 20
data <- as.data.frame(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
object_size(data)
#> 960 MB
file <- tempfile()

# serialization
system.time(obj <- serialize_to_raw(data))
#>    user  system elapsed
#>   2.428   0.453   2.886

# storage
system.time(saveRDS(data, file))
#>    user  system elapsed
#> 102.012   1.293 104.799
con <- file(file, "wb")
system.time(writeBin(obj, con))
#>    user  system elapsed
#>   0.432   1.570   2.244
close(con)

# hashing
system.time(digest(file, algo = "xxhash64", file = TRUE))
#>    user  system elapsed
#>   0.250   0.312   0.574
system.time(digest(data, algo = "xxhash64"))
#>    user  system elapsed
#>   2.714   0.524   3.315
system.time(digest(obj, algo = "xxhash64", serialize = FALSE))
#>    user  system elapsed
#>   0.120   0.001   0.122
```

Created on 2019-06-14 by the reprex package (v0.3.0)
An aside: #107 will not totally replace the usefulness of #103, because some formats surpass the speed of writing serialized data with `writeBin()`:

```r
x <- data.frame(x = integer(30e7))
obj <- serialize(x, NULL)
library(fst)
system.time(write_fst(x, tempfile()))
#>    user  system elapsed
#>   0.627   0.000   0.200
system.time(writeBin(obj, tempfile()))
#>    user  system elapsed
#>   0.218   0.741   0.958
```

Created on 2019-06-15 by the reprex package (v0.3.0)
Re #109 (comment), I no longer think we need all the features I had in mind for the multiformat driver. What if we just allow some control over the serialization format in the existing RDS driver (e.g. for Keras models)?
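One hedged way to picture that narrower idea. Everything here is hypothetical: the `serialize_object` and `deserialize_object` arguments are not part of `storr_rds()` today; they only sketch what "some control over the serialization format" could mean.

```r
# Hypothetical extension of the existing RDS driver: user-supplied hooks
# that replace saveRDS()/readRDS() for objects, e.g. to store Keras
# models in their native HDF5 format. These arguments do not exist yet.
st <- storr_rds(
  tempfile(),
  serialize_object   = function(value, path) keras::save_model_hdf5(value, path),
  deserialize_object = function(path) keras::load_model_hdf5(path)
)
```

The appeal of this design is that it reuses all the existing key/hash machinery of the RDS driver and only swaps out the object read/write step.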
Re #77 (comment), I would like to propose a new driver that handles this slightly differently. It requires more work up front, but I think it could allow for more customization and future-proofing in the long run. @richfitz, if you like the idea, please let me know and I will write a more thorough design document.
Initialization

The proposed multiformat driver accepts a custom read/write protocol on initialization. The default format is RDS, and `storr_multiformat()` with an empty `formats` argument should behave like `storr_rds()`. We could store the format protocol in an R script that gets `source()`d when we call `storr_multiformat()` on an existing `storr`. If a multiformat `storr` already exists at the given path, the user should not be allowed to set the `formats` argument.

Storage
`s$set(key, value)` could:

1. Look up the right format for `value` given its S3 class.
2. If `hash` is equal to `"object"` for the given format, serialize and hash `value` in memory.
3. Write the data to a temporary file in `scratch/`.
4. If `hash` is equal to `"file"`, hash the temporary file without having serialized anything.
5. Move the temporary file to `HASH.EXT`, where `EXT` is the file extension we gave in the protocol.

Retrieval
`s$get(key)` could look up the stored file and deserialize it with the `read` function in the protocol.
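To make the proposal concrete, here is a hedged sketch of what a format protocol might look like. Everything below is hypothetical: `storr_multiformat()`, the `formats` argument, and the `class`/`ext`/`hash`/`read`/`write` fields are proposed names for illustration, not an existing storr API.

```r
# Hypothetical protocol entry for data frames stored via fst.
# None of these names exist in storr; they only illustrate the proposal.
fst_format <- list(
  class = "data.frame",  # dispatch s$set() on the S3 class of the value
  ext   = "fst",         # objects land on disk as HASH.fst
  hash  = "file",        # hash the temporary file, not an in-memory serialization
  read  = function(path) fst::read_fst(path),
  write = function(value, path) fst::write_fst(value, path)
)

# Proposed initialization: with no formats, this should behave like storr_rds().
st <- storr_multiformat(tempfile(), formats = list(fst_format))

# Under the proposal, s$set() would pick fst_format for data frames, write
# to scratch/, hash the file, and rename it to HASH.fst; s$get() would
# match the .fst extension and call fst_format$read() to deserialize.
```

The `hash = "file"` option is the point of the benchmarks above: for formats like fst, hashing the written file avoids a redundant in-memory serialization pass.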