Multiformat driver #103

wlandau · 2019-04-27T01:16:52Z

Re #77 (comment), I would like to propose a new driver that handles this slightly differently. It requires more work up front, but I think it could allow for more customization and future-proofing in the long run. @richfitz, if you like the idea, please let me know and I will write a more thorough design document.

Initialization

The proposed multiformat driver accepts a custom read/write protocol on initialization. The default format is RDS, and storr_multiformat() with an empty formats argument should behave like storr_rds().

s <- storr_multiformat(
  path,
  formats = storr_format_protocol(
    storr_format(
      class = c("keras.engine.sequential.Sequential"),
      extension = "keras",
      hash = "object",
      serialize = function(object) {
        keras::serialize_model(object)
      },
      unserialize = function(raw) {
        keras::unserialize_model(raw)
      },
      read = function(filepath = path) {
        readRDS(file = path)
      },
      write = function(object, path) {
        saveRDS(object = object, file = path)
      }
    ),
    storr_format(
      extn = "fst",
      class = "data.frame",
      hash = "file", # Hash the file, not the in-memory data. Avoids serialization.
      read = function(path) {
        # Read in fst format.
      },
      write = function(object, path) {
        # Write in fst format.
      },
    )
  )
)

We could store the format protocol in an R script that gets source()d when we call storr_multiformat() on an existing storr.

path/
├── config/
├───── formats.R
├───── hash_algorithm
├───── mangle_key
├───── version
├── data/
├── keys/
└── scratch/

If a multiformat storr already exists at the given path, the user should not be allowed to set the formats argument.

s <- storr_multiformat(
  path,
  formats = storr_format_protocol(storr_format(...))
)
#> Error: cannot set formats of an existing multiformat storr.

Storage

s$set(key, value) could

Choose the most appropriate format for value given its S3 class.
If hash is equal to "object" for the given format, serialize and hash value in memory.
Save the object to a temporary file in scratch/.
If hash is equal to "file", hash the temporary file without having serialized anything.
Move the file to HASH.EXT, where EXT is the file extension we gave in the protocol.

Retrieval

s$get(key) could

Get the file extension of the data file.
Identify the format in which it was originally saved.
Read the data using the read function in the protocol.

The text was updated successfully, but these errors were encountered:

richfitz · 2019-04-29T06:14:51Z

Yes, feel free to write a design document for this. Make sure that you consider what happens when a driver is reopened:

do all the options need to be provided
do all the options need to agree
will it be possible to migrate a set of options to a different set

This is the biggest pain with supporting options with drivers at present.

I think that your idea of storing the file with an extension indicating the format for deserialisation is nice

The get/set bits above are at odds with the bits of storr that won't be changed - you will mostly be concerned with the functions:

    get_hash = function(key, namespace) {
    },
    set_hash = function(key, namespace, hash) {
    },
    get_object = function(hash) {
    },
    set_object = function(hash, value) {
    },
    exists_hash = function(key, namespace) {
    },
    exists_object = function(hash) {
    },
    del_hash = function(key, namespace) {
    },
    del_object = function(hash) {
    },
    list_hashes = function() {
    },
    list_keys = function(namespace) {
    },
    list_namespaces = function() {
    }

wlandau · 2019-05-11T05:35:45Z

Awesome! FYI: I have not forgotten about this. I really do intend to write this spec and hopefully implement it. For the last month or so, however, I have been simultaneously trying to put out fires at my day job and prepare to give a drake workshop. Hopefully I will have more spare time within the next couple months.

wlandau · 2019-06-14T12:51:04Z

In the RDS driver, set_object() assumes the value is already serialized. Is it possible under the storr framework to avoid in-memory serialization altogether and just hash the file after we save it?

I ran some benchmarks just for my own understanding. I can see why it makes sense to serialize up front in the RDS driver, but I am not sure this is the best approach for other file formats like fst.

library(digest)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
    
serialize_to_raw <- function(x) {
    serialize(x, NULL, ascii = FALSE, xdr = TRUE)
}
    
nrow <- 6e6
ncol <- 20
data <- as.data.frame(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
object_size(data)
#> 960 MB

file <- tempfile()

# serialization
system.time(obj <- serialize_to_raw(data))
#>    user  system elapsed 
#>   2.428   0.453   2.886

# storage
system.time(saveRDS(data, file))
#>    user  system elapsed 
#> 102.012   1.293 104.799
con <- file(file, "wb")
system.time(writeBin(obj, con))
#>    user  system elapsed 
#>   0.432   1.570   2.244
close(con)

# hashing
system.time(digest(file, algo = "xxhash64", file = TRUE))
#>    user  system elapsed 
#>   0.250   0.312   0.574
system.time(digest(data, algo = "xxhash64"))
#>    user  system elapsed 
#>   2.714   0.524   3.315
system.time(digest(obj, algo = "xxhash64", serialize = FALSE))
#>    user  system elapsed 
#>   0.120   0.001   0.122

^{Created on 2019-06-14 by the reprex package (v0.3.0)}

wlandau · 2019-06-16T01:52:15Z

An aside: #107 will not totally replace the usefulness of #103 because some formats surpass the speed of writeBin(). fst is an example.

x <- data.frame(x = integer(30e7))
obj <- serialize(x, NULL)
library(fst)
system.time(write_fst(x, tempfile()))
#>    user  system elapsed 
#>   0.627   0.000   0.200
system.time(writeBin(obj, tempfile()))
#>    user  system elapsed 
#>   0.218   0.741   0.958

^{Created on 2019-06-15 by the reprex package (v0.3.0)}

wlandau · 2019-06-18T11:38:23Z

Re #109 (comment), I no longer think we need all the features I had in mind for the multiformat driver. What if we just allow some control over the serialization format in the existing RDS driver (e.g. for Keras models)?

wlandau mentioned this issue May 26, 2019

Restrict environment of call to only those elements used ropensci/drake#882

Closed

3 tasks

wlandau mentioned this issue Jun 13, 2019

Speed up the cache ropensci/drake#907

Closed

4 tasks

This was referenced Jun 14, 2019

Use writeBin() for large objects? #106

Closed

writeBin() in chunks #107

Open

This was referenced Jun 16, 2019

fst driver #108

Open

fst for arbitrary data storage? fstpackage/fst#201

Closed

Use compress_fst() in the RDS driver? #110

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiformat driver #103

Multiformat driver #103

wlandau commented Apr 27, 2019 •

edited

Loading

richfitz commented Apr 29, 2019

wlandau commented May 11, 2019

wlandau commented Jun 14, 2019

wlandau commented Jun 16, 2019

wlandau commented Jun 18, 2019 •

edited

Loading

Multiformat driver #103

Multiformat driver #103

Comments

wlandau commented Apr 27, 2019 • edited Loading

Initialization

Storage

Retrieval

richfitz commented Apr 29, 2019

wlandau commented May 11, 2019

wlandau commented Jun 14, 2019

wlandau commented Jun 16, 2019

wlandau commented Jun 18, 2019 • edited Loading

wlandau commented Apr 27, 2019 •

edited

Loading

wlandau commented Jun 18, 2019 •

edited

Loading