Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiformat driver #103

Open
wlandau opened this issue Apr 27, 2019 · 5 comments
Open

Multiformat driver #103

wlandau opened this issue Apr 27, 2019 · 5 comments

Comments

@wlandau
Copy link
Contributor

wlandau commented Apr 27, 2019

Re #77 (comment), I would like to propose a new driver that handles this slightly differently. It requires more work up front, but I think it could allow for more customization and future-proofing in the long run. @richfitz, if you like the idea, please let me know and I will write a more thorough design document.

Initialization

The proposed multiformat driver accepts a custom read/write protocol on initialization. The default format is RDS, and storr_multiformat() with an empty formats argument should behave like storr_rds().

s <- storr_multiformat(
  path,
  formats = storr_format_protocol(
    storr_format(
      class = c("keras.engine.sequential.Sequential"),
      extension = "keras",
      hash = "object",
      serialize = function(object) {
        keras::serialize_model(object)
      },
      unserialize = function(raw) {
        keras::unserialize_model(raw)
      },
      read = function(filepath = path) {
        readRDS(file = path)
      },
      write = function(object, path) {
        saveRDS(object = object, file = path)
      }
    ),
    storr_format(
      extn = "fst",
      class = "data.frame",
      hash = "file", # Hash the file, not the in-memory data. Avoids serialization.
      read = function(path) {
        # Read in fst format.
      },
      write = function(object, path) {
        # Write in fst format.
      },
    )
  )
)

We could store the format protocol in an R script that gets source()d when we call storr_multiformat() on an existing storr.

path/
├── config/
├───── formats.R
├───── hash_algorithm
├───── mangle_key
├───── version
├── data/
├── keys/
└── scratch/

If a multiformat storr already exists at the given path, the user should not be allowed to set the formats argument.

s <- storr_multiformat(
  path,
  formats = storr_format_protocol(storr_format(...))
)
#> Error: cannot set formats of an existing multiformat storr.

Storage

s$set(key, value) could

  1. Choose the most appropriate format for value given its S3 class.
  2. If hash is equal to "object" for the given format, serialize and hash value in memory.
  3. Save the object to a temporary file in scratch/.
  4. If hash is equal to "file", hash the temporary file without having serialized anything.
  5. Move the file to HASH.EXT, where EXT is the file extension we gave in the protocol.

Retrieval

s$get(key) could

  1. Get the file extension of the data file.
  2. Identify the format in which it was originally saved.
  3. Read the data using the read function in the protocol.
@richfitz
Copy link
Owner

Yes, feel free to write a design document for this. Make sure that you consider what happens when a driver is reopened:

  • do all the options need to be provided
  • do all the options need to agree
  • will it be possible to migrate a set of options to a different set

This is the biggest pain with supporting options with drivers at present.

I think that your idea of storing the file with an extension indicating the format for deserialisation is nice

The get/set bits above are at odds with the bits of storr that won't be changed - you will mostly be concerned with the functions:

    get_hash = function(key, namespace) {
    },
    set_hash = function(key, namespace, hash) {
    },
    get_object = function(hash) {
    },
    set_object = function(hash, value) {
    },
    exists_hash = function(key, namespace) {
    },
    exists_object = function(hash) {
    },
    del_hash = function(key, namespace) {
    },
    del_object = function(hash) {
    },
    list_hashes = function() {
    },
    list_keys = function(namespace) {
    },
    list_namespaces = function() {
    }

@wlandau
Copy link
Contributor Author

wlandau commented May 11, 2019

Awesome! FYI: I have not forgotten about this. I really do intend to write this spec and hopefully implement it. For the last month or so, however, I have been simultaneously trying to put out fires at my day job and prepare to give a drake workshop. Hopefully I will have more spare time within the next couple months.

@wlandau
Copy link
Contributor Author

wlandau commented Jun 14, 2019

In the RDS driver, set_object() assumes the value is already serialized. Is it possible under the storr framework to avoid in-memory serialization altogether and just hash the file after we save it?

I ran some benchmarks just for my own understanding. I can see why it makes sense to serialize up front in the RDS driver, but I am not sure this is the best approach for other file formats like fst.

library(digest)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
    
serialize_to_raw <- function(x) {
    serialize(x, NULL, ascii = FALSE, xdr = TRUE)
}
    
nrow <- 6e6
ncol <- 20
data <- as.data.frame(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
object_size(data)
#> 960 MB

file <- tempfile()

# serialization
system.time(obj <- serialize_to_raw(data))
#>    user  system elapsed 
#>   2.428   0.453   2.886

# storage
system.time(saveRDS(data, file))
#>    user  system elapsed 
#> 102.012   1.293 104.799
con <- file(file, "wb")
system.time(writeBin(obj, con))
#>    user  system elapsed 
#>   0.432   1.570   2.244
close(con)

# hashing
system.time(digest(file, algo = "xxhash64", file = TRUE))
#>    user  system elapsed 
#>   0.250   0.312   0.574
system.time(digest(data, algo = "xxhash64"))
#>    user  system elapsed 
#>   2.714   0.524   3.315
system.time(digest(obj, algo = "xxhash64", serialize = FALSE))
#>    user  system elapsed 
#>   0.120   0.001   0.122

Created on 2019-06-14 by the reprex package (v0.3.0)

@wlandau
Copy link
Contributor Author

wlandau commented Jun 16, 2019

An aside: #107 will not totally replace the usefulness of #103 because some formats surpass the speed of writeBin(). fst is an example.

x <- data.frame(x = integer(30e7))
obj <- serialize(x, NULL)
library(fst)
system.time(write_fst(x, tempfile()))
#>    user  system elapsed 
#>   0.627   0.000   0.200
system.time(writeBin(obj, tempfile()))
#>    user  system elapsed 
#>   0.218   0.741   0.958

Created on 2019-06-15 by the reprex package (v0.3.0)

@wlandau
Copy link
Contributor Author

wlandau commented Jun 18, 2019

Re #109 (comment), I no longer think we need all the features I had in mind for the multiformat driver. What if we just allow some control over the serialization format in the existing RDS driver (e.g. for Keras models)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants