Speed up the cache #907
This is a high-priority issue, and I expected it to come up at some point. I have some initial thoughts, and I will dive into your code when I have more time. Near the end of the deep learning chapter of the manual, I have an example of the file_out()/file_in() workaround. And a smaller standalone reprex:

library(drake)
plan <- drake_plan(
x = write_stuff(file_out("large_dataset.fst")),
y = read_stuff(file_in("large_dataset.fst"))
)
config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-06-13 by the reprex package (v0.3.0)

In both cases, drake tracks the files as formal dependencies in the graph. But if people like you have to rely on file_in()/file_out() just to get acceptable caching performance, that is a problem worth fixing here.
Hi Will! Thanks for the fast response, as usual. I think this workaround will suffice for my immediate needs.

I am not very familiar with the storr package, but I was curious whether it could be made more file-format agnostic: let the user specify the format of the cached object through a flag on a call, and let the function find an appropriate library to do the saving. That way the package would be more flexible and could adapt to changes in the ecosystem. Off the top of my head, the last few years saw at least two very fast read/write implementations for R objects, the fst and feather packages, which could be valuable additions to storr. Leaving the saving as just a wrapper around a call to another library would make it very easy to support new file formats, and faster implementations of existing formats could be adopted with much less overhead.

This could be implemented not at the level of the overall cache but for individual objects. That way, files that have to be shared with users on other platforms would not need to be managed manually; the cache would handle them directly. In other words, you would spend less time managing names and read/write calls, and less time worrying about whether the files are up to date. As a beneficial side effect, you wouldn't have to build tools to "guess" the best file format for an object: you just set a default, and users can change it if they feel like it. More flexibility with (apparently) less work.

I am not sure how difficult it would be to do something along those lines, or whether this kind of implementation would be better suited to storr or to drake, though.
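As a rough illustration of this idea, here is a hypothetical sketch (save_object() and its format flag are invented for this example; neither storr nor drake had such an interface at the time):

library(fst)

# Hypothetical dispatcher: a user-facing format flag picks the backend,
# and adding a new format is just another switch() branch.
save_object <- function(value, path, format = c("rds", "fst")) {
  format <- match.arg(format)
  switch(format,
    rds = saveRDS(value, path),
    fst = write_fst(value, path)  # fst handles data frames only
  )
}

# Same call, different backends.
save_object(mtcars, tempfile(fileext = ".fst"), format = "fst")
save_object(letters, tempfile(fileext = ".rds"), format = "rds")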
I had some problems using file_out(). Basically, even after running the function, the target file would appear as "missing", making the cache restart the plan from that point onwards. If I remove the file_out() call, everything runs as expected. Is this the expected behavior?
I do have plans for this: richfitz/storr#103. It seems most natural to set the format at the class/type level. Also, inspired by advice from @eddelbuettel here, I am attempting to leverage fst:

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#> user system elapsed
#> 0.362 0.019 0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#> user system elapsed
#> 0.314 1.340 1.689

...but there are some roadblocks with big data / long vectors:

library(fst)
x <- data.frame(x = raw(2^32))
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
x <- list(x = raw(2^32))
as.data.frame(x)
#> Warning in attributes(.Data) <- c(attributes(.Data), attrib): NAs
#> introduced by coercion to integer range
#> Error in if (mirn && nrows[i] > 0L) {: missing value where TRUE/FALSE needed
class(x) <- "data.frame"
file <- tempfile()
write_fst(x, file) # No error here...
# read_fst(file) # ...but I get a segfault here.

Created on 2019-06-16 by the reprex package (v0.3.0)
No, this is not expected. Would you open another issue and post a reprex?
Great! There seems to be a considerable speedup using fst to compress files, and saving the rds after compression, as in richfitz/storr#110, appears to save considerable trouble.

I will create a reprex and post it as a new issue ASAP.
As I just learned, I totally inundated @richfitz this week (sorry about that), so richfitz/storr#111 could take a while. For now, you can try creating a storr_rds() cache with compression disabled and passing it to make():

library(drake)
cache <- storr::storr_rds(tempfile(), compress = FALSE)
plan <- drake_plan(x = 1)
make(plan, cache = cache)
#> target x
readd(x, cache = cache)
#> [1] 1

Created on 2019-06-18 by the reprex package (v0.3.0)
To do:
Hi Will, just a quick update: I tried recreating the file_out() issue for a reprex.
From the benchmarks at richfitz/storr#111, it looks like compress = FALSE speeds up the cache considerably. To try it out with drake:

library(drake)
library(storr)
load_mtcars_example()
cache <- storr_rds(tempfile(), compress = FALSE)
make(my_plan, cache = cache)
Hi Will, sorry for the late response. I finally had the time to run some benchmarks.
Using drake 7.3.0
Using drake 7.5.2
So, it appears that there is still significant overhead. I did not try enabling compression inside drake, because the overhead would be even greater.
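For reference, a minimal sketch of this kind of benchmark (the one-target plan here is a stand-in; the timings above came from the real project pipeline):

library(drake)
library(storr)

# Stand-in plan; substitute the real project plan to reproduce the numbers.
plan <- drake_plan(big = data.frame(x = runif(1e6)))

cache <- storr_rds(tempfile(), compress = FALSE)  # uncompressed rds cache
system.time(make(plan, cache = cache))            # time a full build against a fresh cache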
Thank you for running benchmarks in a practical scenario; this is very useful! What happens if you install wlandau/storr@deea50d and try again?
Not sure if I installed it correctly, but I did:

library(devtools)
devtools::install_github("wlandau/storr", ref = "110")

The version with lz4 compression took 15.33 min and created an object 3.2 GB in size.
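If it helps to verify an install like this, devtools-style installs record the source ref and commit in the package DESCRIPTION (these fields are only present for GitHub installs):

# Check which storr actually got installed.
desc <- utils::packageDescription("storr")
desc$GithubRef   # requested ref, e.g. "110"
desc$GithubSHA1  # exact commit that was installed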
Awesome, thanks! richfitz/storr#111 seems to get us halfway there. To fully achieve the efficiency of fst, though, something more specialized is still needed.
I now consider this issue solved via #977. Now, all you need for large data frames is format = "fst" in target().
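A minimal usage sketch of that format argument (munge() is a placeholder for the real data-prep step):

library(drake)

munge <- function() data.frame(x = runif(1e6))  # placeholder: returns a big data frame

plan <- drake_plan(
  large_data = target(
    munge(),
    format = "fst"  # serialize this target with fst instead of rds
  )
)

make(plan)
readd(large_data)  # reads transparently from the fst-backed cache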
Prework
I have read and agree to drake's code of conduct.

Description
Following #891, I am now able to produce several datasets to be fed to a classifier algorithm, logging their parameters with MLflow.

However, I am now having trouble with drake's caching system.

I'm performing dimensionality reduction using algorithms such as t-SNE and word2vec with different parameters. Each run produces a different dataset that will be used with a different classifier.

Since the datasets are huge (~6 million rows), drake takes a long time to save them to the cache, and it does so without compression. As a result, it takes about 5 min to save each rds file, and each file is about 1.1 GB.

If, instead, I use the fst library with maximum compression outside drake, I can store each file in about 13 s at roughly 300 MB.
So, I'm divided on how to address this issue.
I could change the calls in the targets so they do not return the datasets (so drake would not store them in the cache) and instead save them directly to disk with fst inside the function.

The problem, however, is that I will use those same datasets in the next step of the plan, passing them as arguments to the classifiers, so it is not clear how I would reference them. I could just make the classifiers read the files from disk directly, but then the plan would not track those read/write operations.

On the other hand, I could use parallelism with the native rds cache to speed things up. That, however, would not reduce the size of the files, and it seems like a waste of resources.

Finally, I could make the saving and reading transparent to drake and address the datasets in the plan through some kind of referencing (sketched below). It took me about 20 s per object.
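A sketch of that referencing idea, with write_stuff() and train_classifier() as placeholders: the target stores only the file path in the cache, and downstream targets read the file back with fst.

library(drake)
library(fst)

write_stuff <- function(path) {
  write_fst(data.frame(x = runif(10)), path, compress = 100)
  path  # return the path so downstream targets can reference it
}
train_classifier <- function(dataset) nrow(dataset)  # stand-in model fit

plan <- drake_plan(
  path  = write_stuff("large_dataset.fst"),  # cache stores only a short string
  model = train_classifier(read_fst(path))
)
make(plan)

Note that, as discussed above, drake tracks only the returned path string here, not the file contents.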
Reproducible example
This would be the basic plan, using the drake cache, but it is too slow:
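A minimal sketch of that kind of plan (reduce_dims() is a placeholder): the target returns the whole data frame, so drake serializes all of it to the cache.

library(drake)

reduce_dims <- function() data.frame(x = runif(1e6))  # placeholder reduction step

plan <- drake_plan(
  dataset = reduce_dims(),   # the returned value is stored in the cache
  model   = nrow(dataset)    # stand-in for the classifier step
)
make(plan)  # saving `dataset` to the rds cache is the slow part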
Using file_in() and file_out() gives a strange graph (max_expand set to aid visualization).

Finally, the best solution so far (it doesn't make use of file_in() or file_out(), though) is the referencing approach sketched in the Description above.

Benchmarks
A similarly sized dataset would be:
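For instance, a stand-in consistent with the ~6 million rows described above (about 1 GB in memory):

# Roughly 6 million rows of numeric features.
dataset <- data.frame(matrix(runif(6e6 * 20), ncol = 20))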
Timings:
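A sketch of the comparison being timed, using the dataset defined just above (the Description reports ~5 min for the rds save inside drake versus ~13 s for fst at maximum compression):

library(fst)

# Compare base rds serialization with fst at maximum compression.
system.time(saveRDS(dataset, file.path(tempdir(), "dataset.rds")))
system.time(write_fst(dataset, file.path(tempdir(), "dataset.fst"), compress = 100))

# Resulting file sizes, in bytes.
file.size(file.path(tempdir(), "dataset.rds"))
file.size(file.path(tempdir(), "dataset.fst"))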