input files that never change #869
Great question. I do not currently work with data that large. Let's definitely keep in touch about performance.
If your input files really are slowing you down, the simplest thing to do is just omit file_in() for them. Alternatively, you can keep the declarations and use a custom trigger so a target never rebuilds, even when its dependencies change:

library(drake)
plan <- drake_plan(
y = target(
x,
trigger = trigger(condition = FALSE, mode = "blacklist")
)
)
x <- "a"
make(plan)
#> target y
make(plan)
#> All targets are already up to date.
x <- "b"
make(plan)
#> All targets are already up to date.

Created on 2019-05-14 by the reprex package (v0.2.1)
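To illustrate the first option (simply not declaring the files), here is a minimal sketch; read_my_data() is a hypothetical reader standing in for whatever you actually use. Without file_in(), drake treats the path as an ordinary string, so it never hashes the file and never rebuilds the target because of it.

library(drake)

# Hypothetical reader; any function that takes a path works the same way.
read_my_data <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

plan <- drake_plan(
  # No file_in() wrapper: "data/huge.csv" is just a string to drake,
  # so the file is never hashed and never triggers a rebuild.
  raw = read_my_data("data/huge.csv"),
  n_rows = nrow(raw)
)

make(plan)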
Also, I should mention that I use …
Thanks, that's a really great solution. Unfortunately all the data and code are confidential, but I will share what I can. I wonder if it's the fact that the files are on Windows network folders that leads to the slow loading. I will investigate.
I guess you would not want the hidden .drake/ cache folder on one of those network drives either.
Ah, that probably explains it then. Unfortunately I don't have any control over the setup and cannot do much about it. I will look into how exactly the server accesses the data, but I think everything (including the cache) is stored over network connections.
Well, that's unfortunate. If you send me a log file, I can still see if there is something we can do to speed things up. A flame graph of make() would be even better. By the way, how many targets do you have? A swarm of small targets could be just as expensive as a few large ones in your case.
Will do when I get everything to run :). I only have access to network storage; the internal storage of the server gets wiped regularly. So one option would be to save the cache locally and then move it to the network folder once the run is finished. Do you think that could help?
Yes, I think caching locally and uploading afterwards could help. The uploading process could be tricky, though.
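A rough sketch of that workflow, assuming drake's new_cache(), drake_cache(), and the cache argument of make(); the paths are placeholders and plan is the plan from above.

library(drake)

# Build against a cache on fast local disk (placeholder path).
local_cache <- new_cache(path = "C:/drake-local/.drake")
make(plan, cache = local_cache)

# Once the run is finished, copy the whole cache folder to the network
# share (file.copy() here; robocopy from Windows CMD would also work).
file.copy("C:/drake-local/.drake", "//server/share/project", recursive = TRUE)

# Later sessions can point at the uploaded copy of the cache.
network_cache <- drake_cache(path = "//server/share/project/.drake")
readd(y, cache = network_cache)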
Hehe, if I had access to such fancy tools I would probably not be having this issue to start with. The data is proprietary, and I can only access it through a Windows remote desktop connection where I can run R files. Some access to the Windows CMD is available, so I could probably run …
Glad to hear you are at least somewhat satisfied. I still think it is worth checking out; 30 minutes just to check targets seems very long. If you can do so without divulging confidential information, please feel free to upload a console log file as an attachment in this issue thread. If not, you can email it to me. Either way, you might want to scrub it first: the cache log by itself may contain file paths you do not want to share.
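For reference, the kind of log being requested here comes from the console_log_file argument of make(), the same argument used in the benchmarks further down; a minimal call might look like:

# Write everything make() would print to a file you can scrub and attach.
make(plan, console_log_file = "drake.log")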
Sure thing! There is a lot of stuff in there from the verbose output of the functions being called, which is probably not relevant.
Thanks for sharing. It is helpful to know that most of the time is spent in individual … Next time, you could consider running … When it comes to storing targets, I would say you are better off copying the data to your local machine, running make() there, and then uploading the cache to the network folder once the run is done.
Benchmarks

I wanted to double check how much time drake spends hashing a large file_in() file, and whether it hashes the file again on a second make().
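The thread does not preserve how the test file was created. One possible way to generate a comparably large CSV (the size below is a placeholder; increase n for a multi-gigabyte file) is:

# Create a large CSV of random numbers to benchmark file hashing.
n <- 1e7  # placeholder size
write.csv(data.frame(x = runif(n), y = runif(n)), "large.csv", row.names = FALSE)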
Next, I fed the file to drake:

library(drake)
clean(destroy = TRUE)
plan <- drake_plan(x = file_in("large.csv"))
make(plan, console_log_file = "log1.txt")
make(plan, console_log_file = "log2.txt") The first
The second make() finished in only about 5 seconds.
Sure enough, hashing the large file dominates the runtime. To see where the time goes in more detail, I profiled make() and rendered a flame graph:

library(drake)
plan <- drake_plan(x = file_in("large.csv"))
rprof_file <- tempfile()
Rprof(filename = rprof_file)
make(plan)
Rprof(NULL)
data <- profile::read_rprof(rprof_file)
profile::write_pprof(data, "prof.proto")
system2(jointprof::find_pprof(), c("-http", "0.0.0.0:8080", "prof.proto"))

I will open a new issue.

EDIT: I really do think the file took ~70s to hash the first time and then ~5s thereafter. I was on an HPC system with mounted drives, so some file system caching was probably happening behind the scenes.
@adamaltmejd, TL;DR: apparently at some point, …
That's fantastic. Another limitation is that I cannot update packages myself or install from GitHub, but as soon as it's up on CRAN I'll ask the server manager to make the update. Thank you for really digging into this!
Absolutely! Let's get this stuff to run fast.
Well shucks, I just sent 7.3.0 to CRAN without this fix. Let's check back in about a month when 7.4.0 goes out.
I'm working with a large project (~100 GB of data spread over lots of text files). These input files are stored read-only and never change. What is the recommended way of dealing with such data? Right now I'm registering everything with file_in(), but then each make() takes a while just to start, I presume because it's hashing all the files. Should I just skip declaring these input files?