
input files that never change #869

Closed
adamaltmejd opened this issue May 14, 2019 · 16 comments

Comments

@adamaltmejd

I'm working with a large project (~100 GB of data spread over lots of text files). These input files are stored read-only and never change. What is the recommended way of dealing with such data? Right now I'm registering everything with file_in(), but then each make() takes a while just to start, I presume because it's hashing all the files. Should I just skip declaring these input files?

@wlandau
Member

wlandau commented May 14, 2019

Great question. I do not currently work with data that large. Let's definitely keep in touch about performance.

drake uses file modification time stamps to check whether the input files are even worth rehashing (#4), so file_in() should theoretically not slow you down. For the next make(), you can set console_log_file = "your.log" to help locate the bottleneck. All the log messages are time-stamped. If you share the log file (and maybe your project too?), that would help me help you.

If your input files really are slowing you down, the simplest thing to do is just omit file_in(). However, that would invalidate existing targets, which is frustrating to do mid-project. To avoid rerunning things from scratch, you can use a custom trigger to ensure the target only builds when it is missing from the cache.

library(drake)
plan <- drake_plan(
  y = target(
    x,
    trigger = trigger(condition = FALSE, mode = "blacklist")
  )
)
x <- "a"
make(plan)
#> target y
make(plan)
#> All targets are already up to date.
x <- "b"
make(plan)
#> All targets are already up to date.

Created on 2019-05-14 by the reprex package (v0.2.1)

@wlandau wlandau closed this as completed May 14, 2019
@wlandau
Member

wlandau commented May 14, 2019

Also, I should mention that I use jointprof, profile, and pprof to profile drake and get nice flame graphs. Example: https://github.com/wlandau/drake-examples/tree/master/overhead.

@adamaltmejd
Author

Thanks, that's a really great solution. Unfortunately, all the data and code are confidential, but I will share what I can.

I wonder if the slow loading has something to do with the files being on Windows network folders. I will investigate.

@wlandau
Member

wlandau commented May 15, 2019

I guess you would not want the hidden .drake/ cache or those huge data files to live on your own hard drive. That is most likely for the best. In my experience, though, if the cache is on a Windows network folder and you are running drake locally, that's a huge performance penalty even for small workflows. If that network folder is a mounted drive on a server or cluster, can you run drake on the remote machine itself?

@adamaltmejd
Author

Ah, that probably explains it then. Unfortunately, I don't have any control over the setup and cannot do much about it. I will look into exactly how the server accesses the data, but I think everything (including the cache) is stored over network connections, unfortunately.

@wlandau
Member

wlandau commented May 15, 2019

Well, that's unfortunate. If you send me a log file, I can still see if there is something we can do to speed things up. A flame graph of make() would be even better.

By the way, how many targets do you have? A swarm of small targets could be just as expensive as a few large ones in your case.

@adamaltmejd
Author

Will do when I get everything to run :). I only have access to network storage; the internal storage of the server gets wiped regularly. So one option would be to save the cache locally and then move it to the network folder once the run is finished. Do you think that could help?

@wlandau
Member

wlandau commented May 17, 2019

Yes, I think caching locally and uploading afterwards could help. The uploading process could be tricky, though. drake's storr::storr_rds() cache has an intricate file system with lots of tiny files. If you use Dropbox or a similar service, you might run into #198, richfitz/storr#55, or richfitz/storr#92. Also, I recommend rsync over scp.
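
For what it's worth, here is a rough sketch of that pattern, assuming a local scratch directory and an rsync-capable shell; the paths are placeholders and plan is your existing drake plan:

library(drake)
# Build against a cache on fast local storage (hypothetical path).
local_cache <- new_cache("/scratch/myproject/.drake")
make(plan, cache = local_cache)
# After the run, mirror the cache to the network folder.
# rsync only transfers files that changed since the last upload.
system2("rsync", c("-az", "/scratch/myproject/.drake/", "/network/share/myproject/.drake/"))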

@adamaltmejd
Author

Hehe, if I had access to such fancy tools, I would probably not be having this issue to start with. The data is proprietary, and I can only access it through a Windows remote desktop connection where I can run R files. Some access to the Windows CMD is available, so I could probably run system("copy") or something like that to transfer files. Where could I send you copies of the log files? Running make() on the full code takes 12 hours 10 minutes on the first run (after clean(destroy = TRUE)) and then about 30 minutes when rerun without changing anything. Fairly OK performance after all, I'd say :).
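
If only cmd.exe is available, something along these lines might work for moving the cache from R; robocopy ships with Windows, and the paths here are made up:

# Mirror the local cache directory to the network share (placeholder paths).
# /E copies subdirectories, including empty ones.
# Note: robocopy uses exit codes 0-7 for success, so a nonzero status is not necessarily an error.
system2("robocopy", c("C:/scratch/.drake", "\\\\server\\share\\myproject\\.drake", "/E"))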

@wlandau
Member

wlandau commented May 20, 2019

Glad to hear you are kind of satisfied. I still think it is worth checking out. 30 minutes just to check targets seems very long.

If you can do so without divulging confidential information, please feel free to upload a console log file as an attachment in this issue thread. If not, you can email it to me here. Either way, you might want to scrub it first. The cache log by itself may have file paths you do not want to share.

@adamaltmejd
Author

Sure thing! There is a lot of verbose output in there from the functions being called, which is probably not relevant.

drake_log.txt
full_make_drake_log.txt

@wlandau
Member

wlandau commented May 22, 2019

Thanks for sharing. It is helpful to know that most of the time is spent in individual import and target steps. It really does seem like the network is causing most of the slowdown. One thing, however, troubles me. The 242 import files took around 13 minutes to process in full_make_drake_log.txt and a full 27 minutes in drake_log.txt. In the former case, drake is hashing all the import files (assuming they were never hashed before for the current .drake/ cache). But in the latter case, drake should only be reading time stamps, which should theoretically be a whole lot faster. drake does have unit tests that confirm it is using time stamps correctly, but now that I am seeing these logs, I will do some local informal testing just to be sure. Is there any chance the network was unusually slow on 5/20?

Next time, you could consider running make(skip_imports = TRUE), but that would skip your functions too.

When it comes to storing targets, drake is wasting a lot of time as well, presumably sending data over the network.

time  raw_lisa_dt  510.1s (~8.5 minutes) | 2273.28s (~37.89 minutes)  (exec | total)
time  lisa_dt  67.89s (~1.13 minutes) | 2090.05s (~34.83 minutes)  (exec | total)
time  applications  23547.31s (~6.54 hours) | 25848.51s (~7.18 hours)  (exec | total)

I would say you are better off copying the data to your local machine, running make() however many times you need as you work on the project, and then uploading the results again when you are done. Better yet, you could export some key findings into a small subset of file_out() files and just upload those.
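
To make that last idea concrete, here is a rough sketch (the summarizing function and output path are hypothetical) of a target that writes its key results to a file_out() path, so only that small file needs to travel back over the network:

plan <- drake_plan(
  key_findings = {
    # summarize_applications() is a made-up helper for your analysis.
    results <- summarize_applications(applications)
    # file_out() tells drake to track the exported file.
    saveRDS(results, file_out("results/key_findings.rds"))
  }
)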

@wlandau
Member

wlandau commented May 22, 2019

Benchmarks

I wanted to double check that drake was dealing with input files correctly, so I created an 11 GB test file (by concatenating this large file over and over again). I put the file on a mounted drive on my company's HPC cluster and shelled into the login node. The file takes over a minute just to hash.

> system.time(digest::digest("large.csv", algo = "xxhash64", file = TRUE))
   user  system elapsed 
  1.989   5.196  81.702

Next, I fed the file to drake.

library(drake)
clean(destroy = TRUE)
plan <- drake_plan(x = file_in("large.csv"))
make(plan, console_log_file = "log1.txt")
make(plan, console_log_file = "log2.txt")

The first make() took just over 71 seconds to process the file. Here is log1.txt.

2019-05-21 23:17:10.688549 | cache ---
2019-05-21 23:17:10.811093 | analyze environment
2019-05-21 23:17:10.816621 | analyze tmp
2019-05-21 23:17:10.819698 | analyze ld
2019-05-21 23:17:10.823739 | analyze wd
2019-05-21 23:17:10.826610 | analyze td
2019-05-21 23:17:10.830344 | analyze spell_check_ignore
2019-05-21 23:17:10.834010 | analyze plan
2019-05-21 23:17:10.836789 | analyze dl
2019-05-21 23:17:10.857070 | analyze x
2019-05-21 23:17:10.872234 | connect tmp
2019-05-21 23:17:10.874992 | connect ld
2019-05-21 23:17:10.877719 | connect wd
2019-05-21 23:17:10.880207 | connect td
2019-05-21 23:17:10.882754 | connect spell_check_ignore
2019-05-21 23:17:10.885758 | connect plan
2019-05-21 23:17:10.888604 | connect dl
2019-05-21 23:17:10.891484 | connect x
2019-05-21 23:17:10.895317 | connect output files
2019-05-21 23:17:10.915471 | finalize graph edges
2019-05-21 23:17:10.918083 | create igraph
2019-05-21 23:17:10.922563 | trim neighborhoods
2019-05-21 23:17:10.929245 | add igraph attributes
2019-05-21 23:17:10.933052 | finalize igraph
2019-05-21 23:17:11.156131 | import file large.csv
2019-05-21 23:18:22.486587 | find more outdated targets
2019-05-21 23:18:22.492513 | find downstream outdated targets
2019-05-21 23:18:22.498135 | isolate oudated targets
2019-05-21 23:18:22.520895 | target x
2019-05-21 23:18:22.529389 | store x
2019-05-21 23:18:23.210537 | time  x  0s | 0.04s  (exec | total)
2019-05-21 23:18:23.223595 | done

The second make() took far less time, but still surprisingly long. It should not take 5 seconds just to get a time stamp. Here is log2.txt.

2019-05-21 23:18:23.229071 | cache ---
2019-05-21 23:18:23.253930 | analyze environment
2019-05-21 23:18:23.470061 | import file large.csv
2019-05-21 23:18:28.219654 | find more outdated targets
2019-05-21 23:18:28.240110 | find more outdated targets
2019-05-21 23:18:28.244144 | find downstream outdated targets
2019-05-21 23:18:28.247713 | isolate oudated targets
2019-05-21 23:18:28.251207 | All targets are already up to date.
2019-05-21 23:18:28.257235 | done

Sure enough, drake is still hashing something (but if it were hashing the whole file, it would take more than just 5 seconds).

library(drake)
plan <- drake_plan(x = file_in("large.csv"))
rprof_file <- tempfile()
Rprof(filename = rprof_file)
make(plan)
Rprof(NULL)
data <- profile::read_rprof(rprof_file)
profile::write_pprof(data, "prof.proto")
system2(jointprof::find_pprof(), c("-http", "0.0.0.0:8080", "prof.proto"))

(Screenshot: pprof flame graph of the make() call profiled above.)

I will open a new issue.

EDIT

I really do think the file took ~70s to hash the first time and then ~5s thereafter. I was on an HPC system with mounted drives, so some file system caching was probably happening behind the scenes.

@wlandau
Member

wlandau commented May 22, 2019

@adamaltmejd, TL;DR: apparently at some point, drake broke #4. As of 3a696ea, the development version should work much faster for your use case: imported files are now processed much more quickly if they already have hashes in the cache. Please let me know how it goes.
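
For reference, a typical way to get the development version (assuming you can install packages from GitHub) is:

# Install the development version of drake from GitHub.
install.packages("remotes")
remotes::install_github("ropensci/drake")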

@adamaltmejd
Author

That's fantastic. Another limitation is that I cannot update packages myself or install from GitHub, but as soon as it's up on CRAN I'll ask the server manager to make the update. Thank you for really digging into this!

@wlandau
Member

wlandau commented May 22, 2019

Absolutely! Let's get this stuff to run fast.

Another limitation is that I cannot update packages myself or install from GitHub.

Well shucks, I just sent 7.3.0 to CRAN without this fix. Let's check back in about a month when 7.4.0 goes out.

wlandau-lilly added a commit that referenced this issue May 22, 2019
Relevant: #869