Rehash files when and only when necessary. Use timestamps to decide. #4

wlandau-lilly · 2017-02-27T14:03:18Z

I use fingerprints (hashes) to detect when the user's external files change. Fingerprints are expensive, so I use file modification times to judge whether fingerprinting is even worth the expense. The trouble is that file.mtime() is egregiously imprecise on Windows and Mac, so true updates to files could potentially be missed. My current workaround is to force a rehash when

the file is less than 100 Kb, or
the file was just built by drake (not imported)

This rule mostly covers it, but manual changes to any medium-to-large file may be ignored if drake looks at that file in the same second. I would really like to just get more precise timestamps. With even millisecond precision, I could just wait until the next increment in mtime before importing or building the next file.

Sometime in the future, I may be able to assume all file systems support high-resolution times. Apparently, R 3.3.3 will have this solved for Windows. But until that happens on all platforms, I do not think I can solve this issue.

The text was updated successfully, but these errors were encountered:

wlandau-lilly · 2017-03-05T16:35:17Z

For 2.0.0, I got rid of the force_rehash argument. It's esoteric for most users, and it shouldn't really be necessary.

AlexAxthelm · 2017-08-08T15:58:54Z

Would using xxhash64 rather than md5 as the hashing algorithm be sufficient? It's noticeably faster on files, and not slower on objects in memory.

It's built into digest already. There would have to be some logic to identify hashes that are already md5, to prevent rebuilding, but that same function could update the md5 hash to use xxhash64 to make things faster on the next time around.

I ran a test this morning comparing the digest algorithms:

library(tidyverse)
library(progress)
library(dplyr)
library(stringr)
library(microbenchmark)
library(digest)

results <- NULL
algo_list <- c("md5", "sha1",  "xxhash32", "xxhash64", "murmur32")
n_loops <- 25
pb <- progress::progress_bar$new(
  format = ":algo :size [:bar] :percent eta: :eta",
  total = (2^n_loops) * length(algo_list)
  )
tf <- tempfile(fileext = ".RDS")

for (i in seq(from = 1, to = n_loops, by = 1)){
  rn <- rnorm(2^i)
  saveRDS(rn, tf)
  for (algo in algo_list){
    pb$tick(len = 2^i, tokens = list(algo = stringr::str_pad(algo, 9), size = stringr::str_pad(i, 3)))
    mbx <- microbenchmark::microbenchmark(
      object = digest::digest(rn, algo = "md5", file = FALSE),
      file = digest::digest(tf, algo = algo, file = TRUE),
      unit = "s"
    )
    filesize <- file.size(tf)
    objectsize = object.size(rn)
    mbx <- dplyr::mutate(mbx,
      algorithm = algo,
      size = ifelse(expr == "file", filesize, objectsize)
      )
    results <- dplyr::bind_rows(
      results,
      mbx
    )
  }
}

print("saving")
saveRDS(results, "hash_results.RDS")

print("plotting")
ggplot(
  results,
  aes(
    x = size / (10^6),
    y = time / (10^9),
    group = factor(algorithm),
    color = factor(algorithm),
    shape = factor(expr)
    )
) + 
  labs(
    x = "Mb", y = "Seconds"
  ) +
  facet_wrap(~ expr) +
  # geom_point() +
  geom_smooth()

I can't embed the RDS in a comment here, But I can make gist if you want. The individual results mostly show that md5 has a higher variance in hash time than th xx family.

wlandau-lilly · 2017-08-08T16:03:08Z

@AlexAxthelm Thanks for the idea! I will need some time to think about this, but I am absolutely interested.

wlandau-lilly · 2017-08-08T17:03:18Z

@AlexAxthelm A few initial thoughts:

Speeding up and avoiding hashing are the best things we can do to boost performance, so +1 bringing this up.
Changing the algorithm could threaten back compatibility, but as you say, we could grandfather in md5 sums if we're careful.
Faster hashing would be great, but it would not totally solve this particular issue. The original purpose was to make better use of file modification times to detect cases where we already know the file is up to date: i.e., where it would be pointless to hash at all.

wlandau-lilly · 2017-08-08T17:03:49Z

Also, richfitz/storr#20 is relevant here.

wlandau-lilly · 2017-08-08T17:21:22Z

And yet md5 and sha1 seem to be more commonly used, possibly because they may be better at avoiding collisions. Safety may be worth extra hashing time, and I think we should investigate further. I did notice that xxhash64 hashes are relatively short.

write.csv(mtcars, "mtcars.csv")
library(digest)
digest("mtcars.csv", file = TRUE, algo = "sha1")
## [1] "0d84e86eebe633a69a78206e47f3522eafe0fbbf"
digest("mtcars.csv", file = TRUE, algo = "md5")
## [1] "6463474bfe6973a81dc7cbc4a71e8dd1"
digest("mtcars.csv", file = TRUE, algo = "xxhash64")
## [1] "cd094514792ef806"

AlexAxthelm · 2017-08-08T18:30:32Z

Hadn't considered noticed that xxhash64 was actually shorter. I guess that's one way to make things faster. 😃

wlandau-lilly · 2017-08-11T17:06:28Z

In these past several months, I have not had any new ideas about making better use of file modification times. The original point of this issue has exhausted its shelf life. For the latest discussion of drake's hashing, see #53.

wlandau-lilly · 2017-08-24T03:48:51Z

The decision to hash will be better encapsulated with the should_rehash_file() function. See R/hash.R in the next release (v4.1.0).

should_rehash_file <- function(filename, new_mtime, old_mtime,
  size_cutoff){
  do_rehash <- file.size(filename) < size_cutoff | new_mtime > old_mtime
  if (is.na(do_rehash)){
    do_rehash <- TRUE
  }
  do_rehash
}

file_hash <- function(target, config, size_cutoff = 1e5) {
  if (is_not_file(target))
    return(as.character(NA))
  filename <- eply::unquote(target)
  if (!file.exists(filename))
    return(as.character(NA))
  old_mtime <- ifelse(target %in% config$inventory_filemtime,
    config$cache$get(key = target, namespace = "filemtime"),
    -Inf)
  new_mtime <- file.mtime(filename)
  do_rehash <- should_rehash_file(
    filename = filename,
    new_mtime = new_mtime,
    old_mtime = old_mtime,
    size_cutoff = size_cutoff)
  if (do_rehash){
    rehash_file(target)
  } else {
    config$cache$get(target)$value
  }
}

wlandau · 2018-11-13T14:48:07Z

Back to the original issue: we could consider microbenchmark::get_nanotime() for timestamps. It seems to resolve the resolution issues I was seeing on Windows, and it could help clean up some messy logic.

wlandau · 2018-11-13T14:53:33Z

Oops, I keep forgetting: whatever we do in R, the actual file system will use its own potentially low-resolution method for assigning modification times to files. Never mind.

AlexAxthelm · 2018-11-16T14:54:18Z

At the risk of adding more complexity, could we use an optional meta-object that contains the nanotimes for file creation, and drake checks if that meta-object exists, and uses those precise timestamps if it does, and falls back to the filesystem if not. That way, the default behavior doesn't change, but if someone is having problems, they could flip a flag to get the higher precision.

wlandau · 2018-11-16T15:02:58Z

I think it would be easy to include nanotimes in the existing target-level metadata. But if the user manually edits a file using, say, a text editor, how do we get that nanotime? This is exactly what I keep forgetting to consider.

AlexAxthelm · 2018-11-16T15:05:55Z

We wouldn't. drake would need to use max(<<filesystem_time>>, <<recorded_nanotime>>), and we accept that if the user edits a file outside of drake, then that precision can never be recovered.

wlandau · 2018-11-16T18:20:52Z

Hmm... that could actually work. I will need to think about it some more. I think it would be great to simplify the logic in should_rehash_file() and increase reliance on time stamps over hashes. Because there is still a loss in precision here, I am not yet sure if high-res time stamps will make a difference there, and it would need to be weighed against adding microbenchmark as a package dependency.

wlandau · 2019-06-02T04:32:28Z

Update: effective 9793232, drake now uses file.size() to help decide whether to rehash files. The new decision rule is to recompute the hash if and only if:

The file is under 100 KB, or
The new time stamp is later than the old time stamp, or
The file size changed.

cc @skolenik.

Fast forward from master

wlandau-lilly added the type: new feature label Feb 27, 2017

wlandau-lilly changed the title ~~Improve use of timestamps to minimize unnecessary rehashing of files~~ Timestamps: minimize time spent rehashing of files, but still rebuild files when needed. Mar 7, 2017

wlandau-lilly added the difficulty: advanced label Apr 3, 2017

wlandau-lilly added the topic: performance label Aug 8, 2017

This was referenced Aug 8, 2017

file caching richfitz/storr#20

Open

Possibly support xxhash64 richfitz/storr#49

Closed

wlandau-lilly mentioned this issue Aug 8, 2017

Faster hashing #53

Closed

wlandau-lilly changed the title ~~Timestamps: minimize time spent rehashing of files, but still rebuild files when needed.~~ Rehash files when and only when necessary. Use timestamps to decide. Aug 8, 2017

wlandau-lilly closed this as completed Aug 11, 2017

wlandau-lilly mentioned this issue Sep 8, 2017

Replace tools::md5sum() with digest(file = TRUE, algo = hash_algorithm) #86

Closed

wlandau-lilly mentioned this issue Sep 15, 2017

expose algo / digest in cache_ r-lib/memoise#28

Closed

wlandau removed difficulty: advanced type: new feature labels Nov 13, 2018

wlandau mentioned this issue May 14, 2019

input files that never change #869

Closed

wlandau mentioned this issue May 22, 2019

drake is excessively hashing imported files #878

Closed

3 tasks

wlandau mentioned this issue Jun 13, 2019

Speed up the cache #907

Closed

4 tasks

wlandau mentioned this issue Sep 14, 2019

Accommodation of script-based imperative workflows #994

Closed

2 tasks

wlandau mentioned this issue Oct 13, 2019

Avoid read_from_meta() in outdated() #1027

Closed

3 tasks

wlandau pushed a commit that referenced this issue Oct 22, 2019

Merge pull request #4 from mstr3336/doc_clickthrough_ff

b6342b9

Fast forward from master

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rehash files when and only when necessary. Use timestamps to decide. #4

Rehash files when and only when necessary. Use timestamps to decide. #4

wlandau-lilly commented Feb 27, 2017 •

edited

Loading

wlandau-lilly commented Mar 5, 2017

AlexAxthelm commented Aug 8, 2017 •

edited

Loading

wlandau-lilly commented Aug 8, 2017

wlandau-lilly commented Aug 8, 2017 •

edited

Loading

wlandau-lilly commented Aug 8, 2017

wlandau-lilly commented Aug 8, 2017

AlexAxthelm commented Aug 8, 2017

wlandau-lilly commented Aug 11, 2017

wlandau-lilly commented Aug 24, 2017

wlandau commented Nov 13, 2018

wlandau commented Nov 13, 2018

AlexAxthelm commented Nov 16, 2018

wlandau commented Nov 16, 2018 •

edited

Loading

AlexAxthelm commented Nov 16, 2018

wlandau commented Nov 16, 2018

wlandau commented Jun 2, 2019

Rehash files when and only when necessary. Use timestamps to decide. #4

Rehash files when and only when necessary. Use timestamps to decide. #4

Comments

wlandau-lilly commented Feb 27, 2017 • edited Loading

wlandau-lilly commented Mar 5, 2017

AlexAxthelm commented Aug 8, 2017 • edited Loading

wlandau-lilly commented Aug 8, 2017

wlandau-lilly commented Aug 8, 2017 • edited Loading

wlandau-lilly commented Aug 8, 2017

wlandau-lilly commented Aug 8, 2017

AlexAxthelm commented Aug 8, 2017

wlandau-lilly commented Aug 11, 2017

wlandau-lilly commented Aug 24, 2017

wlandau commented Nov 13, 2018

wlandau commented Nov 13, 2018

AlexAxthelm commented Nov 16, 2018

wlandau commented Nov 16, 2018 • edited Loading

AlexAxthelm commented Nov 16, 2018

wlandau commented Nov 16, 2018

wlandau commented Jun 2, 2019

wlandau-lilly commented Feb 27, 2017 •

edited

Loading

AlexAxthelm commented Aug 8, 2017 •

edited

Loading

wlandau-lilly commented Aug 8, 2017 •

edited

Loading

wlandau commented Nov 16, 2018 •

edited

Loading