Rehash files when and only when necessary. Use timestamps to decide. #4
For 2.0.0, I got rid of the [...]
Would using a faster hashing algorithm such as xxhash be an option? It's built into digest. I ran a test this morning comparing the algorithms:

library(tidyverse)
library(progress)
library(dplyr)
library(stringr)
library(microbenchmark)
library(digest)
results <- NULL
algo_list <- c("md5", "sha1", "xxhash32", "xxhash64", "murmur32")
n_loops <- 25
pb <- progress::progress_bar$new(
format = ":algo :size [:bar] :percent eta: :eta",
  total = (2^(n_loops + 1) - 2) * length(algo_list) # sum of the 2^i ticks per algorithm
)
tf <- tempfile(fileext = ".RDS")
for (i in seq(from = 1, to = n_loops, by = 1)){
rn <- rnorm(2^i)
saveRDS(rn, tf)
for (algo in algo_list){
pb$tick(len = 2^i, tokens = list(algo = stringr::str_pad(algo, 9), size = stringr::str_pad(i, 3)))
mbx <- microbenchmark::microbenchmark(
object = digest::digest(rn, algo = "md5", file = FALSE),
file = digest::digest(tf, algo = algo, file = TRUE),
unit = "s"
)
filesize <- file.size(tf)
  objectsize <- object.size(rn)
mbx <- dplyr::mutate(mbx,
algorithm = algo,
size = ifelse(expr == "file", filesize, objectsize)
)
results <- dplyr::bind_rows(
results,
mbx
)
}
}
print("saving")
saveRDS(results, "hash_results.RDS")
print("plotting")
ggplot(
results,
aes(
x = size / (10^6),
y = time / (10^9),
group = factor(algorithm),
color = factor(algorithm),
shape = factor(expr)
)
) +
labs(
x = "Mb", y = "Seconds"
) +
facet_wrap(~ expr) +
# geom_point() +
geom_smooth()
I can't embed the RDS in a comment here, but I can make a gist if you want. The individual results mostly show that [...]
@AlexAxthelm Thanks for the idea! I will need some time to think about this, but I am absolutely interested.
@AlexAxthelm A few initial thoughts: [...]
Also, richfitz/storr#20 is relevant here.
And yet md5 and sha1 seem to be more commonly used, possibly because they are better at avoiding collisions. That safety may be worth the extra hashing time, and I think we should investigate further. I did notice that xxhash64 hashes are relatively short.

write.csv(mtcars, "mtcars.csv")
library(digest)
digest("mtcars.csv", file = TRUE, algo = "sha1")
## [1] "0d84e86eebe633a69a78206e47f3522eafe0fbbf"
digest("mtcars.csv", file = TRUE, algo = "md5")
## [1] "6463474bfe6973a81dc7cbc4a71e8dd1"
digest("mtcars.csv", file = TRUE, algo = "xxhash64")
## [1] "cd094514792ef806" |
Hadn't considered [...]; noticed that [...]
In these past several months, I have not had any new ideas about making better use of file modification times. The original point of this issue has exhausted its shelf life. For the latest discussion of drake's hashing, see #53.
The decision to hash will be better encapsulated in the new should_rehash_file() function:

should_rehash_file <- function(filename, new_mtime, old_mtime,
                               size_cutoff){
  do_rehash <- file.size(filename) < size_cutoff | new_mtime > old_mtime
  if (is.na(do_rehash)){
    do_rehash <- TRUE
  }
  do_rehash
}

file_hash <- function(target, config, size_cutoff = 1e5) {
  if (is_not_file(target))
    return(as.character(NA))
  filename <- eply::unquote(target)
  if (!file.exists(filename))
    return(as.character(NA))
  old_mtime <- ifelse(target %in% config$inventory_filemtime,
                      config$cache$get(key = target, namespace = "filemtime"),
                      -Inf)
  new_mtime <- file.mtime(filename)
  do_rehash <- should_rehash_file(
    filename = filename,
    new_mtime = new_mtime,
    old_mtime = old_mtime,
    size_cutoff = size_cutoff)
  if (do_rehash){
    rehash_file(target)
  } else {
    config$cache$get(target)$value
  }
}
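For illustration, here is a minimal sketch (not drake code) of how that rule behaves; the cutoff values and temp file are made up for the example:

# Small files are always rehashed, regardless of modification times.
f <- tempfile()
writeLines("tiny file", f)
should_rehash_file(f, new_mtime = 1, old_mtime = 2, size_cutoff = 1e5)
## [1] TRUE

# Above the cutoff, rehash only when the modification time has advanced.
should_rehash_file(f, new_mtime = 2, old_mtime = 1, size_cutoff = 0)
## [1] TRUE
should_rehash_file(f, new_mtime = 1, old_mtime = 1, size_cutoff = 0)
## [1] FALSE

# A missing file gives NA, which falls through to a rehash.
should_rehash_file("no_such_file", new_mtime = 1, old_mtime = 1, size_cutoff = 0)
## [1] TRUE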
Back to the original issue: we could consider recording higher-resolution time stamps.
Oops, I keep forgetting: whatever we do in R, the actual file system will use its own potentially low-resolution method for assigning modification times to files. Never mind.
At the risk of adding more complexity, could we use an optional meta-object that contains the nanotime?
I think it would be easy to include nanotimes in the existing target-level metadata. But if the user manually edits a file using, say, a text editor, how do we get that nanotime? This is exactly what I keep forgetting to consider.
We wouldn't.
Hmm... that could actually work. I will need to think about it some more. It would be great to simplify the logic in should_rehash_file() and rely more on time stamps than on hashes. But because the file system still loses precision, I am not yet sure high-resolution time stamps would make a difference, and the benefit would need to be weighed against adding microbenchmark as a package dependency.
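For concreteness, a rough sketch of the kind of record that could be stored when drake itself writes a file; record_file_meta() and the list layout are hypothetical, not an actual drake API:

library(digest)
library(microbenchmark)

# Hypothetical helper: capture a hash plus both time stamps at storage time.
record_file_meta <- function(filename){
  list(
    hash = digest::digest(filename, algo = "md5", file = TRUE),
    mtime = file.mtime(filename),               # file-system resolution only
    nanotime = microbenchmark::get_nanotime()   # high-resolution timer, nanoseconds
  )
}

f <- tempfile()
writeLines("example", f)
meta <- record_file_meta(f)
# Note: get_nanotime() is a monotonic timer, not wall-clock time, and it can
# only be captured for files written by drake itself; a file edited by hand in
# a text editor never passes through this code path.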
Update: effective 9793232, [...]
cc @skolenik.
I use fingerprints (hashes) to detect when the user's external files change. Fingerprints are expensive, so I use file modification times to judge whether fingerprinting is even worth the expense. The trouble is that file.mtime() is egregiously imprecise on Windows and Mac, so true updates to files could potentially be missed. My current workaround is to force a rehash when the file is small or its modification time is newer than the one on record. This rule mostly covers it, but manual changes to any medium-to-large file may be ignored if drake looks at that file in the same second. I would really like to just get more precise timestamps. With even millisecond precision, I could just wait until the next increment in mtime before importing or building the next file.
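For reference, a quick way to check how much sub-second precision a given file system actually reports (just a sketch; results differ by platform and file system):

f <- tempfile()
writeLines("precision check", f)
# Show up to six fractional digits of the modification time.
format(file.mtime(f), "%Y-%m-%d %H:%M:%OS6")
# Fractional part alone; all zeros means whole-second resolution.
as.numeric(file.mtime(f)) %% 1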
Sometime in the future, I may be able to assume all file systems support high-resolution times. Apparently, R 3.3.3 will have this solved for Windows. But until that happens on all platforms, I do not think I can solve this issue.