-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
speed: get checksum only once, and limit number of characters processed #3
Comments
Thanks for pointing this out. Yes, it's indeed true that The API is actually already prepared for this, e.g. cachefile <- generateCache(key=key, dirs=dirs)
res <- loadCache(pathname=cachefile)
if (!is.null(res)) return(res)
...
saveCache(res, pathname=cachefile) I'll add it to the list of things to fix. |
Great, thanks! And wouldn't you like to make it possible to limit the data length used by digest for generating the checksum? It may not be very important, but could speed up the operation when the parameters involve big data structures. |
You are thinking of the case when you do things such as:
correct? Internally, generic function This means that you can override the getChecksum() used by So, first, define a function for truncating expressions: trunc_expr <- function(expr) {
if (length(expr) <= 2) return(expr)
warning("Expression was truncated before calculating checksum")
expr[1:2]
} (Figuring out Then, create a custom getChecksum.list <- function(key, ...) {
key$expr <- trunc_expr(key$expr)
NextMethod("getChecksum")
} Verifying it works: > library(R.cache)
> expr <- substitute({ a <- 1; b <- 2; c <- 3; d <- 4 })
> length(expr)
[1] 5
> length(trunc_expr(expr))
[1] 2
Warning message:
In trunc_expr(expr) : Expression was truncated before calculating checksum
> value <- evalWithMemoization({ a <- 1; b <- 2; c <- 3; d <- 4 })
Warning message:
In trunc_expr(key$expr) :
Expression was truncated before calculating checksum
Warning message:
In trunc_expr(key$expr) :
Expression was truncated before calculating checksum (there will be one one warning if already memoized, cf. Issue #3) Long-term solutions / idea
|
That looks very nice! I actually meant something simpler (sorry for not being clear), namely using the "length" parameter of digest::digest. For that, you would just need to be able to set a value for the "length" parameter. But it seems that it is not a complete solution, as for that, the object first needs to be serialized, which takes time as well; only then it is not digested fully if "length" is smaller than the length of that. I still gives some speedup: I checked using microbenchmark on my laptop, and when using length=10000 for an object whose serialized length is about 5 megabytes, I got a speedup of about 40%; and I got about 30% speedup when the serialized length is about 10 megabytes. This is not great; the only advantage is that it is very easy to implement this. But your solution can give an arbitrary amount of speedup, depending on how the user decides to truncate the expression; so yours is definitely a better solution. |
@gezakiss7, as a start, I've updated source("http://callr.org/install#HenrikBengtsson/R.cache@develop") |
@gezakiss7, FYI, R.cache 0.12.0 is now on CRAN;
|
Thanks, great! G�za From: Henrik Bengtsson [[email protected]] @gezakiss7https://github.com/gezakiss7, FYI, R.cache 0.12.0 is now on CRAN; o SPEEDUP: Now memoizedCall() generates cache pathname only once. � |
Issue: When using memoizedCall or evalWithMemoization, the checksum is calculated two times, once for loadCache and once for saveCache. When the expression passed to digest is large, this can take a considerable amount of time, since there is no way to specify the length parameter to digest in getChecksum.
This can be a problem. (For example I tried to use it for caching an expression, but it turned out that calculating the checksum took approximately the same time as calculating the expression, because one of the parameters was a big vector. If you could limit the length processed by digest, memoizing this expression would be convenient; but as it is the case, either you need to restructure the program so as to avoid calculating it multiple times, or just accept that it is very slow.)
Suggestion: Make it possible to pass the checksum to loadCache and saveCache instead of the key, so that it needs to be calculated only once when calling memoizedCall or evalWithMemoization; and make it possible to set a maximum length for calculating the digest; possibly not as a parameter (only), but as a setting (e.g. set using new function setLengthForChecksum).
The text was updated successfully, but these errors were encountered: