Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed: get checksum only once, and limit number of characters processed #3

Open
gezakiss7 opened this issue Feb 16, 2015 · 7 comments

Comments

@gezakiss7
Copy link

Issue: When using memoizedCall or evalWithMemoization, the checksum is calculated two times, once for loadCache and once for saveCache. When the expression passed to digest is large, this can take a considerable amount of time, since there is no way to specify the length parameter to digest in getChecksum.

This can be a problem. (For example I tried to use it for caching an expression, but it turned out that calculating the checksum took approximately the same time as calculating the expression, because one of the parameters was a big vector. If you could limit the length processed by digest, memoizing this expression would be convenient; but as it is the case, either you need to restructure the program so as to avoid calculating it multiple times, or just accept that it is very slow.)

Suggestion: Make it possible to pass the checksum to loadCache and saveCache instead of the key, so that it needs to be calculated only once when calling memoizedCall or evalWithMemoization; and make it possible to set a maximum length for calculating the digest; possibly not as a parameter (only), but as a setting (e.g. set using new function setLengthForChecksum).

@HenrikBengtsson
Copy link
Owner

Thanks for pointing this out. Yes, it's indeed true that memoizedCall() and evalWithMemoization() generate the cache key twice and I agree it is unnecessary.

The API is actually already prepared for this, e.g.

cachefile <- generateCache(key=key, dirs=dirs)
res <- loadCache(pathname=cachefile)
if (!is.null(res)) return(res)
...
saveCache(res, pathname=cachefile)

I'll add it to the list of things to fix.

@gezakiss7
Copy link
Author

Great, thanks!

And wouldn't you like to make it possible to limit the data length used by digest for generating the checksum? It may not be very important, but could speed up the operation when the parameters involve big data structures.

@HenrikBengtsson
Copy link
Owner

You are thinking of the case when you do things such as:

res <- evalWithMemoization(expr)

correct? Internally, generic function getChecksum(key) is used, which means that it dispatches on the S3 class of key. It will always fall back to using digest::digest() if no matching S3 method is implement.

This means that you can override the getChecksum() used by evalWithMemoization() already now. It's a bit of a hack because it relies on the fact that evalWithMemoization() passes a key of class list where the evaluated expression is in named element expr.

So, first, define a function for truncating expressions:

trunc_expr <- function(expr) {
  if (length(expr) <= 2) return(expr)
  warning("Expression was truncated before calculating checksum")
  expr[1:2]
}

(Figuring out trunc_expr(), that is, how to really truncate expressions will be the tricky part, because expressions are nested and the "large" part may be hidden deep down.)

Then, create a custom getChecksum() for list:

getChecksum.list <- function(key, ...) {
  key$expr <- trunc_expr(key$expr)
  NextMethod("getChecksum")
}

Verifying it works:

> library(R.cache)
> expr <- substitute({ a <- 1; b <- 2; c <- 3; d <- 4 })
> length(expr)
[1] 5
> length(trunc_expr(expr))
[1] 2
Warning message:
In trunc_expr(expr) : Expression was truncated before calculating checksum
> value <- evalWithMemoization({ a <- 1; b <- 2; c <- 3; d <- 4 })
Warning message:
In trunc_expr(key$expr) :
  Expression was truncated before calculating checksum
Warning message:
In trunc_expr(key$expr) :
  Expression was truncated before calculating checksum

(there will be one one warning if already memoized, cf. Issue #3)

Long-term solutions / idea

  • evalWithMemoization() could add class attribute ExpressionKey to the key passed to getChecksum(), such that one can override getChecksum() for class ExpressionKey instead of for all list:s.
  • If (at all), truncating of expr should be supported, it should probably be done by providing a way for the user to specify a function for truncating the expression. An even more generic way is to use a customized getChecksum() method, which is the idea behind the above approach, but maybe this can be done via an argument.

@gezakiss7
Copy link
Author

That looks very nice!

I actually meant something simpler (sorry for not being clear), namely using the "length" parameter of digest::digest. For that, you would just need to be able to set a value for the "length" parameter.

But it seems that it is not a complete solution, as for that, the object first needs to be serialized, which takes time as well; only then it is not digested fully if "length" is smaller than the length of that.

I still gives some speedup: I checked using microbenchmark on my laptop, and when using length=10000 for an object whose serialized length is about 5 megabytes, I got a speedup of about 40%; and I got about 30% speedup when the serialized length is about 10 megabytes. This is not great; the only advantage is that it is very easy to implement this.

But your solution can give an arbitrary amount of speedup, depending on how the user decides to truncate the expression; so yours is definitely a better solution.

@HenrikBengtsson
Copy link
Owner

@gezakiss7, as a start, I've updated memoizedCall() to only generate the cache pathname once (Issue #5, commit bc26a1e). You can install this version using:

source("http://callr.org/install#HenrikBengtsson/R.cache@develop")

@HenrikBengtsson
Copy link
Owner

@gezakiss7, FYI, R.cache 0.12.0 is now on CRAN;

o SPEEDUP: Now memoizedCall() generates cache pathname only once.

@gezakiss7
Copy link
Author

Thanks, great! G�za


From: Henrik Bengtsson [[email protected]]
Sent: Thursday, November 12, 2015 11:21 AM
To: HenrikBengtsson/R.cache
Cc: Geza Kiss
Subject: Re: [R.cache] speed: get checksum only once, and limit number of characters processed (#3)

@gezakiss7https://github.com/gezakiss7, FYI, R.cache 0.12.0 is now on CRAN;

o SPEEDUP: Now memoizedCall() generates cache pathname only once.


Reply to this email directly or view it on GitHubhttps://github.com//issues/3#issuecomment-156207371.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants