speed: get checksum only once, and limit number of characters processed #3

gezakiss7 · 2015-02-16T13:27:57Z

Issue: When using memoizedCall or evalWithMemoization, the checksum is calculated two times, once for loadCache and once for saveCache. When the expression passed to digest is large, this can take a considerable amount of time, since there is no way to specify the length parameter to digest in getChecksum.

This can be a problem. (For example I tried to use it for caching an expression, but it turned out that calculating the checksum took approximately the same time as calculating the expression, because one of the parameters was a big vector. If you could limit the length processed by digest, memoizing this expression would be convenient; but as it is the case, either you need to restructure the program so as to avoid calculating it multiple times, or just accept that it is very slow.)

Suggestion: Make it possible to pass the checksum to loadCache and saveCache instead of the key, so that it needs to be calculated only once when calling memoizedCall or evalWithMemoization; and make it possible to set a maximum length for calculating the digest; possibly not as a parameter (only), but as a setting (e.g. set using new function setLengthForChecksum).

HenrikBengtsson · 2015-02-16T19:13:10Z

Thanks for pointing this out. Yes, it's indeed true that memoizedCall() and evalWithMemoization() generate the cache key twice and I agree it is unnecessary.

The API is actually already prepared for this, e.g.

cachefile <- generateCache(key=key, dirs=dirs)
res <- loadCache(pathname=cachefile)
if (!is.null(res)) return(res)
...
saveCache(res, pathname=cachefile)

I'll add it to the list of things to fix.

gezakiss7 · 2015-02-16T19:18:00Z

Great, thanks!

And wouldn't you like to make it possible to limit the data length used by digest for generating the checksum? It may not be very important, but could speed up the operation when the parameters involve big data structures.

HenrikBengtsson · 2015-02-16T19:58:53Z

You are thinking of the case when you do things such as:

res <- evalWithMemoization(expr)

correct? Internally, generic function getChecksum(key) is used, which means that it dispatches on the S3 class of key. It will always fall back to using digest::digest() if no matching S3 method is implement.

This means that you can override the getChecksum() used by evalWithMemoization() already now. It's a bit of a hack because it relies on the fact that evalWithMemoization() passes a key of class list where the evaluated expression is in named element expr.

So, first, define a function for truncating expressions:

trunc_expr <- function(expr) {
  if (length(expr) <= 2) return(expr)
  warning("Expression was truncated before calculating checksum")
  expr[1:2]
}

(Figuring out trunc_expr(), that is, how to really truncate expressions will be the tricky part, because expressions are nested and the "large" part may be hidden deep down.)

Then, create a custom getChecksum() for list:

getChecksum.list <- function(key, ...) {
  key$expr <- trunc_expr(key$expr)
  NextMethod("getChecksum")
}

Verifying it works:

> library(R.cache)
> expr <- substitute({ a <- 1; b <- 2; c <- 3; d <- 4 })
> length(expr)
[1] 5
> length(trunc_expr(expr))
[1] 2
Warning message:
In trunc_expr(expr) : Expression was truncated before calculating checksum
> value <- evalWithMemoization({ a <- 1; b <- 2; c <- 3; d <- 4 })
Warning message:
In trunc_expr(key$expr) :
  Expression was truncated before calculating checksum
Warning message:
In trunc_expr(key$expr) :
  Expression was truncated before calculating checksum

(there will be one one warning if already memoized, cf. Issue #3)

Long-term solutions / idea

evalWithMemoization() could add class attribute ExpressionKey to the key passed to getChecksum(), such that one can override getChecksum() for class ExpressionKey instead of for all list:s.
If (at all), truncating of expr should be supported, it should probably be done by providing a way for the user to specify a function for truncating the expression. An even more generic way is to use a customized getChecksum() method, which is the idea behind the above approach, but maybe this can be done via an argument.

gezakiss7 · 2015-02-16T20:37:58Z

That looks very nice!

I actually meant something simpler (sorry for not being clear), namely using the "length" parameter of digest::digest. For that, you would just need to be able to set a value for the "length" parameter.

But it seems that it is not a complete solution, as for that, the object first needs to be serialized, which takes time as well; only then it is not digested fully if "length" is smaller than the length of that.

I still gives some speedup: I checked using microbenchmark on my laptop, and when using length=10000 for an object whose serialized length is about 5 megabytes, I got a speedup of about 40%; and I got about 30% speedup when the serialized length is about 10 megabytes. This is not great; the only advantage is that it is very easy to implement this.

But your solution can give an arbitrary amount of speedup, depending on how the user decides to truncate the expression; so yours is definitely a better solution.

HenrikBengtsson · 2015-02-28T00:39:22Z

@gezakiss7, as a start, I've updated memoizedCall() to only generate the cache pathname once (Issue #5, commit bc26a1e). You can install this version using:

source("http://callr.org/install#HenrikBengtsson/R.cache@develop")

HenrikBengtsson · 2015-11-12T19:21:22Z

@gezakiss7, FYI, R.cache 0.12.0 is now on CRAN;

o SPEEDUP: Now memoizedCall() generates cache pathname only once.

gezakiss7 · 2015-11-12T19:23:05Z

Thanks, great! G�za

From: Henrik Bengtsson [[email protected]]
Sent: Thursday, November 12, 2015 11:21 AM
To: HenrikBengtsson/R.cache
Cc: Geza Kiss
Subject: Re: [R.cache] speed: get checksum only once, and limit number of characters processed (#3)

@gezakiss7 https://github.com/gezakiss7, FYI, R.cache 0.12.0 is now on CRAN;

o SPEEDUP: Now memoizedCall() generates cache pathname only once.

�
Reply to this email directly or view it on GitHubhttps://github.com//issues/3#issuecomment-156207371.

HenrikBengtsson added the enhancement label Feb 16, 2015

This was referenced Feb 24, 2015

SPEEDUP: memoizedCall() should only need to generate cache key one #5

Closed

generateCache(key, ...) to pass ... to getChecksum()? #6

Open

HenrikBengtsson added the feature request label Nov 27, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

speed: get checksum only once, and limit number of characters processed #3

speed: get checksum only once, and limit number of characters processed #3

gezakiss7 commented Feb 16, 2015

HenrikBengtsson commented Feb 16, 2015

gezakiss7 commented Feb 16, 2015

HenrikBengtsson commented Feb 16, 2015

gezakiss7 commented Feb 16, 2015

HenrikBengtsson commented Feb 28, 2015

HenrikBengtsson commented Nov 12, 2015

gezakiss7 commented Nov 12, 2015

speed: get checksum only once, and limit number of characters processed #3

speed: get checksum only once, and limit number of characters processed #3

Comments

gezakiss7 commented Feb 16, 2015

HenrikBengtsson commented Feb 16, 2015

gezakiss7 commented Feb 16, 2015

HenrikBengtsson commented Feb 16, 2015

Long-term solutions / idea

gezakiss7 commented Feb 16, 2015

HenrikBengtsson commented Feb 28, 2015

HenrikBengtsson commented Nov 12, 2015

gezakiss7 commented Nov 12, 2015