Benchmark Poplar1 preparation with different IDPF evaluation caches #1038
This adds a new category of benchmark that evaluates a Poplar1 report at every level with a realistic set of aggregation parameters. The IDPF evaluation cache is reused across the preparations at the different levels, rather than discarded between preparations, and this is repeated with several different cache implementations: the existing ones, some optimizations on top of those inspired by #706, and the binary tree introduced in #978. Here are the results I got:
*Criterion output (collapsed)*
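For intuition about the variable being measured here, the following is a toy, std-only sketch of why reusing one cache across per-level preparations matters. None of this is the prio API: `eval_prefix`, the fake "PRG step", and the single evaluation path are stand-ins (a real benchmark evaluates many candidate prefixes per level), but the counting shows the shape of the work saved.

```rust
use std::collections::HashMap;

/// Stand-in for an IDPF evaluation cache: maps a bit-prefix to a cached
/// "node". The real caches key on bit slices rather than `Vec<bool>`.
type Cache = HashMap<Vec<bool>, u64>;

/// Mock IDPF evaluation of one prefix; `evals` counts node expansions,
/// the expensive PRG work in a real IDPF.
fn eval_prefix(prefix: &[bool], cache: &mut Cache, evals: &mut u64) -> u64 {
    if prefix.is_empty() {
        return 0; // stand-in for the root seed; free to "evaluate"
    }
    if let Some(&node) = cache.get(prefix) {
        return node;
    }
    let parent = eval_prefix(&prefix[..prefix.len() - 1], cache, evals);
    *evals += 1; // one node expansion
    // Toy "PRG step": derive the child from the parent and the last bit.
    let node = parent
        .wrapping_mul(6364136223846793005)
        .wrapping_add(*prefix.last().unwrap() as u64 + 1);
    cache.insert(prefix.to_vec(), node);
    node
}

/// Evaluate one prefix per level down a single path, either reusing one
/// cache across all levels or discarding it between levels.
fn count_evals(bits: usize, reuse_cache: bool) -> u64 {
    let path: Vec<bool> = (0..bits).map(|i| i % 2 == 0).collect();
    let mut cache = Cache::new();
    let mut evals = 0;
    for level in 1..=bits {
        if !reuse_cache {
            cache.clear(); // discard the cache between preparations
        }
        eval_prefix(&path[..level], &mut cache, &mut evals);
    }
    evals
}
```

With reuse, each level costs one new node expansion; without it, level `L` re-derives all `L` nodes from the root, so the total work is quadratic in the number of levels.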
The as-is `HashMapCache` performs the worst, as laid out in #706. The two best-performing caches are variants of `HashMapCache` and `RingBufferCache` that copy inputs into a new `BitVec`, normalize alignment and uninitialized bits, and compare or hash the raw storage slices. I also tried writing variants that didn't copy into a new `BitVec`, but instead used different comparison routines (for equality, destructuring the `Domain` of the bit slice; for hashing, iterating over machine-word-sized chunks of the bit slice and hashing one `usize` at a time). Neither of these performed as well as the corresponding variant that just copies into a new `BitVec`.
The binary tree cache came out between the original `HashMapCache` and the rest, looking at the 256-bit case, and its curve is a notably different shape than the others. If the Poplar1 implementation were more tightly integrated with the binary tree, it would probably perform better. For example, the initial traversal up the IDPF tree looking for a cached node, which does multiple lookups on the `BinaryTree`, could be replaced with a single traversal down the `BinaryTree`. Additionally, insertions would be faster if Poplar1 held onto a node reference and walked down one pointer as it inserted new nodes.

This PR contains one commit that would be worth cherry-picking and landing on its own immediately: an efficiency improvement to the existing Poplar1 benchmarks.

Based on the results above, I think it would make sense to apply the normalization optimization to `HashMapCache` and start using that in Poplar1.

There's also an API design question here: how do we want to expose the IDPF evaluation cache, and side effects on it, in VDAFs that use IDPFs? Should we keep using a mutable reference to the cache, or should we use a `&` reference and require interior mutability? Should there be a new trait related to `Aggregator` that allows for this sort of cache argument, or is it sufficient to expose a cache-aware version of `prepare_init` and not provide a trait for it? Would the cache interface be easier to use and implement correctly if the nonce were an argument to the `get` and `insert` calls as well?
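One possible shape for the trait-based option, purely as a sketch: every name below is hypothetical (none of this is the prio API), the real `prepare_init` carries many more parameters (verify key, aggregation parameter, input share, and so on), and the nonce-in-`get`/`insert` question is shown in its "yes" form.

```rust
use std::collections::HashMap;

/// Hypothetical cache interface. Passing the nonce to `get`/`insert`
/// (rather than binding the cache to one nonce at construction) is one of
/// the open questions above; it makes misuse across reports harder.
trait IdpfCache {
    fn get(&self, nonce: &[u8], prefix: &[u8]) -> Option<Vec<u8>>;
    fn insert(&mut self, nonce: &[u8], prefix: &[u8], node: Vec<u8>);
}

/// One option: a companion trait alongside `Aggregator` exposing a
/// cache-aware preparation entry point. The alternative is an inherent,
/// cache-aware method on `Poplar1` with no trait at all. Taking `&mut`
/// makes the side effects explicit; a `&Self::Cache` signature would
/// instead require interior mutability (e.g. `RefCell` or `Mutex`).
trait CachedAggregator {
    type Cache: IdpfCache;

    fn prepare_init_with_cache(&self, nonce: &[u8], cache: &mut Self::Cache);
}

/// Toy cache implementation, just to show the interface in use.
#[derive(Default)]
struct MapCache(HashMap<(Vec<u8>, Vec<u8>), Vec<u8>>);

impl IdpfCache for MapCache {
    fn get(&self, nonce: &[u8], prefix: &[u8]) -> Option<Vec<u8>> {
        self.0.get(&(nonce.to_vec(), prefix.to_vec())).cloned()
    }
    fn insert(&mut self, nonce: &[u8], prefix: &[u8], node: Vec<u8>) {
        self.0.insert((nonce.to_vec(), prefix.to_vec()), node);
    }
}
```

Under this shape, entries cached under one nonce are invisible to lookups under another, which is one way the nonce-as-argument variant could be "easier to implement correctly".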