Benchmark Poplar1 preparation with different IDPF evaluation caches #1038

Draft
wants to merge 7 commits into main

Conversation

divergentdave
Collaborator

This adds a new category of benchmark that prepares a Poplar1 report at every level with a realistic set of aggregation parameters. The benchmark reuses the same IDPF evaluation cache across the preparations at different levels, rather than discarding it between preparations, and it is repeated with several different cache implementations: the existing ones, some optimizations on top of those inspired by #706, and the binary tree introduced in #978. Here are the results I got:

Criterion output
poplar1_prepare_cached/ringbuffer_100/16
                        time:   [3.0772 ms 3.0802 ms 3.0836 ms]
                        change: [-16.725% -14.487% -12.335%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
poplar1_prepare_cached/ringbuffer_a_100/16
                        time:   [2.7827 ms 2.7839 ms 2.7860 ms]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe
poplar1_prepare_cached/ringbuffer_b_100/16
                        time:   [3.3041 ms 3.3049 ms 3.3059 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/hashmap/16
                        time:   [2.8163 ms 2.8181 ms 2.8210 ms]
                        change: [-1.3160% -0.9884% -0.7324%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
poplar1_prepare_cached/hashmap_a/16
                        time:   [2.7674 ms 2.7685 ms 2.7700 ms]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
poplar1_prepare_cached/hashmap_b/16
                        time:   [2.8382 ms 2.8391 ms 2.8406 ms]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
poplar1_prepare_cached/binarytree/16
                        time:   [2.5032 ms 2.5060 ms 2.5098 ms]
                        change: [+0.7524% +0.9031% +1.0655%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) high mild
  4 (4.00%) high severe
poplar1_prepare_cached/ringbuffer_100/128
                        time:   [26.897 ms 26.904 ms 26.912 ms]
                        change: [-0.7460% -0.6385% -0.5527%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
poplar1_prepare_cached/ringbuffer_a_100/128
                        time:   [24.263 ms 24.267 ms 24.274 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/ringbuffer_b_100/128
                        time:   [28.462 ms 28.481 ms 28.507 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
poplar1_prepare_cached/hashmap/128
                        time:   [31.600 ms 31.611 ms 31.627 ms]
                        change: [-2.5733% -2.0554% -1.6271%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/hashmap_a/128
                        time:   [24.320 ms 24.337 ms 24.364 ms]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
poplar1_prepare_cached/hashmap_b/128
                        time:   [25.436 ms 25.457 ms 25.491 ms]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
poplar1_prepare_cached/binarytree/128
                        time:   [25.584 ms 25.599 ms 25.617 ms]
                        change: [-0.3466% -0.2297% -0.1171%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
poplar1_prepare_cached/ringbuffer_100/256
                        time:   [54.242 ms 54.264 ms 54.298 ms]
                        change: [-1.1951% -0.9858% -0.8324%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
poplar1_prepare_cached/ringbuffer_a_100/256
                        time:   [48.952 ms 48.972 ms 49.003 ms]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
poplar1_prepare_cached/ringbuffer_b_100/256
                        time:   [56.317 ms 56.513 ms 56.784 ms]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
poplar1_prepare_cached/hashmap/256
                        time:   [78.971 ms 78.981 ms 78.993 ms]
                        change: [-1.8087% -1.5994% -1.4575%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/hashmap_a/256
                        time:   [49.691 ms 49.709 ms 49.735 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
poplar1_prepare_cached/hashmap_b/256
                        time:   [53.904 ms 53.916 ms 53.929 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
poplar1_prepare_cached/binarytree/256
                        time:   [64.081 ms 64.094 ms 64.108 ms]
                        change: [-0.1716% -0.0872% -0.0107%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

The as-is HashMapCache performs the worst, as laid out in #706. The two best-performing caches are variants of HashMapCache and RingBufferCache that copy inputs into a new BitVec, normalize alignment and uninitialized bits, and then compare or hash the raw storage slices (sketched below). I also tried writing variants that used different comparison routines instead of copying into a new BitVec: for equality, destructuring the Domain of the bit slice; for hashing, iterating over machine-word-sized chunks of the bit slice and hashing one usize at a time. Neither performed as well as the corresponding variant that just copies into a new BitVec.

The binary tree cache came out between the original HashMapCache and the rest in the 256-bit case, and its curve notably has a different shape than the others. If the Poplar1 implementation were more tightly integrated with the binary tree, it would probably perform better. For example, the initial traversal up the IDPF tree looking for a cached node, which does multiple lookups on the BinaryTree, could be replaced with a single traversal down the BinaryTree. Insertions would also be faster if Poplar1 held onto a node reference and walked down one pointer as it inserted each new node.
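
For reference, here is a rough sketch of the normalization idea, assuming the cache key is backed by the bitvec crate. The `NormalizedKey` type and its details are illustrative only, not the code on this branch:

```rust
// Illustrative only: one way to normalize a bit-slice key so that equality
// and hashing can operate on whole storage words. Names are hypothetical.
use std::hash::{Hash, Hasher};

use bitvec::prelude::*;

/// A cache key that owns a freshly copied, alignment-normalized bit vector.
#[derive(Debug, Eq)]
struct NormalizedKey {
    bits: BitVec<usize, Lsb0>,
}

impl NormalizedKey {
    fn new(input: &BitSlice<usize, Lsb0>) -> Self {
        // Copying into a new BitVec starts the bits at offset zero of the
        // backing storage, and set_uninitialized() forces the trailing
        // padding bits to a known value, so two keys with the same bits
        // have identical raw words.
        let mut bits = BitVec::from_bitslice(input);
        bits.set_uninitialized(false);
        Self { bits }
    }
}

impl PartialEq for NormalizedKey {
    fn eq(&self, other: &Self) -> bool {
        // Compare machine words instead of iterating bit-by-bit.
        self.bits.len() == other.bits.len()
            && self.bits.as_raw_slice() == other.bits.as_raw_slice()
    }
}

impl Hash for NormalizedKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.bits.len().hash(state);
        self.bits.as_raw_slice().hash(state);
    }
}
```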

This PR contains one commit that would be worth cherry-picking and landing on its own immediately: an efficiency improvement to the existing Poplar1 benchmarks. Based on the results above, I think it would make sense to apply the normalization optimization to HashMapCache and start using that in Poplar1. There's also an API design question here: how do we want to expose the IDPF evaluation cache, and side effects on it, in VDAFs that use IDPFs? Should we keep using a mutable reference to the cache, or should we switch to a shared `&` reference and require interior mutability? Should there be a new trait related to Aggregator that allows for this sort of cache argument, or is it sufficient to expose a cache-aware version of prepare_init and not provide a trait for it? Would the cache interface be easier to use and implement correctly if the nonce were also an argument to the get and insert calls?
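
To make that last question concrete, here is one hypothetical shape such an interface could take. This is not the existing IdpfCache trait, and the value type and nonce length are assumptions; it only illustrates passing the nonce through get and insert:

```rust
// Hypothetical cache interface with the nonce threaded through lookups and
// insertions, purely to illustrate the design question above.
use bitvec::prelude::*;

/// `V` stands in for whatever per-node IDPF evaluation state gets cached.
trait NonceAwareIdpfCache<V> {
    /// Look up the cached value for this (nonce, prefix) pair.
    fn get(&self, nonce: &[u8; 16], prefix: &BitSlice) -> Option<V>;

    /// Store a value for this (nonce, prefix) pair. Taking the nonce here
    /// lets an implementation partition or evict state per report, rather
    /// than relying on the caller to clear the cache between reports.
    fn insert(&mut self, nonce: &[u8; 16], prefix: &BitSlice, value: V);
}
```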

@cjpatton
Collaborator

Thanks @divergentdave. It looks like the ring buffer is still the best option for Poplar1? I wonder whether we would see a performance improvement if we re-implemented BinaryTree with a Vec.
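
Something along these lines, as a rough sketch: nodes live in a flat Vec arena and children are indices rather than boxed pointers, which keeps the nodes close together in memory. Names and layout here are hypothetical, not the prio crate's BinaryTree:

```rust
// Sketch of a Vec-backed binary tree with index-based child links.
struct ArenaNode<V> {
    value: V,
    left: Option<usize>,
    right: Option<usize>,
}

struct ArenaTree<V> {
    nodes: Vec<ArenaNode<V>>,
}

impl<V> ArenaTree<V> {
    fn new(root_value: V) -> Self {
        Self {
            nodes: vec![ArenaNode { value: root_value, left: None, right: None }],
        }
    }

    /// Insert a child under `parent` (index 0 is the root), returning its index.
    fn insert_child(&mut self, parent: usize, right: bool, value: V) -> usize {
        let index = self.nodes.len();
        self.nodes.push(ArenaNode { value, left: None, right: None });
        let slot = if right {
            &mut self.nodes[parent].right
        } else {
            &mut self.nodes[parent].left
        };
        *slot = Some(index);
        index
    }

    /// Walk from the root along a path of left/right choices.
    fn get(&self, path: impl IntoIterator<Item = bool>) -> Option<&V> {
        let mut current = 0;
        for right in path {
            let node = &self.nodes[current];
            current = if right { node.right? } else { node.left? };
        }
        Some(&self.nodes[current].value)
    }
}
```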

To the API question: I'd go with a mutable reference to the cache, and I wouldn't expose this in the Aggregator API. Based on my experience prototyping Mastic in Daphne, I expect Poplar1-specific logic to be needed anyway, so I don't think a generic API for caching obviously helps.
