Benchmark Poplar1 preparation with different IDPF evaluation caches #1038

Draft
wants to merge 7 commits into main

Conversation

divergentdave
Collaborator

This adds a new category of benchmark that prepares a Poplar1 report at every level with a realistic set of aggregation parameters. The benchmark reuses the same IDPF evaluation cache across the preparations at different levels, rather than discarding it between preparations, and it is repeated with several different cache implementations: the existing ones, some optimizations on top of those inspired by #706, and the binary tree introduced in #978. Here are the results I got:

Criterion output
poplar1_prepare_cached/ringbuffer_100/16
                        time:   [3.0772 ms 3.0802 ms 3.0836 ms]
                        change: [-16.725% -14.487% -12.335%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
poplar1_prepare_cached/ringbuffer_a_100/16
                        time:   [2.7827 ms 2.7839 ms 2.7860 ms]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe
poplar1_prepare_cached/ringbuffer_b_100/16
                        time:   [3.3041 ms 3.3049 ms 3.3059 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/hashmap/16
                        time:   [2.8163 ms 2.8181 ms 2.8210 ms]
                        change: [-1.3160% -0.9884% -0.7324%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
poplar1_prepare_cached/hashmap_a/16
                        time:   [2.7674 ms 2.7685 ms 2.7700 ms]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
poplar1_prepare_cached/hashmap_b/16
                        time:   [2.8382 ms 2.8391 ms 2.8406 ms]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
poplar1_prepare_cached/binarytree/16
                        time:   [2.5032 ms 2.5060 ms 2.5098 ms]
                        change: [+0.7524% +0.9031% +1.0655%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) high mild
  4 (4.00%) high severe
poplar1_prepare_cached/ringbuffer_100/128
                        time:   [26.897 ms 26.904 ms 26.912 ms]
                        change: [-0.7460% -0.6385% -0.5527%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
poplar1_prepare_cached/ringbuffer_a_100/128
                        time:   [24.263 ms 24.267 ms 24.274 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/ringbuffer_b_100/128
                        time:   [28.462 ms 28.481 ms 28.507 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
poplar1_prepare_cached/hashmap/128
                        time:   [31.600 ms 31.611 ms 31.627 ms]
                        change: [-2.5733% -2.0554% -1.6271%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/hashmap_a/128
                        time:   [24.320 ms 24.337 ms 24.364 ms]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
poplar1_prepare_cached/hashmap_b/128
                        time:   [25.436 ms 25.457 ms 25.491 ms]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
poplar1_prepare_cached/binarytree/128
                        time:   [25.584 ms 25.599 ms 25.617 ms]
                        change: [-0.3466% -0.2297% -0.1171%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
poplar1_prepare_cached/ringbuffer_100/256
                        time:   [54.242 ms 54.264 ms 54.298 ms]
                        change: [-1.1951% -0.9858% -0.8324%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
poplar1_prepare_cached/ringbuffer_a_100/256
                        time:   [48.952 ms 48.972 ms 49.003 ms]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
poplar1_prepare_cached/ringbuffer_b_100/256
                        time:   [56.317 ms 56.513 ms 56.784 ms]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
poplar1_prepare_cached/hashmap/256
                        time:   [78.971 ms 78.981 ms 78.993 ms]
                        change: [-1.8087% -1.5994% -1.4575%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
poplar1_prepare_cached/hashmap_a/256
                        time:   [49.691 ms 49.709 ms 49.735 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
poplar1_prepare_cached/hashmap_b/256
                        time:   [53.904 ms 53.916 ms 53.929 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
poplar1_prepare_cached/binarytree/256
                        time:   [64.081 ms 64.094 ms 64.108 ms]
                        change: [-0.1716% -0.0872% -0.0107%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

The as-is HashMapCache performs the worst, as laid out in #706. The two best-performing caches are variants of HashMapCache and RingBufferCache that copy inputs into a new BitVec, normalize alignment and uninitialized bits, and then compare or hash the raw storage slices (sketched below). I also tried writing variants that used different comparison routines instead of copying into a new BitVec: for equality, destructuring the Domain of the bit slice; for hashing, iterating over machine-word-sized chunks of the bit slice and hashing one usize at a time. Neither performed as well as the corresponding variant that just copies into a new BitVec.

The binary tree cache came out between the original HashMapCache and the rest in the 256-bit case, and its curve notably has a different shape than the others. If the Poplar1 implementation were more tightly integrated with the binary tree, it would probably perform better. For example, the initial traversal up the IDPF tree looking for a cached node, which does multiple lookups on the BinaryTree, could be replaced with a single traversal down the BinaryTree. Insertions would also be faster if Poplar1 held onto a node reference and walked down one pointer as it inserted each new node.
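
For reference, here is a rough sketch of the normalization idea, assuming the cache key is backed by the bitvec crate. The `NormalizedKey` type and its details are illustrative only, not the code on this branch:

```rust
// Illustrative only: one way to normalize a bit-slice key so that equality
// and hashing can operate on whole storage words. Names are hypothetical.
use std::hash::{Hash, Hasher};

use bitvec::prelude::*;

/// A cache key that owns a freshly copied, alignment-normalized bit vector.
#[derive(Debug, Eq)]
struct NormalizedKey {
    bits: BitVec<usize, Lsb0>,
}

impl NormalizedKey {
    fn new(input: &BitSlice<usize, Lsb0>) -> Self {
        // Copying into a new BitVec starts the bits at offset zero of the
        // backing storage, and set_uninitialized() forces the trailing
        // padding bits to a known value, so two keys with the same bits
        // have identical raw words.
        let mut bits = BitVec::from_bitslice(input);
        bits.set_uninitialized(false);
        Self { bits }
    }
}

impl PartialEq for NormalizedKey {
    fn eq(&self, other: &Self) -> bool {
        // Compare machine words instead of iterating bit-by-bit.
        self.bits.len() == other.bits.len()
            && self.bits.as_raw_slice() == other.bits.as_raw_slice()
    }
}

impl Hash for NormalizedKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.bits.len().hash(state);
        self.bits.as_raw_slice().hash(state);
    }
}
```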

This PR contains one commit that would be worth cherry-picking and landing on its own immediately: an efficiency improvement to the existing Poplar1 benchmarks. Based on the results above, I think it would make sense to apply the normalization optimization to HashMapCache and start using that in Poplar1. There's also an API design question here: how do we want to expose the IDPF evaluation cache, and side effects on it, in VDAFs that use IDPFs? Should we keep using a mutable reference to the cache, or should we switch to a shared `&` reference and require interior mutability? Should there be a new trait related to Aggregator that allows for this sort of cache argument, or is it sufficient to expose a cache-aware version of prepare_init and not provide a trait for it? Would the cache interface be easier to use and implement correctly if the nonce were also an argument to the get and insert calls?
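
To make that last question concrete, here is one hypothetical shape such an interface could take. This is not the existing IdpfCache trait, and the value type and nonce length are assumptions; it only illustrates passing the nonce through get and insert:

```rust
// Hypothetical cache interface with the nonce threaded through lookups and
// insertions, purely to illustrate the design question above.
use bitvec::prelude::*;

/// `V` stands in for whatever per-node IDPF evaluation state gets cached.
trait NonceAwareIdpfCache<V> {
    /// Look up the cached value for this (nonce, prefix) pair.
    fn get(&self, nonce: &[u8; 16], prefix: &BitSlice) -> Option<V>;

    /// Store a value for this (nonce, prefix) pair. Taking the nonce here
    /// lets an implementation partition or evict state per report, rather
    /// than relying on the caller to clear the cache between reports.
    fn insert(&mut self, nonce: &[u8; 16], prefix: &BitSlice, value: V);
}
```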

@cjpatton
Collaborator

Thanks @divergentdave. It looks like the ring buffer is still the best option for Poplar1? I wonder whether we would see a performance improvement if we re-implemented BinaryTree with a Vec.
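
Something along these lines, as a rough sketch: nodes live in a flat Vec arena and children are indices rather than boxed pointers, which keeps the nodes close together in memory. Names and layout here are hypothetical, not the prio crate's BinaryTree:

```rust
// Sketch of a Vec-backed binary tree with index-based child links.
struct ArenaNode<V> {
    value: V,
    left: Option<usize>,
    right: Option<usize>,
}

struct ArenaTree<V> {
    nodes: Vec<ArenaNode<V>>,
}

impl<V> ArenaTree<V> {
    fn new(root_value: V) -> Self {
        Self {
            nodes: vec![ArenaNode { value: root_value, left: None, right: None }],
        }
    }

    /// Insert a child under `parent` (index 0 is the root), returning its index.
    fn insert_child(&mut self, parent: usize, right: bool, value: V) -> usize {
        let index = self.nodes.len();
        self.nodes.push(ArenaNode { value, left: None, right: None });
        let slot = if right {
            &mut self.nodes[parent].right
        } else {
            &mut self.nodes[parent].left
        };
        *slot = Some(index);
        index
    }

    /// Walk from the root along a path of left/right choices.
    fn get(&self, path: impl IntoIterator<Item = bool>) -> Option<&V> {
        let mut current = 0;
        for right in path {
            let node = &self.nodes[current];
            current = if right { node.right? } else { node.left? };
        }
        Some(&self.nodes[current].value)
    }
}
```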

To the API question: I'd go with a mutable reference to the cache, and I wouldn't expose this in the Aggregator API. Based on my experience prototyping Mastic in Daphne, I expect Poplar1-specific logic to be needed anyway, so I don't think a generic API for caching obviously helps.
