
Implement certificate exchange #378

Merged
merged 44 commits into main from steb/certexchange on Jul 4, 2024
Conversation

Stebalien
Member

This implements a basic certificate exchange protocol (which still needs tests and is probably slightly broken at the moment).

Importantly, it's missing the ability to fetch power table deltas for validating future instances (beyond the latest certificate). My plan is to implement this as a somewhat separate protocol (likely re-using a lot of the same machinery). However:

  1. That protocol is only needed for observer nodes. Active participants in the network will follow the EC chain and will learn these power tables through the EC chain.
  2. That protocol won't need as much guessing because we'll know which power tables should be available given the latest certificate we've received.

The large remaining TODOs are tests and design documentation. The entire protocol has been in constant flux so... I'm sure there are some inconsistencies...

@Stebalien Stebalien requested review from masih and Kubuxu and removed request for Kubuxu June 27, 2024 04:54
@Stebalien
Member Author

@masih I've tagged you for review in case you want to take a look, but I'm also happy to go over it with you synchronously tomorrow morning (that'll also help me figure out the parts I need to document and find anything I might have missed).

@Stebalien Stebalien force-pushed the steb/certexchange branch from 2fe69aa to cf578f3 Compare June 27, 2024 04:56

codecov bot commented Jun 27, 2024

Codecov Report

Attention: Patch coverage is 77.91304% with 127 lines in your changes missing coverage. Please review.

Project coverage is 79.14%. Comparing base (af34f65) to head (c050802).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #378      +/-   ##
==========================================
+ Coverage   79.12%   79.14%   +0.02%     
==========================================
  Files          27       36       +9     
  Lines        2836     3472     +636     
==========================================
+ Hits         2244     2748     +504     
- Misses        361      451      +90     
- Partials      231      273      +42     
Files Coverage Δ
certexchange/polling/predictor.go 100.00% <100.00%> (ø)
certexchange/protocol.go 100.00% <100.00%> (ø)
certexchange/polling/pollstatus_string.go 0.00% <0.00%> (ø)
certexchange/polling/discovery.go 68.88% <68.88%> (ø)
certexchange/polling/poller.go 77.94% <77.94%> (ø)
certexchange/polling/subscriber.go 78.88% <78.88%> (ø)
certexchange/server.go 68.85% <68.85%> (ø)
certexchange/client.go 65.15% <65.15%> (ø)
certexchange/polling/peerTracker.go 83.58% <83.58%> (ø)

... and 7 files with indirect coverage changes

@Stebalien Stebalien force-pushed the steb/certexchange branch 2 times, most recently from e92cc19 to 1a5fd80 Compare June 27, 2024 05:00
@masih masih left a comment (Member)

Nice work Steven 👍

Resolved review threads: go.mod, gen/main.go, certexchange/protocol.go
}

func (s *Subscriber) Stop() error {
	s.stop()
Member

  • I'd avoid panics where possible, in case Start is not called, by checking whether stop is nil.
  • Take a context, since this might take a bit of time to stop?

Member Author

Fixed the first one.

I considered taking a context but stopping is purely cleanup. That is, there's nothing to cancel except, well, the process of cancelling.

Resolved review thread: certexchange/polling/subscriber.go
//
// We don't actually know the _offset_... but whatever. We can keep up +/- one instance and that's
// fine (especially because of the power table lag, etc.).
func (p *predictor) update(progress uint64) time.Duration {
Member

In the future we probably want to make the algorithm here pluggable. There are all sorts of optimisations we can do for faster cert exchange depending on environmental factors.
What's here makes a pretty sweet default 👍

minExploreDistance = 100 * time.Millisecond
)

func newPredictor(targetAccuracy float64, minInterval, defaultInterval, maxInterval time.Duration) *predictor {
Member

Tests for this against a set of well-known scenarios would be awesome.

Member Author

Yeah, we need a lot of tests.

received := 0
for cert := range ch {
	// TODO: consider batching verification, it's slightly faster.
	next, _, pt, err := certs.ValidateFinalityCertificates(
Member

For later research: I'd be curious to see if there is a correlation between the ideal interval and the length of the EC chain, or the number of rounds it took to finalise it.

Member Author

I thought about this, but I don't think we'll get much from it:

  1. In terms of EC chain length, the instance after a GPBFT stall will have a long chain, but the instance after that will likely be normal unless the stall was REALLY long.
  2. In terms of rounds, measuring interval times should be a reasonable proxy.

Also, in terms of rounds, I'm using exponential backoff for both misses and "explore distance" to try to align with the round backoff.

	return items
}

// Knuth 3.4.2S. Could use rand.Perm, but that would allocate a large array.
Member

❤️

// An interval predictor that tries to predict the time between instances. It can't predict the time
// an instance will be available, but it'll keep adjusting the interval until we receive one
// instance per interval.
type predictor struct {
Member

I think this is plenty good as is. Way too early to optimise, but writing down thoughts: in the future I suspect we can improve it further by piggybacking other peers' stats onto exchanged cert requests/responses and feeding them into a stochastic model to fine-tune intervals.

Member Author

I think we have all the information we need locally, and I'm wary of trusting what other peers tell us too much (too easy to bias/attack us). The main improvements I can see are:

  1. Using the number of certificates. E.g., divide the current interval by the number of certificates received. I didn't do this yet because I don't want to adjust our prediction too quickly (especially if the interval is noisy).
  2. Have some mechanism to discover how intervals should be aligned.

But... I don't think either actually matter.

  1. I still want a pubsub based protocol which will allow us to keep up completely.
  2. For GPBFT's purposes, we just need to be within 5 (the power table lookback) instances.

TBH, if we're too accurate, we'll constantly be making unnecessary network requests even when GPBFT is working perfectly and doesn't need them. That was actually the main issue with my previous attempt at this algorithm (it tried to find the exact "edge" where a new certificate would be released).

Member Author

Well, there are probably second-order facts we could learn to let us converge to the correct interval faster. E.g., given a sequence of events/choices, what choice should we make next that's most likely to find the optimal interval.

(we could probably straight-up use transformers for this...)

github-actions bot

Fuzz test failed on commit d9e41a3. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9700329175 -n testdata

Alternatively, download directly from here.

github-actions bot

Fuzz test failed on commit e5cf0a7. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9700533859 -n testdata

Alternatively, download directly from here.

@Stebalien Stebalien force-pushed the steb/certexchange branch from 65865a7 to 96c26bd Compare June 28, 2024 00:29
@Stebalien
Member Author

Ok, really, the predictor should probably be a rolling window where we track how many certificates we received inside that rolling window. Then we predict windowLength/certificatesReceived. But, IMO, it works well enough for now.

@Stebalien Stebalien marked this pull request as ready for review June 28, 2024 15:20
@Stebalien
Member Author

Still writing tests, but this should be ready for review.

Stebalien added 10 commits June 28, 2024 10:28
It was necessary when we subscribed to new certificates, but not anymore.
It makes it harder to test.
@Stebalien Stebalien force-pushed the steb/certexchange branch from c5a1e7d to 5a3dbe0 Compare June 28, 2024 17:28
@Stebalien Stebalien force-pushed the steb/certexchange branch from 431259b to 665d845 Compare June 28, 2024 22:15
@Stebalien
Member Author

Race tests are failing because I'm using real time.... I'll need to find a way to avoid that.

@Stebalien Stebalien force-pushed the steb/certexchange branch from 7b71c39 to 5ca20b8 Compare July 4, 2024 09:35
@Stebalien Stebalien force-pushed the steb/certexchange branch from e2622a4 to 33e8117 Compare July 4, 2024 12:00
s.Log.Debugf("polling %d peers for instance %d", len(peers), s.poller.NextInstance)
for _, peer := range peers {
	oldInstance := s.poller.NextInstance
	res, err := s.poller.Poll(ctx, peer)
Contributor

Note that we need a timeout here to prevent peers from stalling us. It can be an issue for the future.

Member Author

The "client" can (and should) be configured with a timeout. Although varying the timeout over time given the average request time would be a nice touch.

@Kubuxu Kubuxu left a comment (Contributor)

SGTM, there is some complexity I wish we could avoid, but the real world is annoying.


github-actions bot commented Jul 4, 2024

Fuzz test failed on commit 6dd8e20. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9797910777 -n testdata

Alternatively, download directly from here.

@Stebalien Stebalien added this pull request to the merge queue Jul 4, 2024
Merged via the queue into main with commit 345d165 Jul 4, 2024
7 of 8 checks passed
@Stebalien Stebalien deleted the steb/certexchange branch July 4, 2024 18:41