Implement certificate exchange #378
Conversation
@masih I've tagged you for review in case you want to take a look, but I'm also happy to go over it with you synchronously tomorrow morning (that'll also help me figure out the parts I need to document and find anything I might have missed).
Force-pushed from 2fe69aa to cf578f3.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #378 +/- ##
==========================================
+ Coverage 79.12% 79.14% +0.02%
==========================================
Files 27 36 +9
Lines 2836 3472 +636
==========================================
+ Hits 2244 2748 +504
- Misses 361 451 +90
- Partials 231 273 +42
Force-pushed from e92cc19 to 1a5fd80.
Nice work Steven 👍
certexchange/polling/subscriber.go
Outdated
}

func (s *Subscriber) Stop() error {
	s.stop()
- I'd avoid panics where possible in case Start is not called, by checking if stop is nil.
- Take context since this might take a bit of time to stop?
Fixed the first one.
I considered taking a context but stopping is purely cleanup. That is, there's nothing to cancel except, well, the process of cancelling.
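A minimal sketch of that nil-guard, assuming stop is stored as a nil-able func (e.g. a context.CancelFunc) on the Subscriber; the actual field and return handling may differ:

```go
// Sketch only: make Stop a no-op if Start was never called.
func (s *Subscriber) Stop() error {
	if s.stop != nil {
		s.stop()
	}
	return nil
}
```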
//
// We don't actually know the _offset_... but whatever. We can keep up +/- one instance and that's
// fine (especially because of the power table lag, etc.).
func (p *predictor) update(progress uint64) time.Duration {
In the future we probably want to make the algo here pluggable. There are all sorts of optimisations we can do for faster cert exchange depending on environmental factors.
What's here makes a pretty sweet default 👍
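As a hypothetical illustration of what a pluggable predictor could look like (the interface name and method are invented for this sketch, not part of the PR):

```go
package polling

import "time"

// intervalPredictor is a sketch of a seam for swapping prediction algorithms:
// given the progress observed since the last poll, suggest the next interval.
type intervalPredictor interface {
	update(progress uint64) time.Duration
}
```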
certexchange/polling/predictor.go
Outdated
	minExploreDistance = 100 * time.Millisecond
)

func newPredictor(targetAccuracy float64, minInterval, defaultInterval, maxInterval time.Duration) *predictor {
Tests for this would be awesome for a set of well known scenarios.
Yeah, we need a lot of tests.
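A sketch of one such well-known scenario, assuming update takes the number of new instances observed since the last poll and clamps its result to [minInterval, maxInterval] (both assumptions; the constructor signature matches the snippet above):

```go
package polling

import (
	"testing"
	"time"
)

// Sketch: with exactly one new instance per poll, the suggested interval
// should stay within the configured bounds.
func TestPredictorSteadyProgress(t *testing.T) {
	p := newPredictor(0.05, time.Second, 30*time.Second, 2*time.Minute)

	var interval time.Duration
	for i := 0; i < 100; i++ {
		interval = p.update(1)
	}

	if interval < time.Second || interval > 2*time.Minute {
		t.Fatalf("predicted interval %s is outside [1s, 2m]", interval)
	}
}
```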
received := 0
for cert := range ch {
	// TODO: consider batching verification, it's slightly faster.
	next, _, pt, err := certs.ValidateFinalityCertificates(
For later research: I'd be curious to see if there is a correlation between the ideal interval and the length of ECChain, or the number of rounds it took to finalise it.
I thought about this, but I don't think we'll get much from that:
- In terms of EC chain length, the instance after a GPBFT stall will have a long chain, but the instance after that will likely be normal unless the stall was REALLY long.
- In terms of rounds, measuring interval times should be a reasonable proxy.
Also, in terms of rounds, I'm using exponential backoff for both misses and "explore distance" to try to align with the round backoff.
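A minimal sketch of the exponential backoff idea described above (the function and its parameters are illustrative, not the PR's code):

```go
package polling

import "time"

// backoff doubles the delay after a miss and resets it after a hit, clamped
// to [min, max], mirroring the "double on miss" behaviour described above.
func backoff(current, min, max time.Duration, hit bool) time.Duration {
	if hit {
		return min
	}
	next := current * 2
	if next > max {
		next = max
	}
	return next
}
```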
	return items
}

// Knuth 3.4.2S. Could use rand.Perm, but that would allocate a large array.
❤️
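For reference, a self-contained sketch of Knuth's Algorithm S (TAOCP Vol. 2, section 3.4.2, selection sampling), which picks k of n indices without allocating an n-sized permutation; the function name and signature here are illustrative, not the PR's code:

```go
package polling

import "math/rand"

// selectK returns k distinct indices from [0, n) in increasing order.
// It visits each index once and selects it with probability (k-m)/(n-t),
// where m indices have been selected out of the t visited so far.
// Requires 0 <= k <= n.
func selectK(rng *rand.Rand, n, k int) []int {
	out := make([]int, 0, k)
	for t, m := 0, 0; m < k; t++ {
		if rng.Intn(n-t) < k-m {
			out = append(out, t)
			m++
		}
	}
	return out
}
```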
// An interval predictor that tries to predict the time between instances. It can't predict the time
// an instance will be available, but it'll keep adjusting the interval until we receive one
// instance per interval.
type predictor struct {
I think this is plenty good as is. Way too early to optimise, but writing down thoughts: in the future I suspect we can improve it further by piggybacking other peers' stats over exchanged cert requests/responses and feeding them to a stochastic model to fine-tune intervals.
I think we have all the information we need locally, and I'm wary of trusting what other peers tell us too much (too easy to bias/attack us). The main improvements I can see are:
- Using the number of certificates. E.g., divide the current interval by the number of certificates received. I didn't do this yet because I don't want to adjust our prediction too quickly (especially if the interval is noisy).
- Have some mechanism to discover how intervals should be aligned.
But... I don't think either actually matters.
- I still want a pubsub based protocol which will allow us to keep up completely.
- For GPBFT's purposes, we just need to be within 5 (the power table lookback) instances.
TBH, if we're too accurate, we'll constantly be making unnecessary network requests even when GPBFT is working perfectly and doesn't need them. That was actually the main issue with my previous attempt at this algorithm (it tried to find the exact "edge" where a new certificate would be released).
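A tiny sketch of the first idea, dividing the current interval by the number of certificates received (illustrative only; a real implementation would want to damp this to avoid over-reacting to a noisy interval, as noted above):

```go
package polling

import "time"

// adjustInterval shrinks the interval proportionally when more than one
// certificate arrived in the last poll, leaving it unchanged otherwise.
func adjustInterval(current time.Duration, certsReceived int) time.Duration {
	if certsReceived > 1 {
		return current / time.Duration(certsReceived)
	}
	return current
}
```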
Well, there are probably second-order facts we could learn to let us converge to the correct interval faster. E.g., given a sequence of events/choices, what choice should we make next that's most likely to find the optimal interval.
(we could probably straight-up use transformers for this...)
Fuzz test failed on commit d9e41a3. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9700329175 -n testdata. Alternatively, download directly from here.
Fuzz test failed on commit e5cf0a7. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9700533859 -n testdata. Alternatively, download directly from here.
Force-pushed from 65865a7 to 96c26bd.
Ok, really, the predictor should probably be a rolling window where we track how many certificates we received inside that rolling window. Then we predict
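A hypothetical sketch of that rolling-window approach (types and names are invented for illustration):

```go
package polling

import "time"

// rollingWindow tracks certificate arrival times inside a fixed window and
// predicts the next interval from the observed spacing.
type rollingWindow struct {
	window   time.Duration
	arrivals []time.Time
}

// record notes an arrival and drops anything older than the window.
func (w *rollingWindow) record(now time.Time) {
	w.arrivals = append(w.arrivals, now)
	cutoff := now.Add(-w.window)
	i := 0
	for i < len(w.arrivals) && w.arrivals[i].Before(cutoff) {
		i++
	}
	w.arrivals = w.arrivals[i:]
}

// predict returns the average spacing between arrivals in the window, or a
// fallback when there are too few samples to say anything useful.
func (w *rollingWindow) predict(fallback time.Duration) time.Duration {
	if len(w.arrivals) < 2 {
		return fallback
	}
	span := w.arrivals[len(w.arrivals)-1].Sub(w.arrivals[0])
	return span / time.Duration(len(w.arrivals)-1)
}
```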
Still writing tests, but this should be ready for review.
This implements a basic certificate exchange protocol (which still needs tests and is probably slightly broken at the moment).

Importantly, it's missing the ability to fetch power table deltas for validating future instances (beyond the latest certificate). My plan is to implement this as a somewhat separate protocol (likely re-using a lot of the same machinery). However:

1. That protocol is only needed for observer nodes. Active participants in the network will follow the EC chain and will learn these power tables through the EC chain.
2. That protocol won't need as much guessing because we'll _know_ which power tables should be available given the latest certificate we've received.

The large remaining TODOs are tests and design documentation. The entire protocol has been in constant flux so... I'm sure there are some inconsistencies...
It was necessary when we subscribed to new certificates, but not anymore.
It makes it harder to test.
Force-pushed from c5a1e7d to 5a3dbe0.
Force-pushed from 431259b to 665d845.
and tests
Race tests are failing because I'm using real time... I'll need to find a way to avoid that.
Force-pushed from 7b71c39 to 5ca20b8.
Force-pushed from e2622a4 to 33e8117.
s.Log.Debugf("polling %d peers for instance %d", len(peers), s.poller.NextInstance)
for _, peer := range peers {
	oldInstance := s.poller.NextInstance
	res, err := s.poller.Poll(ctx, peer)
Note that we need a timeout here to prevent peers from stalling us. It can be an issue for the future.
The "client" can (and should) be configured with a timeout. Although varying the timeout over time given the average request time would be a nice touch.
SGTM, there is some complexity I wish we could avoid, but the real world is annoying.
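A minimal sketch of the per-request timeout discussed above; the wrapper name, result type, and RequestTimeout field are assumptions for illustration, not the PR's API:

```go
// Sketch only: bound each poll so a slow peer can't stall the polling loop.
func (s *Subscriber) pollWithTimeout(ctx context.Context, p peer.ID) (*PollResult, error) {
	pollCtx, cancel := context.WithTimeout(ctx, s.RequestTimeout)
	defer cancel()
	return s.poller.Poll(pollCtx, p)
}
```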
Fuzz test failed on commit 6dd8e20. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9797910777 -n testdata. Alternatively, download directly from here.