
Implement certificate exchange #378

Merged
merged 44 commits into main from steb/certexchange on Jul 4, 2024
Conversation

Stebalien
Member

This implements a basic certificate exchange protocol (which still needs tests and is probably slightly broken at the moment).

Importantly, it's missing the ability to fetch power table deltas for validating future instances (beyond the latest certificate). My plan is to implement this as a somewhat separate protocol (likely re-using a lot of the same machinery). However:

  1. That protocol is only needed for observer nodes. Active participants in the network will follow the EC chain and will learn these power tables through the EC chain.
  2. That protocol won't need as much guessing because we'll know which power tables should be available given the latest certificate we've received.

The large remaining TODOs are tests and design documentation. The entire protocol has been in constant flux so... I'm sure there are some inconsistencies...

@Stebalien Stebalien requested review from masih and Kubuxu and removed request for Kubuxu June 27, 2024 04:54
@Stebalien
Member Author

@masih I've tagged you for review in case you want to take a look, but I'm also happy to go over it with you synchronously tomorrow morning (that'll also help me figure out the parts I need to document and find anything I might have missed).

@Stebalien Stebalien force-pushed the steb/certexchange branch from 2fe69aa to cf578f3 Compare June 27, 2024 04:56

codecov bot commented Jun 27, 2024

Codecov Report

Attention: Patch coverage is 77.91304% with 127 lines in your changes missing coverage. Please review.

Project coverage is 79.14%. Comparing base (af34f65) to head (c050802).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #378      +/-   ##
==========================================
+ Coverage   79.12%   79.14%   +0.02%     
==========================================
  Files          27       36       +9     
  Lines        2836     3472     +636     
==========================================
+ Hits         2244     2748     +504     
- Misses        361      451      +90     
- Partials      231      273      +42     
Files Coverage Δ
certexchange/polling/predictor.go 100.00% <100.00%> (ø)
certexchange/protocol.go 100.00% <100.00%> (ø)
certexchange/polling/pollstatus_string.go 0.00% <0.00%> (ø)
certexchange/polling/discovery.go 68.88% <68.88%> (ø)
certexchange/polling/poller.go 77.94% <77.94%> (ø)
certexchange/polling/subscriber.go 78.88% <78.88%> (ø)
certexchange/server.go 68.85% <68.85%> (ø)
certexchange/client.go 65.15% <65.15%> (ø)
certexchange/polling/peerTracker.go 83.58% <83.58%> (ø)

... and 7 files with indirect coverage changes

@Stebalien Stebalien force-pushed the steb/certexchange branch 2 times, most recently from e92cc19 to 1a5fd80 Compare June 27, 2024 05:00
@masih masih left a comment (Member)

Nice work Steven 👍

Resolved review threads: go.mod, gen/main.go, certexchange/protocol.go
}

func (s *Subscriber) Stop() error {
	s.stop()
Member

  • I'd avoid panics where possible, in case Start is not called, by checking whether stop is nil.
  • Take a context, since this might take a bit of time to stop?

Member Author

Fixed the first one.

I considered taking a context but stopping is purely cleanup. That is, there's nothing to cancel except, well, the process of cancelling.

Resolved review thread: certexchange/polling/subscriber.go
//
// We don't actually know the _offset_... but whatever. We can keep up +/- one instance and that's
// fine (especially because of the power table lag, etc.).
func (p *predictor) update(progress uint64) time.Duration {
Member

In the future we probably want to make the algorithm here pluggable. There are all sorts of optimisations we can do for faster cert exchange depending on environmental factors.
What's here makes a pretty sweet default 👍

minExploreDistance = 100 * time.Millisecond
)

func newPredictor(targetAccuracy float64, minInterval, defaultInterval, maxInterval time.Duration) *predictor {
Member

Tests for this against a set of well-known scenarios would be awesome.

Member Author

Yeah, we need a lot of tests.

received := 0
for cert := range ch {
	// TODO: consider batching verification, it's slightly faster.
	next, _, pt, err := certs.ValidateFinalityCertificates(
Member

For later research: I'd be curious to see if there is a correlation between the ideal interval and the length of the EC chain, or the number of rounds it took to finalise it.

Member Author

I thought about this, but I don't think we'll get much from it:

  1. In terms of EC chain length, the instance after a GPBFT stall will have a long chain, but the instance after that will likely be normal unless the stall was REALLY long.
  2. In terms of rounds, measuring interval times should be a reasonable proxy.

Also, in terms of rounds, I'm using exponential backoff for both misses and "explore distance" to try to align with the round backoff.

	return items
}

// Knuth 3.4.2S. Could use rand.Perm, but that would allocate a large array.
Member

❤️

// An interval predictor that tries to predict the time between instances. It can't predict the time
// an instance will be available, but it'll keep adjusting the interval until we receive one
// instance per interval.
type predictor struct {
Member

I think this is plenty good as is. Way too early to optimise, but writing down thoughts: in the future I suspect we can improve it further by piggybacking other peers' stats onto exchanged cert requests/responses and feeding them into a stochastic model to fine-tune intervals.

Member Author

I think we have all the information we need locally, and I'm wary of trusting what other peers tell us too much (too easy to bias/attack us). The main improvements I can see are:

  1. Using the number of certificates. E.g., divide the current interval by the number of certificates received. I didn't do this yet because I don't want to adjust our prediction too quickly (especially if the interval is noisy).
  2. Have some mechanism to discover how intervals should be aligned.

But... I don't think either actually matter.

  1. I still want a pubsub based protocol which will allow us to keep up completely.
  2. For GPBFT's purposes, we just need to be within 5 (the power table lookback) instances.

TBH, if we're too accurate, we'll constantly be making unnecessary network requests even when GPBFT is working perfectly and doesn't need them. That was actually the main issue with my previous attempt at this algorithm (it tried to find the exact "edge" where a new certificate would be released).

Member Author

Well, there are probably second-order facts we could learn to let us converge to the correct interval faster. E.g., given a sequence of events/choices, what choice should we make next that's most likely to find the optimal interval.

(we could probably straight-up use transformers for this...)

github-actions bot

Fuzz test failed on commit d9e41a3. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9700329175 -n testdata

Alternatively, download directly from here.

github-actions bot

Fuzz test failed on commit e5cf0a7. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9700533859 -n testdata

Alternatively, download directly from here.

@Stebalien Stebalien force-pushed the steb/certexchange branch from 65865a7 to 96c26bd Compare June 28, 2024 00:29
@Stebalien
Member Author

Ok, really, the predictor should probably be a rolling window where we track how many certificates we received inside that rolling window. Then we predict windowLength/certificatesReceived. But, IMO, it works well enough for now.

@Stebalien Stebalien marked this pull request as ready for review June 28, 2024 15:20
@Stebalien
Member Author

Still writing tests, but this should be ready for review.

Stebalien added 10 commits June 28, 2024 10:28
It was necessary when we subscribed to new certificates, but not anymore.
It makes it harder to test.
@Stebalien Stebalien force-pushed the steb/certexchange branch from c5a1e7d to 5a3dbe0 Compare June 28, 2024 17:28
@Stebalien Stebalien force-pushed the steb/certexchange branch from 431259b to 665d845 Compare June 28, 2024 22:15
@Stebalien
Member Author

Race tests are failing because I'm using real time.... I'll need to find a way to avoid that.

@Stebalien Stebalien force-pushed the steb/certexchange branch from 7b71c39 to 5ca20b8 Compare July 4, 2024 09:35
@Stebalien Stebalien force-pushed the steb/certexchange branch from e2622a4 to 33e8117 Compare July 4, 2024 12:00
s.Log.Debugf("polling %d peers for instance %d", len(peers), s.poller.NextInstance)
for _, peer := range peers {
	oldInstance := s.poller.NextInstance
	res, err := s.poller.Poll(ctx, peer)
Contributor

Note that we need a timeout here to prevent peers from stalling us. It can be an issue for the future.

Member Author

The "client" can (and should) be configured with a timeout. Although varying the timeout over time given the average request time would be a nice touch.

@Kubuxu Kubuxu left a comment (Contributor)

SGTM, there is some complexity I wish we could avoid, but the real world is annoying.


github-actions bot commented Jul 4, 2024

Fuzz test failed on commit 6dd8e20. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9797910777 -n testdata

Alternatively, download directly from here.

@Stebalien Stebalien added this pull request to the merge queue Jul 4, 2024
Merged via the queue into main with commit 345d165 Jul 4, 2024
7 of 8 checks passed
@Stebalien Stebalien deleted the steb/certexchange branch July 4, 2024 18:41