Thoughts on adding and removing nodes #774
martijnbastiaan started this conversation in General
So far our proofs of concept have treated any failure (e.g., transceiver loss of lock, irreparable line encoding errors, or buffer overflows) as unrecoverable: we terminate the test and consider it a failure. For large-scale systems this isn't an acceptable course of action: failures will happen and a Bittide network will have to deal with them gracefully.
In this discussion I'm mostly thinking from a clock control perspective, i.e., what happens to buffers, how should handshaking work, etc.
Status quo
To get a feel for what we could do w.r.t. joining and leaving a network we should first get a feel for how a node boots currently. Let's consider our current hardware implementation (limited to 4 nodes to simplify the drawing):
An FPGA steps through the following script:
A goal for 2025 is to extend these steps such that arbitrary communication is possible in unscheduled mode, which can then be used to collect UGNs and transition into scheduled mode (also see #757 and #758).
Adding nodes gradually
For a few steps (particularly 4 and 7) a node waits for all its neighbors to be online. This is an unrealistic requirement: in real systems we want to extend networks on the fly. To account for this, we change the algorithm to add only one node at a time. Though links can still be negotiated (4) in parallel, clock control will only consider one more link at a time. To pick a synchronization tactic, we add a control bit that indicates whether a node thinks it is part of a synchronized network (`UNSYNCED` vs `SYNCED`).
Connecting an `UNSYNCED` node to a `SYNCED` one
The unsynchronized node will observe the `SYNCED` one's clock and line up with it. Meanwhile, the `SYNCED` one will ignore the `UNSYNCED` one's clock until the unsynchronized node indicates it is now part of a synchronized network. (Very much analogous to attaching a new electricity generator to a running grid.) To keep symmetry between the buffer occupancies, the `UNSYNCED` node should reset its counters after synchronizing and before joining the network.
Connecting two `UNSYNCED` nodes
Both `UNSYNCED` nodes consider, by definition, no other neighbors yet. Both nodes pick a leader (either by comparing unique IDs or by random election) and resolve the situation as in "Connecting an `UNSYNCED` node to a `SYNCED` one".
Connecting two `SYNCED` nodes (no split brain)
Both nodes should observe that they're already synced, albeit through other nodes. The link can be brought to unscheduled mode without further synchronization.
Connecting two `SYNCED` nodes (split brain)
The trickiest situation is where both nodes consider themselves synchronized, but they would form the link between two unconnected graphs. In this case the nodes will (most probably) observe that their clock frequencies don't match. Perhaps the management unit can start applying pressure towards the other's frequency by submitting FINC/FDEC pulses independent of clock control. This should be fine for the clock control algorithm, as it should not be distinguishable from natural causes (heating up / cooling down of the oscillator).
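To make the four cases above concrete, here is a minimal Python sketch of how one side of a new link might pick its tactic. Everything in it is hypothetical and only for illustration: the names `SyncState`, `Tactic`, `LinkView` and `pick_tactic`, the choice of "lowest ID wins" for leader election, and the assumption that a node already has a boolean telling it whether the two clock frequencies match. None of this is existing Bittide code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SyncState(Enum):
    UNSYNCED = auto()
    SYNCED = auto()


class Tactic(Enum):
    FOLLOW_REMOTE = auto()        # line up with the remote clock, reset counters, then join
    IGNORE_UNTIL_SYNCED = auto()  # ignore the remote clock until it reports SYNCED
    NO_ACTION = auto()            # already synced through other nodes
    NUDGE_FREQUENCY = auto()      # split brain: apply FINC/FDEC pressure


@dataclass(frozen=True)
class LinkView:
    """What one node knows about a freshly negotiated link (hypothetical)."""
    local_state: SyncState
    remote_state: SyncState
    local_id: int
    remote_id: int
    frequencies_match: bool  # assumed to be measured elsewhere


def pick_tactic(view: LinkView) -> Tactic:
    local, remote = view.local_state, view.remote_state

    if local is SyncState.UNSYNCED and remote is SyncState.SYNCED:
        return Tactic.FOLLOW_REMOTE
    if local is SyncState.SYNCED and remote is SyncState.UNSYNCED:
        return Tactic.IGNORE_UNTIL_SYNCED

    if local is SyncState.UNSYNCED and remote is SyncState.UNSYNCED:
        # Assumption: leader election by comparing unique IDs, lowest ID leads.
        if view.local_id == view.remote_id:
            raise ValueError("leader election needs unique node IDs")
        if view.local_id < view.remote_id:
            # The leader behaves like the SYNCED side of the first case.
            return Tactic.IGNORE_UNTIL_SYNCED
        return Tactic.FOLLOW_REMOTE

    # Both sides are SYNCED: matching frequencies suggest a shared network.
    if view.frequencies_match:
        return Tactic.NO_ACTION
    return Tactic.NUDGE_FREQUENCY
```

Both ends of a link would evaluate this with their own view, so the two `UNSYNCED` ends arrive at complementary tactics as long as the IDs differ. Random election would need an extra exchange and is left out of the sketch.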
Notes
Removing nodes
I think we can distinguish between two ways of removing a node:
A sudden removal is less painful for the network, as it doesn't have a chance to influence clock frequencies and therefore how "tokens" in elastic buffers are distributed across the system. A slow removal does do that. In the slow case, nodes should refuse to re-add nodes without reframing. In both cases clock control should freeze buffer occupancies to the last known state, only resetting them to their midpoints on reframing.
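The bookkeeping described above could look roughly like the following Python sketch. The names `Removal` and `LinkBuffer` are hypothetical, how a removal gets classified as sudden or slow is taken as an input rather than modelled, and the behaviour on a permitted re-add (resume from the frozen value) is an assumption on my part.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Removal(Enum):
    SUDDEN = auto()
    SLOW = auto()


@dataclass
class LinkBuffer:
    """Clock control's view of one elastic buffer (hypothetical model)."""
    depth: int
    occupancy: int
    frozen: bool = False
    needs_reframe: bool = False

    def observe(self, occupancy: int) -> None:
        # While frozen, clock control keeps using the last known occupancy.
        if not self.frozen:
            self.occupancy = occupancy

    def on_removed(self, kind: Removal) -> None:
        # Both kinds of removal freeze the occupancy at its last known state.
        self.frozen = True
        # Only a slow removal had time to shift tokens around the network.
        self.needs_reframe = kind is Removal.SLOW

    def readd(self) -> bool:
        # Refuse to re-add after a slow removal until reframing has happened.
        if self.needs_reframe:
            return False
        # Assumption: a permitted re-add resumes from the frozen occupancy.
        self.frozen = False
        return True

    def reframe(self) -> None:
        # Reframing is the only point where occupancy returns to the midpoint.
        self.occupancy = self.depth // 2
        self.frozen = False
        self.needs_reframe = False
```

For example, after `on_removed(Removal.SLOW)` a call to `readd()` returns `False` until `reframe()` has run, whereas after `on_removed(Removal.SUDDEN)` the link can be re-added directly with its occupancy unchanged.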
Open questions