Thoughts on adding and removing nodes #774

martijnbastiaan · 2025-05-16T22:07:23Z

martijnbastiaan
May 16, 2025
Maintainer

So far our proofs of concept have treated any failure (e.g., transceiver loss of lock, irreparable line encoding errors, or buffer overflows) as unrecoverable: we terminate the test and consider it a failure. For large-scale systems this isn't an acceptable course of action: failures will happen and a Bittide network will have to deal with them gracefully.

In this discussion I'm mostly thinking from a clock control perspective, i.e., what happens to buffers, how should handshaking work, etc.

Status quo

To get a feel for what we could do w.r.t. joining and leaving a network we should first get a feel for how a node boots currently. Let's consider our current hardware implementation (limited to 4 nodes to simplify the drawing):

An FPGA steps through the following script:

1. It waits for its always-on clock to be stable.
2. It programs its controllable clock (clock synth in the diagram).
3. It waits for the controllable clock to be stable.
4. It deasserts the reset of the transceiver subsystems and waits for them to indicate that they have set up a stable link. At this point, the FPGA sends out its own controllable clock on its outgoing links and recovers N (neighbor-)controllable clocks on its incoming links. Now that the low-level handshake between the transceivers is done, the link is considered up. No bittide data is being exchanged just yet.
   - Note that while a link is up, it must transfer data every clock tick. While it is not sending bittide data, PRBS data plus control bits are sent. The control bits allow the nodes to communicate whether they're ready to receive "real" (bittide) data and whether the next word they're going to send is still PRBS data or bittide data.
5. It deasserts the reset of the clock control subsystem. Even though no user data is being exchanged between the transceivers, clock control can do its work: it only needs to observe clock speed differences between its own clock and the recovered clocks.
6. It waits until clock control indicates it is stable, i.e., roughly in sync with its neighbors.
7. It deasserts the resets of its elastic buffers. These will now stay roughly at their midpoints, due to the synchronized clocks.
8. It indicates to its neighbors that it is ready to receive bittide data.
9. It waits for its neighbors to indicate that they are ready to receive bittide data. The very first frame that is sent over the link as bittide is the local clock counter. After that, the link is directly connected to the switch, sending whatever the switch connects its input to. From this point on a link is considered to run in unscheduled mode, i.e., UGNs haven't been sent to a scheduler and no calendars/programs have been exchanged yet.
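As a rough illustration, the boot script above can be read as a linear state machine. The sketch below is only a model, not the actual hardware or firmware: the stage names and the `advance` helper are made up for this example, and each stage's completion condition is reduced to a single boolean.

```python
from enum import IntEnum


class BootStage(IntEnum):
    """Hypothetical names for the boot steps, in the order listed above."""
    WAIT_ALWAYS_ON_CLOCK = 1
    PROGRAM_CONTROLLABLE_CLOCK = 2
    WAIT_CONTROLLABLE_CLOCK = 3
    WAIT_TRANSCEIVER_LINKS = 4
    START_CLOCK_CONTROL = 5
    WAIT_CLOCK_CONTROL_STABLE = 6
    RELEASE_ELASTIC_BUFFERS = 7
    ANNOUNCE_READY = 8
    WAIT_NEIGHBORS_READY = 9
    UNSCHEDULED = 10  # terminal stage: link runs in unscheduled mode


def advance(stage: BootStage, condition_met: bool) -> BootStage:
    """Move to the next stage once the current stage's condition holds."""
    if stage is BootStage.UNSCHEDULED or not condition_met:
        return stage
    return BootStage(stage + 1)
```

A node stays in its current stage for as long as `condition_met` is false, which mirrors the "waits for" wording of the steps.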

A goal for 2025 is to extend these steps such that arbitrary communication is possible in unscheduled mode, which can then be used to collect UGNs and transition into scheduled mode (also see #757 and #758).

Adding nodes gradually

For a few steps (particularly 4 and 7) a node waits for all its neighbors to be online. This is an unrealistic property: in real systems we want to extend networks on-the-fly. To account for this, we change the algorithm to only add one node at a time. Though links can still be negotiated (4) in parallel, clock control will only consider one more link at a time. To pick a synchronization tactic, we add a control bit that indicates whether a node thinks it is part of a synchronized network (UNSYNCED vs SYNCED).
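To make the control bits concrete, here is a minimal sketch of how the existing two bits (ready to receive, next word is bittide) and the proposed synchronization bit could be packed into a control field. The bit positions, the `ControlBits` name, and the field names are assumptions for illustration; the actual encoding on the link is not specified here.

```python
from dataclasses import dataclass

# Hypothetical bit positions; the real layout is not defined in this post.
READY_BIT = 0b001            # node is ready to receive bittide data
NEXT_IS_BITTIDE_BIT = 0b010  # next word sent is bittide data, not PRBS
SYNCED_BIT = 0b100           # node believes it is part of a synced network


@dataclass(frozen=True)
class ControlBits:
    ready_to_receive: bool
    next_word_is_bittide: bool
    synced: bool

    def pack(self) -> int:
        """Encode the flags into a small integer."""
        raw = 0
        if self.ready_to_receive:
            raw |= READY_BIT
        if self.next_word_is_bittide:
            raw |= NEXT_IS_BITTIDE_BIT
        if self.synced:
            raw |= SYNCED_BIT
        return raw

    @classmethod
    def unpack(cls, raw: int) -> "ControlBits":
        """Decode the flags from an integer produced by pack()."""
        return cls(
            ready_to_receive=bool(raw & READY_BIT),
            next_word_is_bittide=bool(raw & NEXT_IS_BITTIDE_BIT),
            synced=bool(raw & SYNCED_BIT),
        )
```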

Connecting an `UNSYNCED` node to a `SYNCED` one

The UNSYNCED node will observe the SYNCED one's clock and line up with it. Meanwhile, the SYNCED one will ignore the UNSYNCED one's clock until the UNSYNCED node indicates it is now part of a synchronized network. (Very much analogous to attaching a new electricity generator to a running grid.) To keep symmetry between the buffer occupancies, the UNSYNCED node should reset its counters after synchronizing and before joining the network.

Connecting two `UNSYNCED` nodes

Both UNSYNCED nodes consider, by definition, no other neighbors yet. Both nodes pick a leader (either by comparing unique IDs or by random election) and resolve the situation as in "Connecting an UNSYNCED node to a SYNCED one".
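A minimal sketch of the ID-comparison variant, assuming each node has a unique integer ID and that (arbitrarily) the lower ID leads. The function name and the tie-breaking rule are assumptions; any rule works as long as both ends evaluate it the same way.

```python
def is_leader(local_id: int, remote_id: int) -> bool:
    """Return True if this node should act as the SYNCED side.

    Assumption: IDs are unique and the lower ID leads. Both nodes run the
    same comparison, so exactly one of them ends up as leader.
    """
    if local_id == remote_id:
        raise ValueError("node IDs must be unique")
    return local_id < remote_id
```

The leader then behaves like the SYNCED node in the previous case, and the other node follows its clock.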

Connecting two `SYNCED` nodes (no split brain)

Both nodes should observe that they're already synced, albeit through other nodes. The link can be brought to unscheduled mode without further synchronization.

Connecting two `SYNCED` nodes (split brain)

The trickiest situation is where both nodes consider themselves synchronized, but they would form the link between two unconnected graphs. In this case the nodes will (most probably) observe that their clock frequencies don't match. Perhaps the management unit can start applying pressure towards the other's frequency by submitting FINC/FDEC pulses independent of clock control. This should be fine for the clock control algorithm, as it should not be distinguishable from natural causes (heating up / cooling down of the oscillator).
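The four cases can be summarized as a small decision function. This is a sketch under assumptions: the `Tactic` names are invented for illustration, and split brain is detected purely via a `frequencies_match` flag, which is a simplification of "the nodes observe that their clock frequencies don't match".

```python
from enum import Enum, auto


class Tactic(Enum):
    """Hypothetical labels for the four cases described above."""
    FOLLOW_NEIGHBOR = auto()  # we are UNSYNCED, neighbor is SYNCED
    IGNORE_NEIGHBOR = auto()  # we are SYNCED, neighbor is UNSYNCED
    ELECT_LEADER = auto()     # both UNSYNCED
    LINK_UP_ONLY = auto()     # both SYNCED, same network (no split brain)
    APPLY_PRESSURE = auto()   # both SYNCED, different networks (split brain)


def pick_tactic(local_synced: bool, remote_synced: bool,
                frequencies_match: bool) -> Tactic:
    """Choose a synchronization tactic for a newly negotiated link."""
    if not local_synced and not remote_synced:
        return Tactic.ELECT_LEADER
    if not local_synced:
        return Tactic.FOLLOW_NEIGHBOR
    if not remote_synced:
        return Tactic.IGNORE_NEIGHBOR
    # Both sides claim to be synced: matching frequencies suggest they are
    # already connected through other nodes.
    return Tactic.LINK_UP_ONLY if frequencies_match else Tactic.APPLY_PRESSURE
```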

Notes

- A hostile (or failing) node could try to drag a node/cluster down by repeatedly connecting to a network, claiming to already be synchronized. Nodes should refuse to lower or raise their clocks by more than a set amount, based on the maximum natural clock divergence specified by the oscillator's manufacturer.
- After connecting a split brain, it is more likely that buffer occupancies become off-center (citation needed :-)). It might make sense to refuse to add another node until after the next reframe event.
- Adding a node with centered buffers is a no-op for clock control.
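The first note could be sketched as a guard that tracks the net number of FINC/FDEC pulses and refuses any pulse that would move the clock beyond a fixed bound. The class name, the `max_offset_steps` parameter, and counting the bound in pulses (rather than, say, ppm) are assumptions made for this example; the bound itself would have to be derived from the oscillator's datasheet.

```python
class BoundedClockAdjuster:
    """Refuse FINC/FDEC pulses that exceed a fixed offset from nominal."""

    def __init__(self, max_offset_steps: int) -> None:
        # Hypothetical limit: the maximum natural divergence expressed as a
        # number of FINC/FDEC steps away from the nominal frequency.
        self.max_offset_steps = max_offset_steps
        self.offset_steps = 0  # net pulses applied so far

    def request(self, pulse: int) -> bool:
        """Apply one pulse (+1 for FINC, -1 for FDEC) if within bounds.

        Returns True if the pulse was applied, False if it was refused.
        """
        if pulse not in (1, -1):
            raise ValueError("pulse must be +1 (FINC) or -1 (FDEC)")
        candidate = self.offset_steps + pulse
        if abs(candidate) > self.max_offset_steps:
            return False
        self.offset_steps = candidate
        return True
```

With `max_offset_steps=2`, a third consecutive FINC is refused while a subsequent FDEC is still accepted, so a misbehaving neighbor cannot drag the clock arbitrarily far.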

Removing nodes

I think we can distinguish between two ways of removing a node:

- Sudden removal: a node looks fine one clock cycle (buffers ~half full) and is dead on the next. This can happen in case of (for example) loss of lock, decoding errors, or hardware failures.
- Slow removal: the links look fine, but buffers start to under- or overflow.

A sudden removal is less painful for the network, as it doesn't have a chance to influence clock frequencies and therefore how "tokens" in elastic buffers are distributed across the system. A slow removal does do that. In the slow case, nodes should refuse to re-add nodes without reframing. In both cases clock control should freeze buffer occupancies to the last known state, only resetting them to their midpoints on reframing.
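A minimal sketch of the bookkeeping this implies for a single link, as seen by clock control. All names (`LinkView`, `remove`, `reframe`, `may_readd`) are hypothetical, and the model ignores everything except the freeze and reframe rules described above.

```python
from dataclasses import dataclass


@dataclass
class LinkView:
    """Clock control's view of one elastic buffer (illustrative only)."""
    midpoint: int
    occupancy: int
    frozen: bool = False
    needs_reframe: bool = False

    def observe(self, occupancy: int) -> None:
        """Track live occupancy; ignored once the neighbor is removed."""
        if not self.frozen:
            self.occupancy = occupancy

    def remove(self, slow: bool) -> None:
        """Freeze at the last known state; slow removals require a reframe."""
        self.frozen = True
        self.needs_reframe = self.needs_reframe or slow

    def reframe(self) -> None:
        """Only a reframe resets the occupancy to the midpoint."""
        self.occupancy = self.midpoint
        self.needs_reframe = False

    def may_readd(self) -> bool:
        """A slowly removed node may only be re-added after reframing."""
        return not self.needs_reframe
```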

Open questions

- Re-adding a node will yield different UGNs each time. We could try to stabilize the RTT UGN between boots; would this help a scheduler?
- I feel it would be almost impossible to re-route immediately after a node disappears, as scheduling is way too expensive to do that. It seems to me that the application layer should be designed such that problems can easily be shifted to other "circuits" that don't involve the disappeared node.
