Cluster Network MTU
If the cluster admin does not specify the cluster MTU explicitly in the network operator config at install time (`.spec.defaultNetwork.openshiftSDNConfig.MTU` or `.spec.defaultNetwork.ovnKubernetesConfig.MTU`), then the CNO will autodetect it when it first starts during the initial cluster install. It does this by looking at the MTU of the default network interface on the (master) node where CNO is running, and then subtracting the VXLAN/Geneve tunnel overhead (as described below) to leave room for the encapsulation headers. After detecting the MTU at install time, CNO will not change it again.
This means:
- If the masters' default interfaces don't have the right MTU, then CNO will set an incorrect cluster MTU based on the incorrect interface MTU. Changing the network config WILL NOT fix this, because then you're just telling openshift-sdn/ovn-kubernetes to use an MTU that is incompatible with the network it is running on. You need to fix the nodes so that their interfaces have the right MTU.
- If the masters have a higher MTU than the workers, then CNO will set an incorrect cluster MTU because it just assumes that the workers are going to have the same MTU as the masters. (This can happen, eg, if you use very different instance types for masters and workers in AWS, or if you have a cluster where the masters are bare metal but the workers are VMs, etc.) In this case, manually overriding the MTU in the network operator config is the correct solution.
- (If the masters have a lower MTU than the workers then everything will be fine, since the cluster network MTU has to be based on the lowest node MTU in the cluster.)
- If you run CNO via `./hack/run-locally.sh` for development purposes, then CNO will set the cluster MTU based on the MTU of your local development machine. This means you may end up with a cluster MTU of 1450 on an AWS cluster with MTU 9001 nodes. (More problematic would be if you had a jumbo MTU locally, and a smaller MTU in the cluster; then you'd have to manually override the cluster MTU in order to make things work right.)
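The cases above all reduce to one check: the cluster MTU plus the tunnel overhead has to fit within the smallest interface MTU of any node. Here is a minimal sketch of that check. The helper names and the node names/MTUs are hypothetical (none of this is CNO code), and it assumes the 50-byte VXLAN-over-IPv4 overhead derived later in this document.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of CNO.

VXLAN_IPV4_OVERHEAD = 50  # bytes; derived later in this document


def autodetected_cluster_mtu(master_mtu: int, overhead: int = VXLAN_IPV4_OVERHEAD) -> int:
    """Mimic autodetection: the master's interface MTU minus tunnel overhead."""
    return master_mtu - overhead


def max_safe_cluster_mtu(node_mtus: dict, overhead: int = VXLAN_IPV4_OVERHEAD) -> int:
    """Largest cluster MTU that fits on every node: lowest node MTU minus overhead."""
    return min(node_mtus.values()) - overhead


def nodes_too_small(node_mtus: dict, cluster_mtu: int,
                    overhead: int = VXLAN_IPV4_OVERHEAD) -> list:
    """Names of nodes whose interface MTU cannot carry an encapsulated packet."""
    return sorted(name for name, mtu in node_mtus.items()
                  if cluster_mtu + overhead > mtu)


# Hypothetical cluster: jumbo-frame masters, standard-MTU workers.
nodes = {"master-0": 9001, "worker-0": 1500, "worker-1": 1500}
detected = autodetected_cluster_mtu(nodes["master-0"])

print(detected)                          # 8951
print(nodes_too_small(nodes, detected))  # ['worker-0', 'worker-1']
print(max_safe_cluster_mtu(nodes))       # 1450
```

In this made-up cluster the autodetected value is too large for the workers, so the MTU would need to be overridden manually with the "max safe" value.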
The MTU of an interface is the maximum size of an ethernet payload. So, eg, an interface with an MTU of 1500 can carry frames of (at least) 1518 bytes: a 14-byte ethernet header plus a 1500-byte payload plus a 4-byte checksum. The 1500-byte payload could be, eg, a UDPv4 packet consisting of a 20-byte IPv4 header, an 8-byte UDP header, and a 1472-byte UDP payload.
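The arithmetic in the paragraph above can be written out as a short sketch. The constant and function names are illustrative only; the header sizes are the ones quoted in the text.

```python
# Header sizes in bytes, as quoted in the paragraph above.
ETH_HEADER = 14
ETH_CHECKSUM = 4
IPV4_HEADER = 20
UDP_HEADER = 8


def frame_size(mtu: int) -> int:
    """Size on the wire of a full ethernet frame carrying an MTU-sized payload."""
    return ETH_HEADER + mtu + ETH_CHECKSUM


def udp4_payload_size(mtu: int) -> int:
    """Largest UDP payload that fits in one IPv4 packet of MTU bytes."""
    return mtu - IPV4_HEADER - UDP_HEADER


print(frame_size(1500))         # 1518
print(udp4_payload_size(1500))  # 1472
```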
If that UDP payload is a VXLAN packet, then (per RFC 7348) it consists of 8 bytes for the fixed VXLAN header, followed by 14 bytes for the inner ethernet header, followed by the inner ethernet payload. (There is no inner trailing checksum.) That means the MTU (largest inner ethernet payload) of the VXLAN tunnel is 1500-20-8-8-14=1450 bytes, ie 50 bytes less than the MTU of the parent interface. OTOH, if the VXLAN packet is being sent over IPv6 then the outer UDPv6 packet has a 40-byte IPv6 header (assuming no IPv6 options) leaving less room for the inner ethernet payload, so the inner MTU is 1500-40-8-8-14=1430. (It doesn't matter whether the inner packets are IPv4 or IPv6 because the MTU gives the size of the inner ethernet payload, not the inner TCP/UDP payload.)
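As a sketch, the VXLAN calculation above peels off one header at a time. The names here are illustrative; the sizes come from the paragraph above.

```python
# Encapsulation header sizes in bytes, as given in the text above.
OUTER_IP_HEADER = {"ipv4": 20, "ipv6": 40}  # IPv6 assumes no extension headers
UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETH_HEADER = 14  # no inner trailing checksum


def vxlan_mtu(parent_mtu: int, outer_ip: str = "ipv4") -> int:
    """Largest inner ethernet payload a VXLAN tunnel can carry."""
    return (parent_mtu
            - OUTER_IP_HEADER[outer_ip]
            - UDP_HEADER
            - VXLAN_HEADER
            - INNER_ETH_HEADER)


print(vxlan_mtu(1500))          # 1450
print(vxlan_mtu(1500, "ipv6"))  # 1430
print(vxlan_mtu(9001))          # 8951
```

Note that only the outer IP family is a parameter: the inner packet's IP family does not affect the result.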
For Geneve (per draft-ietf-nvo3-geneve-16, since published as RFC 8926), there is also a fixed 8-byte header, but also possibly options. According to ovn-architecture(7), OVN uses a single additional 32-bit option for datapath information, so that's an extra 4 bytes of Geneve metadata describing the option, and 4 bytes of actual OVN datapath data. That makes the total overhead 20+8+8+4+4+14=58 bytes over IPv4, so the MTU for OVN-Kubernetes should be 58 less than the parent MTU.
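Because Geneve's overhead depends on which options are carried, it is easier to reason about as a function of the option list. This is a rough sketch with illustrative names; the second option in the last example is purely hypothetical.

```python
# Sizes in bytes, as given in the text above.
IPV4_HEADER = 20
UDP_HEADER = 8
GENEVE_FIXED_HEADER = 8
GENEVE_OPTION_HEADER = 4  # metadata describing each option
INNER_ETH_HEADER = 14


def geneve_overhead(option_data_sizes=(4,), outer_ip_header=IPV4_HEADER) -> int:
    """Total tunnel overhead given the data size of each Geneve option.

    The default is OVN's single option carrying 4 bytes of datapath data.
    """
    options = sum(GENEVE_OPTION_HEADER + size for size in option_data_sizes)
    return (outer_ip_header + UDP_HEADER + GENEVE_FIXED_HEADER
            + options + INNER_ETH_HEADER)


print(geneve_overhead())                # 58
print(1500 - geneve_overhead())         # 1442
# Hypothetical: one more option with 4 bytes of data adds another 8 bytes.
print(geneve_overhead((4, 4)))          # 66
```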
So:
- OpenShift SDN: base MTU minus 50
- OVN-Kubernetes: base MTU minus 58
- IPv6 or IPv6-primary dual-stack: subtract another 20 bytes
- IPv4-primary dual-stack: same as for IPv4, because we'll use the nodes' IPv4 IPs as the tunnel endpoints, so the IPv6-ness is irrelevant
- IPsec: subtract another 46 bytes
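The rules in this list can be combined into one small function. This is a sketch under stated assumptions: the function and parameter names are made up for illustration and are not CNO's actual API, and the plugin keys are just labels for the two cases in the list.

```python
# Overheads in bytes, taken from the list above.
TUNNEL_OVERHEAD = {"OpenShiftSDN": 50, "OVNKubernetes": 58}
IPV6_EXTRA = 20   # IPv6 or IPv6-primary dual-stack tunnel endpoints
IPSEC_EXTRA = 46


def cluster_mtu(base_mtu: int, plugin: str,
                ipv6_tunnels: bool = False, ipsec: bool = False) -> int:
    """Cluster network MTU for a given base (node interface) MTU.

    ipv6_tunnels should be False for IPv4-primary dual-stack, since the
    tunnel endpoints are then the nodes' IPv4 addresses.
    """
    mtu = base_mtu - TUNNEL_OVERHEAD[plugin]
    if ipv6_tunnels:
        mtu -= IPV6_EXTRA
    if ipsec:
        mtu -= IPSEC_EXTRA
    return mtu


print(cluster_mtu(1500, "OpenShiftSDN"))                      # 1450
print(cluster_mtu(1500, "OVNKubernetes"))                     # 1442
print(cluster_mtu(1500, "OVNKubernetes", ipv6_tunnels=True))  # 1422
print(cluster_mtu(1500, "OVNKubernetes", ipsec=True))         # 1396
```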
If, in the future, we implement "namespace tagging" / "packet marking" / whatever in OVN-Kubernetes, that will probably require another 8 bytes of Geneve options.