Warning: This is a proof-of-concept experiment. Use on a production system at your own risk!
Authors: Ferenc Orosi, Ferenc Fejes
TSN metadata proxy is a thin CNI plugin that enables time-sensitive microservices to use the TSN network. Linux provides time-sensitive APIs through the socket API: with socket options or ancillary messages the application can set priority or scheduling metadata. If the application is deployed as a microservice (e.g. a Kubernetes pod), this metadata is removed by the Linux kernel's network namespace isolation mechanism. This CNI plugin ensures the metadata is preserved even after the packets leave the pod.
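For example, an application can request a per-packet priority with the `SO_PRIORITY` socket option. A minimal sketch using `socat` as a stand-in for a real time-sensitive application (the address and the priority value are purely illustrative):
socat udp:192.0.2.1:1234,so-priority=3 -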
Main features:
- Does not require modification of the time-sensitive microservice
- Compatible with various Kubernetes distributions (like normal deployment, Minikube or Kind)
- Acts as a secondary CNI plugin, used together with a primary plugin like Calico, Kindnet, Flannel, etc.
- Single `kubectl apply` configuration
Requirements:
- Kubernetes with Docker container engine
- Linux with eBPF enabled (see the quick check below)
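A quick way to confirm that BPF is usable and the BPF filesystem is mounted on the host (the proxy pins its programs and maps under /sys/fs/bpf, as shown later in this guide):
mount | grep bpf
ls /sys/fs/bpf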
A minimal example deployment consists of:
- `talker` - a time-sensitive microservice pod with application(s) inside
- `bridge` - a bridge created by the primary CNI plugin in the host (`node`) network namespace
- `node` - a Linux host Kubernetes node with a physical TSN network interface `tsn0`
┌─────────────────────────────────────────────────────────────────┐
│ node│
│ ┏━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━┓ │
│ ┃talker (pod) ┃ ┃bridge ┃ ┌───────────┤
│ ┃ ┌───────────┨ ┠───────────┐ ┃ │ │
│ ┃ │eth0 (veth)┃ ┃veth_xxxxx │ ┠ ─ ─ ─ ▶│tsn0 (NIC) │
│ ┃ │ ┠───────▶┃(veth) │ ┃ └────●──────┤
│ ┗━━━━┷━━━●━━━━━━━┛ ┗━━━━━━●━━━━┷━━━━┛ ┌──┘ │
│ ┌┘ ┌─┘ TSN compatible │
│ eBPF program (saver) eBPF program network card, │
│ to store TSN metadata (restorer) with TSN Qdisc │
│ to retore TSN │
│ metadata │
└─────────────────────────────────────────────────────────────────┘
In this deployment there is a `talker`, which is a pod.
The application inside `talker` will generate traffic with priority metadata.
The destination address of the traffic is outside of the node.
Therefore the packet leaves the pod network namespace,
goes straight to the CNI bridge in the `node` network namespace,
where it is NAT-ed, routed and sent towards the `tsn0` NIC.
On the NIC a traffic control qdisc such as `taprio` or `mqprio` is configured.
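If a full gate schedule is not needed, a plain priority-to-queue mapping can be achieved with `mqprio` instead of `taprio`. A minimal sketch, assuming the NIC exposes at least 4 TX queues (the handle and queue layout below are illustrative and mirror the `taprio` example later in this guide):
tc qdisc replace dev tsn0 parent root handle 100 mqprio \
    num_tc 4 \
    map 0 1 2 3 \
    queues 1@0 1@1 1@2 1@3 \
    hw 0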
First, build the proxy container image that compiles the BPF programs
and includes all the necessary components of the TSN proxy.
This is required for every Kubernetes setup; the same image
can be used for a node deployment or a Minikube or Kind based test setup.
docker build . -t tsn-metadata-proxy:latest -f tsn-metadata-proxy.Dockerfile
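Optionally, confirm that the image is present in the local Docker image store:
docker images tsn-metadata-proxy:latest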
In the rest of the guide we only focus on Kind and Minikube.
These are popular Kubernetes distributions for quick deployment testing.
Compared to the example deployment diagram above,
if Kind or Minikube is used, the `node` itself is also a container.
In normal production deployments `kubelet` and the other core components
are running within the `node`'s root network namespace.
Start a Kubernetes cluster with a custom configuration that mounts the BPF
file system:
kind create cluster --name=demo --config cluster.yaml
The content of the cluster.yaml
is the following:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: false
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /sys/fs/bpf
        containerPath: /sys/fs/bpf
  - role: worker
    extraMounts:
      - hostPath: /sys/fs/bpf
        containerPath: /sys/fs/bpf
There are two nodes, a worker and a control-plane.
The BPF filesystem /sys/fs/bpf/
is exposed for both nodes at the default path.
This setup will create a cluster using the default CNI, which is Kindnet.
To replace it with another CNI, set `disableDefaultCNI` from `false` to `true`
in the `cluster.yaml` file. This change only disables the default CNI.
However, for functional networking, basic CNI plugins like `bridge`
must be downloaded separately:
docker exec -it demo-control-plane /bin/bash
curl -LO https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-linux-amd64-v1.1.1.tgz
tar -xvzf cni-plugins-linux-amd64-v1.1.1.tgz -C /opt/cni/bin
Once ready, you can deploy the DaemonSet of your desired CNI, for example Flannel:
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
Similarly for Calico (only choose one, do not install multiple primary CNI plugins):
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Copy the custom Docker image to the Kubernetes control plane:
kind load docker-image tsn-metadata-proxy:latest --name demo
On older kind versions this method might not work.
If that is the case, try again with an explicit TMPDIR:
TMPDIR=<any folder> kind load docker-image tsn-metadata-proxy:latest --name demo
If TMPDIR=/tmp is used, the metadata proxy CNI will be deleted on node restart.
It is advised to use a different folder!
Check if the Docker
image is loaded:
docker exec -it demo-control-plane crictl images
To check the running pods in the Kubernetes cluster, use the following command (in a new terminal):
watch -n 0.1 kubectl get pods
# or kubectl get pods -w
With the Kubernetes setup ready, deploy the TSN metadata proxy:
kubectl apply -f daemonset.yaml
Verify that the necessary files are present on the control plane,
the BPF map is created, and the CNI plugins config includes the TSN plugin.
(Note: if you use another CNI, replace `10-kindnet.conflist`):
docker exec -it demo-control-plane ls -la /opt/cni/bin/
docker exec -it demo-control-plane ls -la /sys/fs/bpf
docker exec -it demo-control-plane ls -la /sys/fs/bpf/demo-control-plane
docker exec -it demo-control-plane cat /etc/cni/net.d/10-kindnet.conflist
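If `bpftool` happens to be available inside the node container (it is not guaranteed to be part of the kind node image), the loaded BPF programs can also be inspected directly:
docker exec -it demo-control-plane bpftool prog show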
Normally, on a VM or bare-metal node, testing would require configuring the TSN NIC.
Here we have a Kind cluster, therefore the nodes are emulated as well.
To continue, configure the `taprio` qdisc on the `tsn0` NIC.
Note: by default the interface is likely named `eth0`; `tsn0` is used here to stick with the example figure.
For now, this is done on the `demo-control-plane` node:
docker exec -it demo-control-plane ethtool -L tsn0 tx 4
docker exec -it demo-control-plane tc qdisc replace dev tsn0 parent root handle 100 stab overhead 24 taprio \
num_tc 4 \
map 0 1 2 3 \
queues 1@0 1@1 1@2 1@3 \
base-time 1693996801300000000 \
sched-entry S 03 10000 \
sched-entry S 05 10000 \
sched-entry S 09 20000 \
clockid CLOCK_TAI
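For reference, each `sched-entry S <gatemask> <interval>` line opens the traffic classes selected by the bitmask for the given interval in nanoseconds, so the schedule above is: 0x03 = TC0+TC1 for 10 µs, 0x05 = TC0+TC2 for 10 µs, 0x09 = TC0+TC3 for 20 µs. To confirm the qdisc was installed:
docker exec -it demo-control-plane tc qdisc show dev tsn0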
For testing purposes, we deploy a pod using the `alpine/socat` image as the base image.
With this Kind based Kubernetes setup, we need to tell Kubernetes where to schedule the pod.
For this, modify the `nodeSelector` in `talker.yaml` to select the control plane, as sketched below.
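A minimal sketch of the relevant part of `talker.yaml` (the `kubernetes.io/hostname` node label is one common way to pin a pod to a node; the actual manifest may use a different selector, and the rest of the pod spec is omitted here):
spec:
  nodeSelector:
    kubernetes.io/hostname: demo-control-plane
Then apply the manifest: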
kubectl apply -f talker.yaml
When deploying a pod on the control plane in our multi-node setup, we have to remove the `NoSchedule` taint.
This taint is set by default on nodes with the `control-plane` role.
To remove this taint, we can use the `-` (minus) sign with the `NoSchedule` effect:
kubectl taint nodes demo-control-plane node-role.kubernetes.io/control-plane:NoSchedule-
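To double-check that the taint is gone and the `talker` pod was scheduled on the control-plane node:
kubectl describe node demo-control-plane | grep -i taint
kubectl get pod talker -o wide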
Now we have to generate traffic and send packets with different priorities.
To do that, execute `socat` from a terminal in the `talker` pod:
kubectl exec -it talker -- socat udp:8.8.8.8:1234,so-priority=1 -
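Optionally, send with another priority as a cross-check; with the `map 0 1 2 3` configured above, priority 2 traffic should show up in the third leaf (`parent 100:3`):
kubectl exec -it talker -- socat udp:8.8.8.8:1234,so-priority=2 -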
To verify that the packet metadata was preserved (restored), we can check the statistics of the `taprio` qdisc:
docker exec -it demo-control-plane tc -s -d qdisc show dev tsn0
As shown in the output below, packets with priority 1 are transmitted through the second queue.
In the output, queue numbering starts at 1 while priorities start at 0.
Therefore `qdisc pfifo 0: parent 100:2` is the second `taprio` leaf (priority 1).
qdisc taprio 100: root refcnt 9 tc 4 map 0 1 2 3 0 0 0 0 0 0 0 0 0 0 0 0
queues offset 0 count 1 offset 1 count 1 offset 2 count 1 offset 3 count 1
clockid TAI base-time 1693996801300000000 cycle-time 40000 cycle-time-extension 0
index 0 cmd S gatemask 0x3 interval 10000
index 1 cmd S gatemask 0x5 interval 10000
index 2 cmd S gatemask 0x9 interval 20000
overhead 24
Sent 10717397 bytes 17292 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:4 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:3 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:2 limit 1000p
Sent 2010 bytes 30 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:1 limit 1000p
Sent 10713846 bytes 17239 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Delete the whole cluster:
kind delete clusters --all
If the BPF programs remain loaded, you can remove them using the following commands.
On a real (not Kind-emulated) cluster the `rm` commands must be executed on every node:
sudo rm /sys/fs/bpf/restorer
sudo rm /sys/fs/bpf/saver
sudo rm /sys/fs/bpf/tracking
sudo rm /sys/fs/bpf/garbage_collector
sudo rm /sys/fs/bpf/timer_map
sudo rm /sys/fs/bpf/tsn_map
sudo rm -r /sys/fs/bpf/demo-control-plane
sudo rm -r /sys/fs/bpf/demo-worker
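After these commands, the pinned objects listed above should no longer be present:
sudo ls /sys/fs/bpf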
Start a Kubernetes cluster, mount the BPF file system, and set the CNI plugin to Flannel.
With Minikube no cluster manifest is required.
As a limitation, Minikube can only emulate a single-node cluster.
minikube start --mount-string="/sys/fs/bpf:/sys/fs/bpf" --mount --cni=flannel
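Once Minikube is up, a quick sanity check of the single node:
minikube status
kubectl get nodes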
Add the TSN metadata proxy container image to the Kubernetes local repository:
minikube image load tsn-metadata-proxy:latest
Check if the Docker
image is loaded:
docker exec -it minikube crictl images
To watch the running pods in a Kubernetes cluster, we can open a new terminal and run the following command:
watch -n 0.1 kubectl get pods
# or kubectl get pods -w
With the Kubernetes setup ready, deploy the TSN metadata proxy:
kubectl apply -f daemonset.yaml
Verify that the necessary files are present on the control plane,
the BPF map is created, and the CNI plugins config includes the TSN plugin.
(Note: if you use another CNI, replace `10-flannel.conflist`):
docker exec -it minikube ls -la /opt/cni/bin/
docker exec -it minikube ls -la /sys/fs/bpf
docker exec -it minikube ls -la /sys/fs/bpf/minikube
docker exec -it minikube cat /etc/cni/net.d/10-flannel.conflist
Normally, on a VM or bare-metal node, testing would require configuring the TSN NIC.
Here we have a Minikube cluster, therefore the node is emulated as well.
To continue, configure the `taprio` qdisc on the `tsn0` NIC.
Note: by default the interface is likely named `eth0`; `tsn0` is used here to stick with the example figure.
For now, this is done on the `minikube` node:
docker exec -it minikube ethtool -L tsn0 tx 4
docker exec -it minikube tc qdisc replace dev tsn0 parent root handle 100 stab overhead 24 taprio \
num_tc 4 \
map 0 1 2 3 \
queues 1@0 1@1 1@2 1@3 \
base-time 1693996801300000000 \
sched-entry S 03 10000 \
sched-entry S 05 10000 \
sched-entry S 09 20000 \
clockid CLOCK_TAI
For testing purposes, we deploy a pod using the alpine/socat
image as the base image.
kubectl apply -f talker.yaml
Now we have to generate traffic and send packets with different priorities.
To do that, execute `socat` from a terminal in the `talker` pod:
kubectl exec -it talker -- socat udp:8.8.8.8:1234,so-priority=1 -
To verify that the packet metadata was preserved (restored), we can check the statistics of the `taprio` qdisc:
docker exec -it minikube tc -s -d qdisc show dev tsn0
As shown in the output below, packets with priority 1 are transmitted through the second queue.
In the output, queue numbering starts at 1 while priorities start at 0.
Therefore `qdisc pfifo 0: parent 100:2` is the second `taprio` leaf (priority 1).
qdisc taprio 100: root refcnt 9 tc 4 map 0 1 2 3 0 0 0 0 0 0 0 0 0 0 0 0
queues offset 0 count 1 offset 1 count 1 offset 2 count 1 offset 3 count 1
clockid TAI base-time 1693996801300000000 cycle-time 40000 cycle-time-extension 0
index 0 cmd S gatemask 0x3 interval 10000
index 1 cmd S gatemask 0x5 interval 10000
index 2 cmd S gatemask 0x9 interval 20000
overhead 24
Sent 3366617 bytes 5240 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:4 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:3 limit 1000p
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:2 limit 1000p
Sent 2709 bytes 40 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 100:1 limit 1000p
Sent 3363908 bytes 5200 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Delete the whole cluster (sudo is important!!!):
sudo minikube delete --all
If there is an error and the cleanup cannot complete, we can remove every remnant of the TSN metadata proxy with the following commands. This does not uninstall the TSN metadata proxy CNI plugin itself, only the running configs.
sudo rm /sys/fs/bpf/restorer
sudo rm /sys/fs/bpf/saver
sudo rm /sys/fs/bpf/tracking
sudo rm /sys/fs/bpf/garbage_collector
sudo rm /sys/fs/bpf/timer_map
sudo rm /sys/fs/bpf/tsn_map
sudo rm -r /sys/fs/bpf/minikube/
This work was supported by the European Union’s Horizon 2020 research and innovation programme through the DETERMINISTIC6G project under Grant Agreement No. 101096504.