KEP-4963: Kube-proxy Services Acceleration #4964

Open · wants to merge 2 commits into master
Conversation

@aojea (Member) commented Nov 15, 2024

@k8s-ci-robot added the `cncf-cla: yes` label Nov 15, 2024

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the `kind/kep` label Nov 15, 2024
@k8s-ci-robot added the `sig/network` label Nov 15, 2024
@aojea changed the title from "KEP-4963: use flowtables to accelerate kube-proxy" to "WIP - KEP-4963: use flowtables to accelerate kube-proxy" Nov 15, 2024
@k8s-ci-robot added the `do-not-merge/work-in-progress` and `size/XL` labels Nov 15, 2024
@aojea marked this pull request as draft November 15, 2024 05:01
@aojea mentioned this pull request Nov 15, 2024
@aojea changed the title from "WIP - KEP-4963: use flowtables to accelerate kube-proxy" to "KEP-4963: Kube-proxy Services Acceleration" Nov 28, 2024
@aojea marked this pull request as ready for review November 28, 2024 12:10
@k8s-ci-robot removed the `do-not-merge/work-in-progress` label Nov 28, 2024
@aojea (Member Author) commented Nov 28, 2024:

/assign @thockin @danwinship

I leave it up to you whether it is worth going through the feature gate process, given that this is an opt-in option.

@adrianmoisey (Member):

I like this proposal. Giving users the choice to opt in makes a lot of sense and reduces risk.

@danwinship (Contributor) left a comment:

So I do like the idea of using flowtables in Kubernetes, though I'm not sure we've figured out all of the details of how it makes the most sense yet...


Users will be able to opt in to Service traffic acceleration by passing a CEL expression via the `--accelerated-interface-expression` flag or the `AcceleratedInterfaceExpression` configuration option, to match the network interfaces on the node that are subject to Service traffic acceleration. The absence of a CEL expression disables the feature.

Kube-proxy will create a `flowtable` named `kube-proxy-flowtable` in the kube-proxy table and will monitor the node's network interfaces to populate the `flowtable` with the interfaces that match the configured CEL expression.
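For illustration only, a minimal sketch of what such a ruleset could look like, assuming the flowtable hooks at ingress and that `eth0`/`eth1` are the interfaces matched by the CEL expression (the forward chain, hook, and priority shown here are assumptions, not taken from the KEP text):

```
table ip kube-proxy {
    # Flowtable whose devices list kube-proxy would keep in sync with the
    # interfaces matched by the configured CEL expression.
    flowtable kube-proxy-flowtable {
        hook ingress priority 0
        devices = { eth0, eth1 }
    }

    # Illustrative chain: connections traversing the forward hook are handed
    # to the flowtable for software fast-path forwarding.
    chain forward {
        type filter hook forward priority 0; policy accept;
        ip protocol { tcp, udp } flow add @kube-proxy-flowtable
    }
}
```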
Contributor:

At some point we need to investigate the performance implications of this, since it seems clear that this isn't really how flowtables were meant to be used, and there could potentially be things that are O(n^2) in the number of devices.

(Also, I wouldn't be surprised if it turned out that recreating the flowtable with an updated devices value discards all existing flow offload entries. This isn't horrible since the offload will just get recreated on the next packet, but it's not ideal and suggests maybe we should be working with the netfilter team to optimize this all a bit more...)

(This is maybe a topic for "Risks and Mitigations".)

Member Author:

Good point; I also need to consult the netfilter folks about this.

Member:

At some point we need to investigate the performance implications of this, since it seems clear that this isn't really how flowtables were meant to be used, and there could potentially be things that are O(n^2) in the number of devices.

Regarding "since it seems clear that this isn't really how flowtables were meant to be used", could you expand on that? I'm unsure what the intended use of flowtables is, and how this use case is unintended.

I'm assuming this may be written down somewhere, so I'm happy if you point me at a README or similar.

Contributor:

It's not really written down, and I don't know exactly what the model is. But if they expected people to set up flows between every pair of devices on the node, then they wouldn't have required you to enumerate specific devices, right?

Member Author:

But if they expected people to set up flows between every pair of devices on the node, then they wouldn't have required you to enumerate specific devices, right?

AFAIK the flowtable (cache table) is per device, so this may be linking the specific flow to the specific device


### Upgrade / Downgrade Strategy

kube-proxy reconciles the nftables rules, so the rules will be reconciled during startup and added or removed depending on how kube-proxy is configured.
@danwinship (Contributor), Dec 2, 2024:

FWIW when downgrading, any existing flow table entries will be preserved, so any Service connections that remain up across a kube-proxy downgrade would still be offloaded and there'd be no way to un-offload them with the older kube-proxy.

(That could be mitigated by recommending that people restart kube-proxy with the feature disabled before downgrading.)

(Except that wouldn't work either; you need to delete the flowtable at startup if the feature is disabled.)

Member Author:

Hmm, I need to document and understand this better. If there is a way to clean up, we already have a kube-proxy cleanup command, so I wonder if we can reuse it for this.

Contributor:

I was talking about downgrading to a release that doesn't know anything about flowtables, but I guess that's only a problem for people using it during Alpha.

Anyway, you just need to make it so that if you start nftables kube-proxy in a version that knows about flowtables, but with flowtables disabled, then it should ensure the flow table doesn't exist when it does its first sync rather than ensuring that it does exist.
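As a sketch of what that first sync could do when the feature is disabled (assuming any rules referencing the flowtable have already been removed, since as far as I know a flowtable cannot be deleted while rules still reference it):

```
# Illustrative cleanup: ensure the flowtable is absent when acceleration is disabled.
# If the flowtable does not exist, the delete simply returns an error that can be ignored.
nft delete flowtable ip kube-proxy kube-proxy-flowtable
```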

Member Author:

@danwinship I see now in the code that we flush chain by chain instead of the whole table when starting ... is there any reason for doing that? Flushing the whole table would allow handling downgrades/upgrades of the table schema without further considerations.

Contributor:

When I first wrote the code, I thought nft flush table ip kube-proxy would flush the contents of sets too, which would mean that restarting kube-proxy would break any active affinity. But it turns out that it doesn't actually do that anyway; flushing a table just flushes (but does not delete) every chain within the table.
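For reference, a sketch of the two commands being discussed (the chain name is hypothetical; the behavior described above is that flushing the table empties the rules of every chain but leaves the chains, set contents, and the flowtable in place):

```
# Per-chain flush, as kube-proxy does today (chain name is illustrative).
nft flush chain ip kube-proxy services

# Whole-table flush: empties every chain's rules, keeps chains, sets, and flowtables.
nft flush table ip kube-proxy
```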

@aojea (Member Author), Dec 9, 2024:

Then, should we change kube-proxy to flush the entire table?

Member Author:

Flushing still keeps the flowtable, so that is not a valid option.

Member Author:

I need to think about whether it is possible to have an integrity check 🤔 and do a full resync.

Use the kernel flowtables infrastructure to allow kube-proxy users to
accelerate service traffic.

Change-Id: Iee638c8e86a4d17ddbdb30901b4fb4fd20e7dbda

[Elephant flow detection is a complex topic with a considerable body of literature](https://scholar.google.pt/scholar?q=elephant+flow+detection&hl=es&as_sdt=0&as_vis=1&oi=scholart). For our use case we propose a simpler approach based on the number of packets, which gives a good trade-off between performance improvement and safety; we want to avoid complex heuristics and keep the behavior predictable and easy to reason about, based on the lessons learned from [KEP-2433 Topology aware hints](https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2433-topology-aware-hints#proportional-cpu-heuristic).

We chose 20 packets as the threshold to offload, based on the existing threshold used by Cisco in [Cisco Application Centric Infrastructure](https://scholar.google.com/scholar_lookup?hl=en&publication_year=2018&author=G.+Tam&title=Cisco+Application+Centric+Infrastructure): Cisco considers a flow an elephant flow if it contains more than 15 packets (i.e., a short flow is fewer than 15 packets), and we add 5 packets of buffer to be on the safe side. This means that, using TCP as an example, assuming an MTU of 1500 bytes and removing the overhead of the TCP headers (which can vary from 20-60 bytes; we use 40 for this example), offloading will benefit workloads that transmit more than: TCP payload * number of packets = 1460 bytes/packet * 20 packets = 29200 bytes.
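As an illustration of how such a threshold could be expressed in the nftables ruleset (a sketch; the excerpt does not specify the rule, and counting packets via conntrack requires connection accounting, i.e. `nf_conntrack_acct`, to be enabled):

```
table ip kube-proxy {
    chain forward {
        type filter hook forward priority 0; policy accept;
        # Assumes the kube-proxy-flowtable declared earlier in this table.
        # Offload only connections that have already carried more than 20 packets
        # in the original direction; earlier packets stay on the normal path.
        ct original packets > 20 flow add @kube-proxy-flowtable
    }
}
```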
@adrianmoisey (Member), Dec 9, 2024:

This seems sensible, and I like that it uses the lessons of users to inform how Kubernetes could operate.
What I'm wondering is: is there a disadvantage for a stream that is 21 packets long?

I assume this implementation will mean that packets 1-20 will go via the normal path, then packet 21 will be offloaded to the fast path.

I'm just wondering what happens, to performance, if the stream is offloaded to the fast path just before it ends.

Member Author:

I'm just wondering what happens, to performance, if the stream is offloaded to the fast path just before it ends.

The great thing about IP is that it is independent of the path: just as you reach google.com through many devices and links, whichever path the packet takes does not matter unless there are operations performed on the packet ... but that should be exceptional, and that is why we propose a knob to let users completely disable this behavior.

If you look at the netperf tests below, they exercise:

  1. no offload
  2. offload after connection established
  3. offload after packet number 20

and the performance improvement is still considerable, since at most 29200 bytes or 20 packets hit the slow path ... of course, the larger your stream of data, the closer you get to the asymptotic limit.

@adrianmoisey (Member):

/lgtm

@k8s-ci-robot added the `lgtm` label Dec 17, 2024
Labels: cncf-cla: yes, kind/kep, lgtm, sig/network, size/XL
5 participants