Network binding plugin: Support compute container resource overhead #303
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/uncc @aburdenthehand @jobbler |
ormergi left a comment:
Thanks for the PR, overall looks good, see my inline comments.
Regarding the PR description and the second commit message: I think they should also mention that the memory overhead is necessary to avoid pod eviction (passt VMs consume more memory than expected) and to improve scheduling results; passt VMs won't be scheduled on nodes that don't have enough memory.
```yaml
containers:
- name: compute
  resources:
    requests:
```
I think resource limits should be set as well: in case the pod satisfies the Guaranteed QoS class, the plugin overhead should not become the reason for violating it.
AFAIK, there is logic [1] to automatically add memory limits when needed (there is also an equivalent for CPU).
[1] https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-controller/services/renderresources.go#L189
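For context on the Guaranteed point raised above: Kubernetes grants that QoS class only when every container's requests equal its limits, so any overhead added to a request must be mirrored in the limit to preserve the class. A minimal sketch (values illustrative):

```yaml
containers:
- name: compute
  resources:
    requests:
      cpu: "1"
      memory: 2548Mi   # base request + plugin overhead
    limits:
      cpu: "1"         # Guaranteed QoS: limits must equal requests,
      memory: 2548Mi   # so the overhead must be added here as well
```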
> Both of the options are not perfect, but the `virt-handler.SyncVMI` has fewer cons. Therefore, it was chosen.
>
> #### Additional resource requests for the virt-launcher compute container
I would rephrase this section to be vague about the concrete plugin or dependency inside virt-launcher that requires it: say that in case a plugin requires memory overhead, it should be specified in the CR, and refer to the passt example.
Adjusted the wording, please tell me what you think.
> The passt binary is shipped and executed inside the virt-launcher pod.
>
> ### Domain definition & creation
I would simplify this section, saying the sidecar adds a passt interface to the domain, similar to the vDPA example; maybe mention it uses libvirt's user-space networking settings: https://libvirt.org/formatdomain.html#userspace-slirp-or-passt-connection.
You can also refer to the slirp example section, saying the passt sidecar works in a similar way.
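For reference, assembling the domain snippet quoted later in this thread into the user-space interface shape from the linked libvirt docs gives roughly the following (a sketch; attribute values illustrative):

```xml
<interface type='user'>
  <backend type='passt' logFile='/var/run/kubevirt/passt.log'/>
  <model type='virtio-non-transitional'/>
  <portForward proto='tcp'/>
  <portForward proto='udp'/>
</interface>
```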
Force-pushed from ed97ca9 to 8432599.
EdDev left a comment:
First commit review.
Although I placed inline comments on that commit, I think the example addition is not a must.
At the design level, we are not really interested in a specific binding, but in the general concept.
The general concept can be applied to other bindings; the example just shows how to use it.
```yaml
namespace: default
spec:
  config: '{
    "cniVersion": "0.3.1",
```
Maybe we should use 1.0.X in the example.
This was removed.
Please ack so we can resolve the thread.
```yaml
metadata:
  name: virt-launcher-123
  annotations:
    k8s.v1.cni.cncf.io/networks: '[{"name":"netbindingpasst","namespace":"mynamespace","cni-args":{"logicNetworkName":"default"}}]'
```
I think the logicNetworkName is supposed to be passtnet.
This was removed.
Please ack so we can resolve the thread.
> ### Configure Pod network namespace
>
> Not required for passt binding
We do configure networking for passt.
This was removed.
Please ack so we can resolve the thread.
> ### Run services in the virt-launcher pod
>
> The passt binary is shipped and executed inside the virt-launcher pod.
This should say: Not required for passt binding.
This was removed.
Please ack so we can resolve the thread.
```xml
<portForward proto='udp'/>
<model type='virtio-non-transitional'/>
<backend type='passt' logFile='/var/run/kubevirt/passt.log'/>
<alias name='ua-default'/>
```
This name is not in-sync with the network name used in the VM spec.
This was removed.
Please ack so we can resolve the thread.
EdDev left a comment:
Second commit:
The commit message hints and explains the passt binding and everything is driven from it. While this is correct in terms of why we do all this, I do not think we should start the story there.
The story should start from the need, mentioning passt as an example.
The fact that there is a binary alongside libvirt is not that important. If it was part of libvirt we would have needed to take that into account as well.
Also, it would be better to leave the details to the design and, in the commit, just provide the topic/subject. That way we can easily review it and adjust it per that review.
> - domainAttachment (a standard domain definition that exists in the core, e.g. `tap`, `sriov`).
> - downwardAPI (e.g. `device-info`)
> - resourceOverhead (currently only an additional memory overhead request for the compute container in the virt-launcher pod is supported)
I would be specific here and mention the compute container in the name, e.g. `computeResources`.
This also opens up different questions and options which should be discussed in a proposal:
- Should we consider only the compute container? What about the sidecar container?
- Ref the compute container, we can use a specific name for the field (e.g. `computeResources`), or we can make it configurable using a flag under a general `resources`.
- What is the reason for not just doing a `computeMemoryRequest`? This needs to convince the reviewers.
Now, we can make it even more general by moving away from the specific network usage and looking at a general sidecar hook that KubeVirt supports.
E.g.: the KubeVirt CR will have a policy for allocating resources to sidecar containers and possibly the compute container, and this policy would be referenced from the network binding definition or from a sidecar hook. This may be overkill, but thinking about it and examining the pros/cons can be useful.
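A rough sketch of the two field shapes contrasted above (all field names hypothetical, for discussion only):

```yaml
# Option A: a dedicated field scoped to the compute container.
binding:
  passt:
    computeResources:
      requests:
        memory: 500Mi

# Option B: a general resources block with an explicit target,
# leaving room to cover the sidecar container as well.
binding:
  passt:
    resources:
      target: compute   # could also be: sidecar
      requests:
        memory: 500Mi
```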
> Some binding plugins may require an additional binary to be shipped inside the virt-launcher compute image.
> This binary requires additional resources that have to be explicitly stated.
I do not think we need to limit it to an additional binary. It could be just more resources from the cgroupv2 of the compute container, consumed by an existing binary (e.g. libvirt).
Done.
```yaml
networkAttachmentDefinition: default/netbindingpasst
sidecarImage: quay.io/kubevirt/network-passt-binding
resourceOverhead:
  memory: 800Mi
```
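For context, this block sits under the plugin's registration in the KubeVirt CR; a fuller sketch, assuming the registration API from the network binding plugin design:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    network:
      binding:
        passt:
          networkAttachmentDefinition: default/netbindingpasst
          sidecarImage: quay.io/kubevirt/network-passt-binding
          resourceOverhead:   # the field under discussion in this thread
            memory: 800Mi
```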
I think you will have to explain why not:
- Directly `memoryOverhead: 500Mi` (BTW, it should be 500 and not 800).
- With explicit `request`, being open for adding `limit`.
runtimeClasses have a pretty similar concept, maybe worth aligning the API idea:
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/
i.e.

```yaml
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
```
> runtimeClasses have a pretty similar concept, maybe worth aligning the API idea

Interesting, thanks for sharing this.
I think the API could be in sync with the general resources concept.
This allows controlling the resource additions to the container, with the ability to extend it in the future to any type of resource and to define both requests and limits.
The oddity here, as I see it, is how we express the resources of a different container (i.e. compute) while still not locking out the ability to do the same for the sidecar itself.
@EdDev I'm late to review this, but I don't understand why this API needs to add anything to the compute container and not simply advertise the resources needed to run this binding container?!
Same way as it's done for hotplug / other containers?
https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-controller/services/renderresources.go#L611
> runtimeClasses have a pretty similar concept, maybe worth aligning the API idea https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/

@fabiand I doubt it aligns well. This overhead API was invented for Kata, and the overhead is added to any pod created with the runtimeClass; I don't think it aligns.
> @EdDev I'm late to review this, but I don't understand why this API needs to add anything to the compute container and not simply advertise the resources needed to run this binding container?! Same way as it's done for hotplug / other containers? https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-controller/services/renderresources.go#L611

@EdDev do we expect all network bindings to affect the compute container, or is it just passt?
passt specifically uses a feature in qemu; therefore the GetMemoryOverhead function can identify that passt will be used and add a passt-specific overhead. This is instead of a KubeVirt CR API.
What do you think?
@vladikr It is unknown at this stage how future network binding plugins will affect the compute container's resource consumption.
Network binding plugins have the ability to configure the domain, thus the virt stack might consume additional compute resources which KubeVirt cannot account for.
In the past year, the passt binding was converted from being a core feature to a plugin, so KubeVirt will not know in advance that it is used.
Also, we should strive to treat all plugins as equal, IMO.
@vladikr my comment was only about the API design, not about using the Pod API.
@vladikr Hi,
I see that, according to the network binding plugin design document, one of the integration points that seems legit is the pod definition phase.
There were no exceptions regarding resource rendering, and the KubeVirt CR was advised as a potential API to extend in order to integrate into that point.
Given that the network binding design was accepted, and that the implementation of this design also got into the codebase, I am not sure how to proceed; I'm in favor of accepting this design.
Can you please advise, @vladikr?
Force-pushed from 8432599 to 416926c.
Change: Removed the passt example, as it is not essential to the proposal. Addressed sidecar and compute container resource specification.
EdDev left a comment:
First commit review.
> The sidecar container can have a memory leak and may cause node's destabilization.
>
> Alternatives:
Please keep the decided solution here and place the alternatives (with a ref from here) in a dedicated appendix.
Done.
> - Coarse level of control.
> - Only supports CPU and memory requests and limits.
>
> 2. Additional API for sidecar resource configuration:
This is specific to the binding plugin, not the sidecar; the sentence is not clear about this.
The definition is per network binding plugin, and applied if the plugin uses a sidecar.
Done.
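A sketch of what such a per-plugin field could look like (the `sidecarResources` name is hypothetical; it would only take effect when the plugin ships a sidecar):

```yaml
binding:
  passt:
    sidecarImage: quay.io/kubevirt/network-passt-binding
    sidecarResources:   # hypothetical field name
      requests:
        memory: 50Mi
        cpu: 10m
```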
> Cons:
> - Require an API change.
> - The API will probably evolve as additional plugins will be created.
I do not understand this one.
Discussed this point offline.
It was removed.
> Cons:
> - Require an API change.
> - The API will probably evolve as additional plugins will be created.
> - May require cluster admins to adjust plugin resources during the plugin's lifecycle.
The cluster admin is responsible for registering the binding plugin in the first place, so I am unclear what this point means.
Discussed this point offline, improved the wording.
> - Only supports CPU and memory requests and limits.
>
> 2. Additional API for sidecar resource configuration:
> The network binding plugin API in the KubeVirt CR could receive an additional input field to specify the sidecar resource requirements:
Is the specified resource defined per instance of usage, or per sidecar, regardless of usage count?
E.g. there may be 1 interface using the plugin, or there may be 3 interfaces using the plugin in the same VM.
Addressed it in the text.
> For each network binding plugin used, the VMI controller will add a label on the virt-launcher pod with the following format:
> `kubevirt.io/network-binding-plugin:<plugin-name>`
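Reading the format as a key/value pair, a virt-launcher pod using the passt plugin would carry something like this (a sketch):

```yaml
metadata:
  labels:
    kubevirt.io/network-binding-plugin: passt
```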
How will the admission webhook be able to identify the relevant container?
Addressed it in the text.
EdDev left a comment:
Thank you for the proposal.
The following general points should be considered and added:
- Effort cost estimation for each option.
> For some plugins, such as passt, there is a need to execute an additional binary in the compute container.
> Since this binary has its own CPU and memory limits, they should be somehow accounted for.
Please do not limit it to running an additional binary, the addition in memory or other resources may come from different reasons (e.g. libvirt itself requiring more memory due to the expected configuration).
Done.
> Alternatives:
> 1. Manually setting the VM's resources:
> The user can override KubeVirt's algorithms and set resource requirements.
Does the current logic that adds overhead per internal core logic collide with manually adding an overhead? Specifically, if one explicitly specifies resource requests, are these increased by KubeVirt's logic, or does something else happen?
The way it works is as follows:
A VM could be defined with both guest memory and a memory resource specification:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-cirros
spec:
  template:
    spec:
      domain:
        memory:
          guest: 128Mi
        resources:
          requests:
            memory: 640Mi # 128Mi for the guest + 512Mi for the network binding plugin
```

The virt-launcher pod's compute container will have a memory request which is the sum of:
- The guest VM's memory (128Mi in this example).
- The memory overhead for KubeVirt's components (its size depends on the VMI's spec).
- An arbitrary memory overhead (512Mi in this example).

The domain XML will contain the guest's memory specification (128Mi in this example).
As a side note, it is also possible to specify a memory request with less memory than the guest requires.
> Cons:
> - Error prone
> - The user does not take into account the overhead considerations KubeVirt takes when templating a virt-launcher pod.
I guess this is related to my previous question. Can you please confirm this is indeed the case?
I would be surprised if this is how it works.
This is not true.
I removed this line.
> - The API will probably evolve as additional plugins will be created.
> - May require cluster admins to adjust plugin resources during the plugin's lifecycle.
Like with the sidecar previously, these points are not clear.
Discussed this point offline, adjusted the text.
> 2. Additional API for compute container resource overhead:
>
> The network binding plugin API in the KubeVirt CR could receive an additional input field to specify the resource requirements overhead for the compute container:
There is a need to explicitly specify if this overhead is added per plugin type or per plugin usage count (i.e. per the number of interfaces referencing it from the VM).
It is also important to specify if it depends on any other field/setup.
The previous sidecar resources depend on the existence of a sidecar, but this one may not have such a dependency.
Done.
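To make the per-usage question concrete: a single VM can reference the same plugin from several interfaces, e.g. (a sketch using the VMI interface binding API; is the overhead then added once, or once per interface?):

```yaml
spec:
  domain:
    devices:
      interfaces:
      - name: net1
        binding:
          name: passt
      - name: net2
        binding:
          name: passt
```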
Force-pushed from 416926c to e0b43e5.
Force-pushed from d4eb851 to c33498b.
Signed-off-by: Orel Misan <[email protected]>
Force-pushed from c33498b to 591c3a4.
EdDev left a comment:
Thank you!
> ##### Compute Container Resource Overhead
>
> For some plugins, an additional resource consumption can be expected from the virt-launcher pod compute container.
> For example, there could be a need to execute an additional binary in the compute container.
Could you please explain what binary can be executed and how it gets there?
How can an external plugin rely on the fact that a specific binary is part of the virt-launcher image?
This example was derived from the passt use case, in which the passt binary is shipped as part of the virt-launcher compute image.
Another example could be that the plugin configures the domain in a way that causes the virt-stack to consume additional compute resources that KubeVirt cannot account for.
> For some plugins, an additional resource consumption can be expected from the virt-launcher pod compute container.
> For example, there could be a need to execute an additional binary in the compute container.
> Since this binary has its own CPU and memory limits, they should be somehow accounted for.
I think we should speak of a specific functionality and not about arbitrary binaries; virt-launcher should be well defined.
If we know what functionality we're enabling requires additional overhead, KubeVirt needs to know the resources it requires, similar to other aspects accounted for in GetMemoryOverhead.
I would say that if we want an API for this, it should be based on a functionality, not a binding, and on each resource consumption of it.
The plugin can perform actions that affect KubeVirt in ways it cannot account for.
Thus the need to externally add resources to the compute container.
Thank you for reviewing this proposal @vladikr, sorry it took some time to respond.
On behalf of SIG-compute, can one of @jean-edouard @stu-gott @enp0s3 @vladikr please re-review to move this forward? I can merge if I receive an approve or a second lgtm from one of you. Thank you.
Pull requests that are marked with [...] After that period the bot marks them with the label [...]
/label needs-approver-review
/cc
Deferring the discussion until after the end-of-year holidays.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale.

/remove-lifecycle stale

Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale.

/remove-lifecycle stale

Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale.

/remove-lifecycle stale
What this PR does / why we need it:
Some network binding plugins require compute resource overhead. For example, the passt plugin requires additional memory overhead in the virt-launcher pod's compute container.
This PR suggests several alternatives to address this issue.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note: