
Conversation

@bgaussen

What this PR does / why we need it:

This design proposal aims at implementing a new network binding plugin to support vhostuser interfaces. This will allow a fast userspace datapath when used with a userspace dataplane such as OVS-DPDK or VPP.
This design proposal takes into consideration the socket-sharing issues between KubeVirt and dataplane pods.

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

  • Design: A design document was considered and is present.
  • PR: The PR description is expressive enough and will help future contributors
  • Community: Announcement to kubevirt-dev was considered

Release note:

NONE

This design proposal implements a new network binding plugin to support vhostuser interfaces.
This will allow fast userspace datapath when used with a userspace dataplane like OVS-DPDK or VPP.
This design proposal takes into consideration sockets sharing issues between kubevirt and dataplane pods.

Signed-off-by: Benoît Gaussen <[email protected]>
@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label May 22, 2024
@kubevirt-bot
Contributor

Hi @bgaussen. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

- hides the KubeVirt implementation details. Currently, you need to know where the KubeVirt sockets are located in the virt-launcher filesystem. Potentially, if we change the directory path for the sockets, this would break the CNI plugin
- can isolate the resources dedicated to that particular plugin

The drawback is that the `virt-launcher` pod would need to run with `privileged: true` in order to do the bind mount.
Member

I don't think this is true. The directory can be exposed by virt-handler, hence no need for a privileged virt-launcher. It would also be unfeasible, since virt-launcher is untrusted.

Author

I don't have a clear view of the interaction between virt-handler, virt-launcher, and the exposed directories. I need to check that.

- dataplane: 1
This single resource is requested by the userspace dataplane, and adds a `/var/run/vhost_sockets` mount to the dataplane pod.
- vhostuser sockets: n
These are as many resources as we want to handle, and are requested by the `virt-launcher` pod using the vhostuser plugin. This makes the device plugin create a per-pod directory like `/var/run/vhost_sockets/<launcher-id>` and mount it into the `virt-launcher` pod.
Member

Have you tried this? I still worry that the directory created by the DP will still have the original issue with HostToContainer mount propagation.

Author

The only bind mount needed is the one the DP will push through kubelet. As there will be no further bind mounts inside the pods, and as it will not be necessary for the CNI to do one either, there should be no issue with mountPropagation.

Member

Still, I'm not sure how kubelet mounts the directory inside the pod. It might be that it uses HostToContainer, in which case the socket won't be visible in the host directory. I would really like to check it with real code.

Author

@bgaussen bgaussen May 31, 2024

I understand your doubts ;) So I tested the scenario with the generic-device-plugin and the following resource definitions:

        - --domain
        - dataplane.io
        - --device
        - |
          name: dataplane
          groups:
            - count: 1
              paths:
                - path: /var/lib/sockets
                  mountPath: /var/lib/sockets
                  type: Mount
                  permissions: mrw
        - --device
        - |
          name: sockets
          groups:
            - paths:
                - path: /var/lib/sockets/pod*
                  mountPath: /var/lib/socket
                  type: Mount
                  permissions: mrw

When a pod's container requests a dataplane.io/sockets resource, it gets assigned one of the existing /var/lib/sockets/podXXX directories, which is bind mounted to /var/lib/socket in the container. Any file created there by the container is available on the host in /var/lib/sockets/podXXX.

If we have a dataplane pod requesting dataplane.io/dataplane, the whole /var/lib/sockets directory is bind mounted to /var/lib/sockets, and the dataplane container can see the files created in /var/lib/sockets/podXXX.
The reverse is also true: if the dataplane container creates a file in /var/lib/sockets/podXXX, it is available to the other pod's container.

I tested with only unprivileged pods.
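
For illustration, here is a minimal sketch of the two kinds of pods used in this test, assuming the `dataplane.io` domain and device names from the configuration above (pod names and images are placeholders):

```yaml
# Consumer pod: requests one socket directory; the device plugin bind mounts
# one of the host's /var/lib/sockets/podXXX directories to /var/lib/socket.
apiVersion: v1
kind: Pod
metadata:
  name: socket-consumer
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        limits:
          dataplane.io/sockets: "1"
---
# Dataplane pod: requests the single "dataplane" device; the whole host
# /var/lib/sockets directory is bind mounted to /var/lib/sockets.
apiVersion: v1
kind: Pod
metadata:
  name: dataplane
spec:
  containers:
    - name: vswitch
      image: registry.k8s.io/pause:3.9
      resources:
        limits:
          dataplane.io/dataplane: "1"
```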

By the way, I checked the propagation option of the target mount; it's the default, private.

/ # findmnt -o TARGET,PROPAGATION /var/lib/one-socket
TARGET              PROPAGATION
/var/lib/one-socket private

But as long as we don't create a new mount in that target, there is no propagation issue.

- dataplane: 1
This single resource is requested by the userspace dataplane, and adds a `/var/run/vhost_sockets` mount to the dataplane pod.
- vhostuser sockets: n
These are as many resources as we want to handle, and are requested by the `virt-launcher` pod using the vhostuser plugin. This makes the device plugin create a per-pod directory like `/var/run/vhost_sockets/<launcher-id>` and mount it into the `virt-launcher` pod.
Member

Is the pod ID known by the CNI plugin or outside KubeVirt?

Author

The DP can push an annotation to the pod with an ID it generates. So it's not really the virt-launcher pod ID, as that does not exist at the time the DP is called.
The CNI can read the pod annotation and use it to configure the vhostuser socket with the right path in the data plane.

We still have to take care of directory and socket permissions (and SELinux categories?).

## API Examples
(tangible API examples used for discussion)
Member

Please still add the examples here; it helps with the reviews and once this proposal is merged.

@EdDev
Member

EdDev commented May 22, 2024

/sig-network
/assign

Member

@EdDev EdDev left a comment

I revisited this a bit late, sorry.

There are several points which I do not fully understand yet.
I would be interested in having a meeting to discuss this in detail, hopefully clarifying a few points and allowing this to be pushed forward.

The configuration of the `vhostuser` secondary interfaces in the dataplane is the responsibility of Multus and a CNI such as the `userspace CNI`.

## Definition of Users
Users of the feature are everyone who deploys a VM.
Member

Usually we have these basic users:

  • Cluster Admin
  • VM User
  • Guest User

In this case, I think you also have:

  • Network Binding Plugin Developer

- As a user, I want the `vhostuser` interface to be configured with a specific MAC address.
- As a user, I want to enable multi-queue on the `vhostuser` interface.
- As a Network Binding Plugin developer, I want the shared socket path to be accessible to the virt-launcher pod.
- As a CNI developer, I want to access the shared vhostuser sockets from the Multus pod.
Member

I think this one is also under Network Binding Plugin developer.

- As a CNI developer, I want to access the shared vhostuser sockets from the Multus pod.

## Repos
KubeVirt repo, and most specifically cmd/sidecars.
Member

Can you please provide links for clarity?

1. Creates a new interface with `type='vhostuser'`
2. Sets the MAC address if specified in the VMI spec
3. If `networkInterfaceMultiqueue` is set to `true`, adds the number of queues calculated from the number of cores of the VMI
4. Adds `memAccess='shared'` to all NUMA cell elements
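
For illustration, the steps above would translate into domain XML fragments along these lines (a hedged sketch; the MAC address, queue count, NUMA sizing and socket path are illustrative only):

```xml
<!-- Step 4: shared memory access on every NUMA cell -->
<cpu>
  <numa>
    <cell id='0' cpus='0-3' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>
<!-- Steps 1-3: vhostuser interface with MAC address and multi-queue -->
<devices>
  <interface type='vhostuser'>
    <mac address='02:00:00:00:00:01'/>
    <source type='unix' path='/var/run/kubevirt/sockets/poda08a0fcbdea' mode='server'/>
    <model type='virtio'/>
    <driver name='vhost' queues='4'/>
  </interface>
</devices>
```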
Member

It is worth adding a note that the change needs to be idempotent.

To clarify, the hook is called multiple times and the result needs to be consistent between the first and following changes.

Also, I am unsure what the side effects of such a marking are. Please share how such a change will influence the VM.

Author

Hi Ed,

You mean the hook is called one time for each device using the same binding plugin? If so yes we need to add some words about that.

Contributor

The OnDefineDomain hook can be called multiple times by KubeVirt, regardless of the device count.
Thus the sidecar implementation of the OnDefineDomain hook should be idempotent.

Member

You mean the hook is called one time for each device using the same binding plugin? If so yes we need to add some words about that.

Just assume that it can be called many times; you should not assume how many.
Today, it is called twice in a normal flow, for the purpose of calculating some defaults that libvirt provides.
In the past it was called many more times (but we have optimized it since).

In a flow with failures, reconciliation may cause it to be called several times.

In the future, other hook points in the code flow may trigger it.

</interface>
```

This design leverages the existing `sockets` emptyDir mounted in `/var/run/kubevirt/sockets`. This allows the CNI to bind mount the socket emptyDir (`/var/lib/kubelet/pods/<pod uid>/volumes/kubernetes.io~empty-dir/sockets`) to a host directory available to the dataplane pod through a hostPath mount.
Member

I'm a bit confused by the changes this proposal has gone through.
Can you please keep the alternatives and the reasons they were ruled out in an appendix?
E.g. #294 (comment) discussed the DP performing the mounting.

Regarding this intro sentence:

  • How exactly can a CNI perform a mount to the pod/container? At the time the CNI is executed, there is no filesystem namespace defined yet. The CNI is also unlikely to have access to the api-server; it is unsafe and limited to trusted binaries only (e.g. Multus).
  • The diagram describes in step 4 that qemu creates the socket.
    First, it will be nice to see that diagram steps described in details in text as well.
    In general, libvirt handles interfaces in an unmanaged manner, i.e. all it needs is provided to it (e.g. tap device). This is done to allow libvirt to operate in an unprivileged/rootless mode.
    • Is the socket creation possible with the rootless priv?
    • Can the socket be created by something else and then just consumed by libvirt?

Author

Regarding this intro sentence:

  • How exactly can a CNI perform a mount to the pod/container? At the time the CNI is executed, there is no filesystem namespace defined yet. The CNI is also unlikely to have access to the api-server; it is unsafe and limited to trusted binaries only (e.g. Multus).

Indeed, the CNI needs to access the api-server. Multus 3 allowed that through its own kubeconfig.multus. With Multus 4 thick mode, it's no longer possible, and CNIs requesting api-server access need to handle their own credential creation.

  • The diagram describes in step 4 that qemu creates the socket.
    First, it will be nice to see that diagram steps described in details in text as well.
    In general, libvirt handles interfaces in an unmanaged manner, i.e. all it needs is provided to it (e.g. tap device). This is done to allow libvirt to operate in an unprivileged/rootless mode.

    • Is the socket creation possible with the rootless priv?

Yes it is, as long as the filesystem permissions AND SELinux allow it.

  • Can the socket be created by something else and then just consumed by libvirt?

It can. vhostuser sockets are created in server mode and consumed in client mode. Usually qemu is the server and the data plane the client. By the way, the other way around is being deprecated, in ovs-dpdk for example.

Member

Thank you for the clarification.
As a generic policy, I would not recommend accessing the api-server from the CNI binary. It is unsafe and can compromise the cluster, especially when a worker node is compromised.
You can add rules to limit the access and mitigate the vulnerability, but it will be much cleaner if it is avoided.

If the mount path is known in advance, I think we can find alternatives.


## Alternative designs

Some alternative designs were discussed in the [kubevirt-dev mailing-list](https://groups.google.com/g/kubevirt-dev/c/3w_WStrJfZw/m/yWSBpDAKAQAJ).
Member

Please summarize them in an appendix so we can see the full picture in one place.

Comment on lines 95 to 100
This requires implementing a new network binding plugin mechanism through which we could expose the content of a `virt-launcher` directory to an external plugin.
The plugin registration in the KubeVirt resource would define the target directory on the node where the directory should be exposed.

This diagram explains this mechanism.

![kubevirt-plugin-extension](kubevirt-plugin-extension.png)
Member

I do not understand what is suggested here.
In virt-launcher we have a plugin container (the sidecar) and the compute container in which libvirt runs.
We have virt-handler, in which privileged operations may be conducted on the virt-launcher and on the api-server.

  • Do you want a new parameter to be added to the network binding plugin spec to share data of the plugin container with the node, the virt-handler or the compute container in the virt-launcher?
  • I do not understand how this is related to the CNI plugin. I guess this is related to my misunderstanding on how a CNI plugin can mount anything (I asked it in a different thread).

Author

I do not understand what is suggested here. In virt-launcher we have a plugin container (the sidecar) and the compute container in which libvirt runs. We have virt-handler, in which privileged operations may be conducted on the virt-launcher and on the api-server.

I guess both virt-launcher and virt-handler need to be clarified when it comes to network binding plugins.

  • Do you want a new parameter to be added to the network binding plugin spec to share data of the plugin container with the node, the virt-handler or the compute container in the virt-launcher?

In that case it'd be a new network binding plugin spec parameter.

  • I do not understand how this is related to the CNI plugin. I guess this is related to my misunderstanding on how a CNI plugin can mount anything (I asked it in a different thread).

Indeed, the CNI needs to know where the socket is located to configure the data plane with this socket. In the current userspace CNI implementation, the CNI does a bind mount of the socket into the data plane filesystem namespace.

Member

the CNI does a bind mount of the socket into data plane filesystem namespace.

This part I do not understand.
AFAIU there is no socket yet, as that is created by libvirt when the domain is created.
Perhaps we can clarify this by listing all the steps that are actually done.

Author

Sorry, I meant "the CNI does a bind mount of the directory where the socket will be created"...
You can refer to the diagram included in the DP.

@EdDev
Member

EdDev commented Jun 2, 2024

/sig network

@toelke

toelke commented Jun 17, 2024

I only found this discussion now, so I am very late in offering our experience:

We are productively using a self-built vhost-user management:

  1. An admission webhook patches all pods with label kubevirt.io=virt-launcher to contain a hostPath volume (/run/vpp, in our case) and corresponding volumeMounts
  2. A sidecar that patches the XML in a way very similar to this proposal. We add a new PCI subtree for the virtual devices, but I know that we are breaking KubeVirt expectations there. Please note that having <reconnect enabled="yes"/> is useful here (see the sketch below).

In our case, the driver creating the socket also creates a config.json in the same folder, which is read by the sidecar to pass information to it.
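
For reference, the `<reconnect/>` setting mentioned in point 2 is expressed on the libvirt interface source element, roughly like this (a sketch; path, mode and timeout values are illustrative):

```xml
<interface type='vhostuser'>
  <!-- client mode: qemu connects to a socket created by the dataplane driver -->
  <source type='unix' path='/run/vpp/sock1' mode='client'>
    <reconnect enabled='yes' timeout='5'/>
  </source>
  <model type='virtio'/>
</interface>
```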

I will join the meeting on Wednesday.

Contributor

@ormergi ormergi left a comment

Thanks for the proposal

Please see my inline comments

</numa>
</cpu>
<interface type='vhostuser'>
<source type='unix' path='/var/run/kubevirt/sockets/poda08a0fcbdea' mode='server'/>
Contributor

I wonder, is the socket created by the CNI inside the virt-launcher pod?

If so, I think the socket path can be reflected in the Multus network-status annotation under the device-info element, see https://github.com/k8snetworkplumbingwg/device-info-spec/blob/main/SPEC.md#315-vhost-user

The vhost-user device-info can be exposed using the network binding plugin API's downwardAPI.
The sidecar could then read the vhost-user socket from the mounted downwardAPI volume and specify it in the domain XML.

Please see the vDPA example for reference on how the downwardAPI can be used.

Author

Thank you for the hints. Indeed, the new network binding plugin downwardAPI will be helpful!

1. Creates a new interface with `type='vhostuser'`
2. Sets the MAC address if specified in the VMI spec
3. If `networkInterfaceMultiqueue` is set to `true`, adds the number of queues calculated from the number of cores of the VMI
4. Adds `memAccess='shared'` to all NUMA cell elements
Contributor

The OnDefineDomain hook can be called multiple times by KubeVirt, regardless of the device count.
Thus the sidecar implementation of the OnDefineDomain hook should be idempotent.

- As a user, I want to create a VM with one or several `vhostuser` interfaces attached to a userspace dataplane.
- As a user, I want the `vhostuser` interface to be configured with a specific MAC address.
- As a user, I want to enable multi-queue on the `vhostuser` interface
- As a Network Binding Plugin developper, I want the shared socket path to be accessible to virt-launcher pod
Contributor

typo

Suggested change
- As a Network Binding Plugin developper, I want the shared socket path to be accessible to virt-launcher pod
- As a Network Binding Plugin developer, I want the shared socket path to be accessible to virt-launcher pod

@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jun 26, 2024
@bgaussen
Author

Hi all,

I just committed modifications to this design proposal PR to focus on the device-plugin-based solution to facilitate socket sharing, as discussed during last week's meeting.

Regards,

Benoit.

These are as many resources as we want to handle and may represent the number of ports of the dataplane vSwitch.
They are requested through the VM or VMI definition in the resources request spec. In turn, the `virt-launcher` pod will request the same resources.
This makes the device plugin create a per-pod directory like `/var/run/vhost_sockets/<pod-dp-id>` and mount it into the `virt-launcher` pod at the well-known location `/var/run/vhostuser`.
The device plugin has to generate a `pod-dp-id` and push it as an annotation on the `virt-launcher` pod. This will be used later by the CNI, or any component that configures the vhostuser socket in the dataplane, to find the right path.
Contributor

Note that a CNI plugin that interacts with the cluster API (e.g., getting pod annotations) is not a good practice and is better avoided (for example, if the CNI hangs, pod creation hangs).

Author

You're right, and we are trying not to rely on the API server for the CNI to get annotations.
For example, this Multus PR could ease the task.

## Repos
KubeVirt repo, and most specifically [cmd/sidecars](https://github.com/kubevirt/kubevirt/tree/main/cmd/sidecars).

## Design
Contributor

Which kinds of networks is the plugin going to support? (pod network, secondary networks)

Author

The plugin is designed to support secondary networks.

- check the VM is running

# Implementation Phases
1. First implementation of the `network-vhostuser-binding` done

can you link to where that is?

Author

@bgaussen bgaussen Jul 31, 2024

Yuval, sorry it's not yet ready to be shared publicly. We can have a talk if you want to have a look at the current implementation.

@bgaussen
Author

Hi all,

A little update: we're currently implementing the Device Plugin for socket sharing. We're implementing the device-info spec to share device information between the Device Plugin and the CNI, and we'll rely on the downwardAPI support in KubeVirt 1.3 to get the network-status annotation in the network binding plugin (as suggested by @ormergi, thx 😁).

I'll update the Design Proposal ASAP.

@bgaussen
Author

bgaussen commented Sep 6, 2024

Hi All,

We made great progress and implemented Device Plugin (DP), CNI and Network Binding Plugin (NBP):

  • The DP handles socketDirs where vhostuser sockets will be created, and injects the mount into the compute container requesting the socket device.
  • The DP is compliant with the device-info-spec, as pointed out by @ormergi, and creates device plugin devinfo files.
  • The CNI also implements the device-info-spec. That allows it to get the allocated device spec, including the socketDir path, and to push it, through Multus, to the virt-launcher pod annotations (ultimately available in the kubevirt.io/network-info annotation); an illustrative sketch follows this list.
  • The NBP retrieves the socketDir path from the annotation via the downwardAPI and creates the vhostuser interface in the domain XML with the right socket path.
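
As referenced in the CNI bullet above, here is a rough illustration of what a vhost-user entry in the k8s.v1.cni.cncf.io/network-status annotation could look like; the field layout follows the device-info spec, while the network name, version, mode and path values are illustrative only:

```json
[
  {
    "name": "default/userspace-net",
    "interface": "net1",
    "device-info": {
      "type": "vhost-user",
      "version": "1.1.0",
      "vhost-user": {
        "mode": "server",
        "path": "/var/lib/sockets/socketDir04/pod6c270ef2f25"
      }
    }
  }
]
```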

Example of this interface:

    <interface type='vhostuser'>
      <mac address='02:15:7e:00:00:05'/>
      <source type='unix' path='/var/lib/sockets/socketDir04/pod6c270ef2f25' mode='server'/>
      <target dev='pod6c270ef2f25'/>
      <model type='virtio-non-transitional'/>
      <driver name='vhost' queues='4' rx_queue_size='1024' tx_queue_size='1024'/>
      <alias name='ua-net1'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </interface>

where /var/lib/sockets/socketDir04 is a random directory managed by the DP.

So far so good... but...

When we live migrate the VM, a new destination pod is created, with a new socket request to the DP. The socketDir will obviously not be the same as the one in the source pod (e.g. /var/lib/sockets/socketDir09).

But from what I understand, the source domain XML is sent to the destination pod, and qemu tries to start with the source socketDir and fails:

compute {"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"ubuntu-vm1","namespace":"tests","pos":"live-migration-source.go:845","reason":"error encountered during Migrate
ToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: process exited while connecting to monitor: 2024-09-06T13:12:18.279191Z qemu-kvm: -chardev socket,id=charua-net1,path=/var/lib/sockets/
socketDir04/pod6c270ef2f25,server=on: Failed to bind socket to /var/lib/sockets/socketDir04/pod6c270ef2f25: No such file or directory')","timestamp":"2024-09-06T13:12:18.334616Z","uid":"33c51f90-1843-48ed-851f-4cb
11d75af41"}

Do you have any idea how to manage this situation? Is there any plan to implement something through the NBP live-migration methods that could alter the destination domain XML?

As a "temporary" workaround, we intend to use a mutating webhook to inject a shared empty-dir volume in virt-launcher pod spec, that will be shared between NBP and compute containers (mounted in /var/lib/vhost-sockets). NBP will use the info from the kubevirt.io/network-info annotation/downwardAPI to create a link in the shared volume (/var/lib/vhost-sockets/net1 -> /var/lib/sockets/socketDir09).
The vhostuser socket path in domainXML will be /var/lib/vhostsockets/net1/pod6c270ef2f25, insuring that it will be accessible from the compute container and be the same on source and destination pods in live migration case.

Do you think the NBP spec could be amended to add a shared volume between the compute and NBP sidecar containers?

Thanks,

Benoit.

@ormergi
Contributor

ormergi commented Sep 8, 2024

Hi @bgaussen , well done!

It looks like Kubevirt calls sidecar hooks also at the migration target here

Could you please ensure the NBP sidecar executes at the target?

@bgaussen
Author

bgaussen commented Sep 8, 2024

Hi @bgaussen , well done!

It looks like Kubevirt calls sidecar hooks also at the migration target here

Could you please ensure the NBP sidecar executes at the target?

Yes, the sidecar hooks get executed, and from its logs it sent back the modified domain XML with the right socketDir on the destination pod.

But from what I understand here, the source domain is used for MigrateToURI3 libvirt call, with little modification in PrepareDomainForMigration.

@bgaussen
Author

Hi All,

After some private discussion with @alicefr, it seems that the migration domain takes precedence over the one generated at pod creation. From what I understand, it could be difficult to fix this behavior, but I will open an issue for that.

In the meantime, for live migration to work, we need the socket paths to be the same on the source and target pods. Hence the symlink mechanism seems a good way to achieve that.
But we need the Network Binding Plugin to access the compute container mount namespace, with two options:

  1. A new shared emptyDir volume between the compute and NBP sidecar containers.
    It requires an addition to the NBP spec. Could be interesting if other NBPs require such a feature...

  2. Enable the shareProcessNamespace pod feature on virt-launcher. The sidecar container can target one of the compute container processes and access its mount namespace through /proc/PID/root, where the NBP can create the symlink.
    There is already this PR we can benefit from: Refactor containerdisks by accessing the artifacts via proc and pid namespace sharing kubevirt#11845

We already implemented and tested option 2, and it works as expected. The only drawback is that we need to target the virtlogd process. For the other ones (virt-launcher-monitor, virt-launcher, qemu*), /proc/PID/root is not accessible.
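
For illustration, a rough sketch of option 2 as run from the NBP sidecar container with shareProcessNamespace enabled; the process selection, socket directory and link path are illustrative only:

```sh
# Pick a compute-container process whose /proc/<pid>/root is readable from the
# sidecar (per the note above, in practice only virtlogd qualifies).
pid=$(pgrep -o virtlogd)

# Create the immutable-path symlink inside the compute container's mount
# namespace, pointing at the per-pod socket directory allocated by the DP
# (the link target must be a path valid as seen from the compute container).
ln -sfn /var/run/vhostuser/socketDir09 "/proc/${pid}/root/var/lib/vhost-sockets/net1"
```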

We can discuss this in a community meeting and/or mailing list.

Benoit.

…evice-info-spec, downwardAPI and shareProcessNamespace

Signed-off-by: Benoît Gaussen <[email protected]>
@bgaussen bgaussen force-pushed the vhostuser-network-binding-plugin branch from d74c8f7 to 01dfcc6 Compare October 14, 2024 18:16
@kubevirt-bot kubevirt-bot removed the dco-signoff: no Indicates the PR's author has not DCO signed all their commits. label Oct 14, 2024
Member

@EdDev EdDev left a comment

Sorry for the late feedback.
I'll make sure to stay available for this and target this proposal for the v1.6 timeline (especially if we'll need KubeVirt core changes).

- As a CNI Developer, I want to know where vhostuser sockets are located

## Repos
KubeVirt repo, and most specifically [cmd/sidecars](https://github.com/kubevirt/kubevirt/tree/main/cmd/sidecars).
Member

If this will be under a separate repo, it will be better.
We use the existing network binding plugins as references and consider them experimental.
Through a separate repo, ownership and coverage can be focused, without affecting the core project.

That said, I understand that this may be a burden to set up.
Is it an option to dedicate a repo for this?

Author

I understand the KubeVirt repo will host only reference and sample plugins. If it's not possible to add the vhostuser one there, we'll look into hosting it in a separate repo.


#### Network Binding Plugin and Kubevirt requirements

The Network Binding Plugin can then leverage the `downwardAPI` feature available from KubeVirt v1.3.0 in order to retrieve the `kubevirt.io/network-info` annotation values and extract the socket path to configure the interface in the domain XML.
Member

Which Network Binding Plugin component can leverage this, and how? We have a CNI and a sidecar; is there something else considered here?

Author

The sidecar container gets kubevirt.io/network-info thanks to the downwardAPI.

The kubevirt.io/network-info annotation is derived from k8s.v1.cni.cncf.io/network-status, which is added by Multus and the CNI.
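
To make the chain above concrete, here is a hedged sketch of how the pieces could be wired together, based on the KubeVirt network binding plugin API with its `downwardAPI: device-info` support (the image, network names and NetworkAttachmentDefinition are illustrative, and the NetworkBindingPlugins feature gate is assumed to be enabled):

```yaml
# KubeVirt CR: register the vhostuser binding plugin and ask KubeVirt to expose
# the network-info data to the sidecar through the downwardAPI volume.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    network:
      binding:
        vhostuser:
          sidecarImage: quay.io/example/network-vhostuser-binding:main
          downwardAPI: device-info
---
# VMI (trimmed to the relevant fields): a secondary interface using the
# vhostuser binding, backed by a userspace CNI NetworkAttachmentDefinition.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm-vhostuser
spec:
  domain:
    devices:
      interfaces:
        - name: net1
          binding:
            name: vhostuser
  networks:
    - name: net1
      multus:
        networkName: userspace-net
```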


Hence, the Network Binding Plugin needs to use immutable paths to the sockets. This can be achieved by using the interface name (or a hash of it) in symbolic links to the real socket path: `/var/run/kubevirt/vhostuser/net1` -> `/var/run/vhostuser/<socketXX>`.

This requires an enhancement in KubeVirt, and in the Network Binding Plugin KubeVirt CRD spec, in order for the `virt-launcher` pod to have a shared `emptyDir` volume mounted in both the `compute` and `vhostuser-network-binding-plugin` containers.
Member

I think that a working alternative is to deploy a mutating webhook for this.
I am in favor of this sharedDir, either in its customized form (choosing the path, as presented here) or a stricter one in which the path is hard-coded. The latter seems safer and more consistent, but I do not have a strong opinion here.

Member

Investigation: There is already a shared path for the grpc socket between the sidecar and compute container, an idea was raised to check if it can be re-used for this purpose as well.

Author

I think that a working alternative is to deploy a mutating webhook for this. I am in favor of this sharedDir, either in its customized form (choosing the path, as presented here) or a stricter one in which the path is hard-coded. The latter seems safer and more consistent, but I do not have a strong opinion here.

Indeed, we tested and ran this thanks to a mutating webhook. Regarding the customization of the path, you're right, it could be a hardcoded path per plugin. But I think we need to be sure that two plugins do not use the same path, otherwise overwrites could happen.

Investigation: There is already a shared path for the grpc socket between the sidecar and compute container, an idea was raised to check if it can be re-used for this purpose as well.

We initially used this one, but same remark as above.

@sbrivio-rh

By the way,

This design proposal aims at implementing a new network binding plugin to support vhostuser interfaces. This will allow fast userspace datapath when used with a userspace dataplane like OVS-DPDK or VPP.

Now that passt supports vhost-user operation, the integration in KubeVirt is in progress (one step here), and that will be another form of fast user-mode data path.

Perhaps this pull request (I didn't thoroughly go through all the commits, though) should mention this peculiarity ("hardware-attached"? "hardware direct"?) to avoid confusion.

How long is the drop when using Live Migration?

Also by the way, what we do in passt now is to save, migrate, and restore TCP queues using the TCP_REPAIR socket option. I guess one could do something similar with DPDK, eventually (migrating pending queue slots). I'm mentioning this because, as a step to enable passt migration, vhost-user-based migration of virtio-net back-ends was implemented in QEMU (https://patchew.org/QEMU/[email protected]/). A direct back-end could probably do something similar.

binding:
  vhostuser:
    sidecarImage: network-vhostuser-binding:main
    sharedDir: /var/run/kubevirt/vhostuser
Member

Getting back to this.
We want to target this for the v1.6 timeline.

Can you please remove this and add a word or two that the intent is to set it as a well-known fixed name.

Author

Thanks @EdDev for the roadmap update.

Do you already have an idea of this well-known fixed name? I assume it will be generic to accommodate all binding plugins?

Member

@EdDev EdDev Apr 17, 2025

How about /var/run/kubevirt/sidecar/<net-plugin-name> ?
I'm not sure what the other maintainers will say about this, I am still waiting to see their feedback on the VEP tracker.

FYI @vladikr

Author

This would be appropriate! That means one emptyDir per plugin or sidecar?

Member

A plugin that requires a sidecar == one sidecar.
Multiple references to a single plugin still mean one sidecar.

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@EdDev EdDev left a comment

Thank you, it looks great!

From the kubevirt core project (kubevirt/kubevirt) we would like to add the shared volume to enable communication between the compute and the sidecar. All other parts are to be implemented in other repos.

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2025
@kubevirt-bot
Contributor

Pull requests that are marked with lgtm should receive a review
from an approver within 1 week.

After that period the bot marks them with the label needs-approver-review.

/label needs-approver-review

@kubevirt-bot kubevirt-bot added the needs-approver-review Indicates that a PR requires a review from an approver. label Apr 24, 2025
@oshoval

oshoval commented May 7, 2025

Hi @bgaussen
Thanks for the feature and design.

Can you please check the approach we took for the requested shared volume?
kubevirt/kubevirt#14623

The PR refactors the current shared mount so that each container has a dedicated SubPath,
and then in the new sidecar volumeMount you will be able to create a folder and your socket in it.
Thanks

@bgaussen
Author

bgaussen commented May 7, 2025

Hi @oshoval ,

Thank you for the VEP and its implementation!

Can you please check the approach we took for the requested shared volume? kubevirt/kubevirt#14623

The PR refactors the current shared mount so that each container has a dedicated SubPath, and then in the new sidecar volumeMount you will be able to create a folder and your socket in it. Thanks

This looks good to me. Thank you for having made it generic for any binding plugin.
Adapting our vhostuser binding implementation to get the CONTAINER_NAME env var and create the subdirectory in the hook socket shared dir should be quite straightforward.
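
A minimal sketch of that adaptation, assuming the shared hooks mount layout discussed in this thread; the sidecar-side mount path, network name and base directory are assumptions for illustration only:

```sh
# Inside the NBP sidecar container. SIDECAR_MOUNT is wherever the shared hooks
# volume is mounted in this container (assumed path for this sketch).
SIDECAR_MOUNT=/var/run/kubevirt/hooks
mkdir -p "${SIDECAR_MOUNT}/vhostuser/net1"   # directory that will hold the socket symlink

# The same directory as seen from the compute container: the shared volume is
# exposed there under a sub-path named after the sidecar, hence CONTAINER_NAME.
# This is the path to reference in the domain XML.
COMPUTE_PATH="/var/run/kubevirt-hooks/${CONTAINER_NAME}/vhostuser/net1"
echo "socket dir as seen by compute: ${COMPUTE_PATH}"
```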

oshoval added a commit to oshoval/kubevirt that referenced this pull request May 20, 2025
Expose on the sidecar a new env var CONTAINER_NAME,
this way the container will able to know its full path volume mount
on the compute container.
More details can be found on kubevirt/community#294
section "Network Binding Plugin and Kubevirt requirements".

Signed-off-by: Or Shoval <[email protected]>
@oshoval

oshoval commented May 29, 2025

Hi Benoit

kubevirt/kubevirt#14768 was merged.
You just need to get the sidecar's mount path plus the new env var CONTAINER_NAME, as you said, to determine the path on the compute container,
and to create a new folder on the sidecar that will host the new data plane socket.
All the info should be in the PR; please let us know if anything is missing or required.

Please note that the path of a Unix socket is limited to 108 bytes (https://linux.die.net/man/7/unix).
We have enough extra bytes, and I believe it is known, just in case.

Thank you

@bgaussen
Author

bgaussen commented Jun 2, 2025

Thank you @oshoval for your great work.

We'll implement the new env CONTAINER_NAME + dir ASAP in our binding plugin and keep you updated.

@maxpain

maxpain commented Jun 13, 2025

Is it possible to use vhost-user to connect a VM to VPP at the current time? Or is it still WIP?

fra2404 pushed a commit to fra2404/kubevirt that referenced this pull request Jun 21, 2025
Expose on the sidecar a new env var CONTAINER_NAME,
this way the container will able to know its full path volume mount
on the compute container.
More details can be found on kubevirt/community#294
section "Network Binding Plugin and Kubevirt requirements".

Signed-off-by: Or Shoval <[email protected]>
@mw-0

mw-0 commented Jun 25, 2025

@bgaussen is this still something you are working on? Just wondering, as if not I will plan to.

@bgaussen
Author

@bgaussen is this still something you are working on? Just wondering, as if not I will plan to.

We have a small delay in implementing the related commit changes in our plugin.
But it's definitely something we're committed to delivering.

@maxpain, yes we are connecting VMs to a VPP dataplane with this plugin.

@bgaussen
Author

bgaussen commented Sep 8, 2025

Hi all, @oshoval,

We implemented the CONTAINER_NAME env var logic in the vhostuser binding plugin code.

It's working fine for a newly created VM/pod: we use the downwardAPI feature to retrieve the socket path from the kubevirt.io/network-info annotation, we create the symlink to the real socket path (e.g. /var/run/kubevirt-hooks/hook-sidecar-0/vhostuser/net1/pod6c270ef2f25), and the VM starts fine.

However, when we live migrate the VM, the kubevirt.io/network-info annotation is not added to the migration target pod, so the binding plugin can't retrieve the real socket path and can't create the symlink...

I don't know why the downwardAPI/annotation logic doesn't apply to migration target pods (the downwardAPI volume and mount are properly configured in them).

I will create an issue for that.

@oshoval

oshoval commented Sep 9, 2025

Hi @bgaussen
Thanks
Please tag me on the issue you open.
If possible, please include there the PR you are working on and the manifests/commands (such as kubectl patching of the KubeVirt CR),
so I can try it locally.

Thank you
