
[linux-nvidia-6.11][Backport] GPU passthrough cuda support #37

Open
KobaKoNvidia wants to merge 84 commits into base: 24.04_linux-nvidia-6.11 from dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport
Conversation

@KobaKoNvidia (Collaborator) commented Jan 6, 2025

[Description]
Backport patches from [0] [1] [2] to enable GPU passthrough for CUDA.

[0] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=55:57
[1] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=60:61
[2] https://git-master.nvidia.com/r/plugins/gitiles/linux-stable/+log/refs/heads/dev/nic/iommufd_vsmmu-12122024

[Test plan]

  1. Boot the host.
  2. Boot VMs on the host with 1, 2, 3, and 4 GPUs.
  3. Run the following basic checks [3]:
# Get the GPU devices
$ lspci | grep 3D
# Show GPU info
$ nvidia-smi
# The following tests must pass
$ /root/r570/tests/runtime/gflops/gflops
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t texture_simple
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host
  4. Check the host's dmesg [4].

[Misc]

  1. Passed arm64 and amd64 builds in Noble. [5]
  2. There are errors in the host's dmesg. [4]
  3. There are errors in the VMs. [6]

[3], [5]: logs for the VMs, the host's dmesg, and build logs:
https://drive.google.com/drive/folders/1bJYyfSoIR_BmtW20BWp178WXX8tOXhHo?usp=sharing
[4]: These are also observed in 6.11.0-1002-nvidia-64k and 6.8.0-1005-nvidia-adv-64k.

  • Fixed by adding cma=1G to the kernel command line.
Jan 06 07:15:54 localhost kernel: arm-smmu-v3 arm-smmu-v3.36.auto: allocated 524288 entries for cmdq
Jan 06 07:15:54 localhost kernel: cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
Jan 06 07:15:54 localhost kernel: cma: number of available pages: => 0 free of 8192 total pages
  • As per Matt, this is a known issue.
Jan 06 07:15:54 localhost kernel: pci 0009:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
Jan 06 07:15:54 localhost kernel: pci 0009:01:00.0: DOE: [2c8] failed to create mailbox: -5

[6] As per Matt, these are not observed on two-socket platforms.

[    0.685329] arm-smmu-v3 arm-smmu-v3.18.auto: allocated 524288 entries for evtq
[    0.686323] genirq: Setting trigger mode 1 for irq 84 failed (gic_set_type+0x0/0x200)
[    0.686905] arm-smmu-v3 arm-smmu-v3.18.auto: failed to enable evtq irq
[    0.687289] genirq: Setting trigger mode 1 for irq 86 failed (gic_set_type+0x0/0x200)
[    0.687770] arm-smmu-v3 arm-smmu-v3.18.auto: failed to enable gerror irq
[    0.688240] arm-smmu-v3 arm-smmu-v3.19.auto: option mask 0x0

Steve Sistare added 9 commits January 3, 2025 23:19
Export a function that adds pins to an already-pinned huge-page folio.
This allows any range of small pages within the folio to be unpinned later.
For example, pages pinned via memfd_pin_folios and modified by
folio_add_pins could be unpinned via unpin_user_page(s).

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit a2ad1b8
linux-next)
Signed-off-by: Koba Ko <[email protected]>
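
For reference, a minimal sketch of the intended usage pattern on the pinning side; the helper names follow the commit text above, but the exact signatures and headers here are assumptions, not copied from the patch:

    /* Sketch only: pin the huge-page folio backing a memfd range once,
     * then add per-page pins so individual small pages can later be
     * released with unpin_user_page(s). Signatures are assumed. */
    #include <linux/memfd.h>
    #include <linux/mm.h>

    static long pin_range(struct file *memfd, loff_t start, loff_t end,
                          unsigned int npages)
    {
            struct folio *folio;
            pgoff_t offset;
            long nr;

            nr = memfd_pin_folios(memfd, start, end, &folio, 1, &offset);
            if (nr < 0)
                    return nr;

            /* One extra pin per small page we intend to hand out. */
            return folio_add_pins(folio, npages);
    }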
iopt_alloc_iova() takes a uptr argument but only checks for its alignment.
Generalize this to an unsigned address, which can be the offset from the
start of a file in a subsequent patch.  No functional change.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 32383c0
linux-next)
Signed-off-by: Koba Ko <[email protected]>
The starting address in iopt_pages is currently a __user *uptr.
Generalize to allow other types of addresses.  Refactor iopt_alloc_pages()
and iopt_map_user_pages() into address-type specific and common functions.

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Nicolin Chen <[email protected]>
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 99ff06d
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Add local variables for common sub-expressions needed by a subsequent
patch.  No functional change.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit c27f0a6
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Add subroutines for copying folios to a batch.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit ed9178f
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Extend pfn_reader_user() to pin file mappings, by calling
memfd_pin_folios().  Repin at small page granularity, and fill the batch
from folios.  Expand folios to upages for the iopt_pages_fill() path.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 92687c7
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
register memory by passing a memfd plus offset and length.  Implement it
using the memfd_pin_folios() kAPI.

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit f4986a7
linux-next)
Signed-off-by: Koba Ko <[email protected]>
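
For reference, a minimal userspace sketch of the new ioctl; the struct field and flag names below are assumptions based on the description above and should be checked against include/uapi/linux/iommufd.h in this tree:

    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Map [offset, offset+length) of a memfd into an IOAS. */
    static int ioas_map_memfd(int iommufd, __u32 ioas_id, int memfd,
                              __u64 offset, __u64 length, __u64 *iova)
    {
            struct iommu_ioas_map_file cmd = {
                    .size = sizeof(cmd),
                    .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                    .ioas_id = ioas_id,
                    .fd = memfd,
                    .start = offset,        /* offset into the memfd */
                    .length = length,
            };
            int rc = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &cmd);

            if (!rc)
                    *iova = cmd.iova;       /* kernel-chosen IOVA */
            return rc;
    }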
Support file mappings for mediated devices, aka mdevs.  Access is
initiated by the vfio_pin_pages() and vfio_dma_rw() kernel interfaces.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 976a40c
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Add test cases to exercise IOMMU_IOAS_MAP_FILE.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0bcceb1
linux-next)
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from 2ff743b to bb34a26 Compare January 9, 2025 03:44
@KobaKoNvidia KobaKoNvidia changed the title [WIP][linux-nvidia-6.11][Backport] GPU passthrough cuda support [linux-nvidia-6.11][Backport] GPU passthrough cuda support Jan 9, 2025
@KobaKoNvidia KobaKoNvidia requested a review from nvmochs January 9, 2025 03:50
@nvmochs (Collaborator) left a comment

  • Going to need Blackwell support and might as well also pick up a GH devid that Ankit posted a month back so that all GH SKUs have support.

  • For your SOB, wanted to point out that you’re using “koba” and “Kobak” instead of “Koba Ko”. Not sure if Canonical cares about consistency or using your full name like upstream does (I have no opinion on it).

  • Please remove CONFIG_IOMMUFD_TEST / CONFIG_FAULT_INJECTION (and any derivatives) as I don't think those are needed/desired for a production kernel.

  • Please try with IOMMUFD=m instead of =y.

  • All of the OOT patches should have “NVIDIA: SAUCE” in the title, e.g. NVIDIA: SAUCE: <existing title>

  • For these OOT patches, instead of exposing our internal URL in the cherry-pick lineage, can you instead do something like this?

    (cherry picked from commit aa5b07b29d395195d83d39de0b73e347fbb595c7 nvidia/kstable/dev/nic/wip/smmuv3_nesting-v4-1105202)

  • Or maybe even better, pick them from public tech preview?

58a6044 WAR: iommufd/pages: Bypass PFNMAP
5423c6f WAR: Expose PCI PASID capability to userspace
973c582 KVM: arm64: determine memory type from VMA
0369ccf iommu/dma: Support MSIs through nested domains
a5ba867 iommu/arm-smmu-v3: Implement arm_smmu_get_msi_mapping_domain

  • Can these be picked from upstream instead of Nic’s branch / linux-next?

243c870 iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
db0577c iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
0b8ee2b iommu/arm-smmu-v3: Use S2FWB for NESTED domains
1b61efd iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
bfeae85 iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
23cb9b8 iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
5350a7a iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
69924f3 iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
381dc3d iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
d162502 ACPI/IORT: Support CANWBS memory access flag
6524f01 ACPICA: IORT: Update for revision E.f
58f8da6 vfio: Remove VFIO_TYPE1_NESTING_IOMMU

  • Were these needed for arm-smmu-v3 dependencies?

d49d328 iommu/tegra241-cmdqv: Limit CMDs for VCMDQs of a guest owned VINTF
e4ea0a7 iommu/arm-smmu-v3: Start a new batch if new command is not supported
0fc3b19 iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV
ab57303 iommu/arm-smmu-v3: Add struct arm_smmu_impl_ops
0180635 iommu/arm-smmu-v3: Add acpi_smmu_iort_probe_model for impl
c069e9d iommu/arm-smmu-v3: Add ARM_SMMU_OPT_TEGRA241_CMDQV
b92579d iommu/arm-smmu-v3: Make symbols public for CONFIG_TEGRA241_CMDQV
14e0041 iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_init
c5b95a4 iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_build_sync_cmd
b736241 iommu/arm-smmu-v3: Issue a batch of commands to the same cmdq

  • Can these be picked from upstream instead of Nic’s branch?

32e298a Documentation: userspace-api: iommufd: Update vDEVICE
83f599a iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
c57efbd iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
b318b38 iommufd/selftest: Add mock_viommu_cache_invalidate
65185cd iommufd/viommu: Add iommufd_viommu_find_dev helper
e9dd79a iommu: Add iommu_copy_struct_from_full_user_array helper
b0a41e5 iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
ef73641 iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
51b5ab9 iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
e0fc645 iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
09be9cf Documentation: userspace-api: iommufd: Update vIOMMU
eef0c28 iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
6c29886 iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
e7049a8 iommufd/selftest: Add refcount to mock_iommu_device
c55b772 iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
501a752 iommufd/selftest: Add container_of helpers
ecf6a54 iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
8fb1cd3 iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
4827884 iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
e928915 iommufd: Verify object in iommufd_object_finalize/abort()
32ec828 iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
f5d5212 iommufd: Move _iommufd_object_alloc helper to a sharable file
5485214 iommufd: Move struct iommufd_object to public iommufd header

  • Were these needed for [IOMMU] dependencies?

ceae8db iommufd: Selftest coverage for IOMMU_IOAS_MAP_FILE
b701eae iommufd: File mappings for mdev
f3772c3 iommufd: Add IOMMU_IOAS_MAP_FILE
c931828 iommufd: pfn_reader for file mappings
72a849e iommufd: Folio subroutines
8a3920c iommufd: pfn_reader local variables
8dc5690 iommufd: Generalize iopt_pages address
a3a0953 iommufd: Rename uptr in iopt_alloc_iova()
1e79584 mm/gup: Add folio_add_pins()

Prepare for an embedded structure design for driver-level iommufd_viommu
objects:
    // include/linux/iommufd.h
    struct iommufd_viommu {
        struct iommufd_object obj;
        ....
    };

    // Some IOMMU driver
    struct iommu_driver_viommu {
        struct iommufd_viommu core;
        ....
    };

It has to expose struct iommufd_object and enum iommufd_object_type from
the core-level private header to the public iommufd header.

Link: https://patch.msgid.link/r/54a43b0768089d690104530754f499ca05ce0074.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d1b3dad
linux)
Signed-off-by: Koba Ko <[email protected]>
The following patch will add a new vIOMMU allocator that will require this
_iommufd_object_alloc to be sharable with IOMMU drivers (and iommufd too).

Add a new driver.c file that will be built with CONFIG_IOMMUFD_DRIVER_CORE
selected by CONFIG_IOMMUFD, and put the CONFIG_DRIVER under that remaining
to be selectable for drivers to build the existing iova_bitmap.c file.

Link: https://patch.msgid.link/r/2f4f6e116dc49ffb67ff6c5e8a7a8e789ab9e98e.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7d4f46c
linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new IOMMUFD_OBJ_VIOMMU with an iommufd_viommu structure to represent
a slice of physical IOMMU device passed to or shared with a user space VM.
This slice, now a vIOMMU object, is a group of virtualization resources of
a physical IOMMU's, such as:
 - Security namespace for guest owned ID, e.g. guest-controlled cache tags
 - Non-device-affiliated event reporting, e.g. invalidation queue errors
 - Access to a sharable nesting parent pagetable across physical IOMMUs
 - Virtualization of various platform IDs, e.g. RIDs and others
 - Delivery of paravirtualized invalidation
 - Direct assigned invalidation queues
 - Direct assigned interrupts

Add a new viommu_alloc op in iommu_ops, for drivers to allocate their own
vIOMMU structures. And this allocation also needs a free(), so add struct
iommufd_viommu_ops.

To simplify a vIOMMU allocation, provide an iommufd_viommu_alloc() helper.
It's suggested that a driver should embed a core-level viommu structure in
its driver-level viommu struct and call the iommufd_viommu_alloc() helper,
meanwhile the driver can also implement a viommu ops:
    struct my_driver_viommu {
        struct iommufd_viommu core;
        /* driver-owned properties/features */
        ....
    };

    static const struct iommufd_viommu_ops my_driver_viommu_ops = {
        .free = my_driver_viommu_free,
        /* future ops for virtualization features */
        ....
    };

    static struct iommufd_viommu my_driver_viommu_alloc(...)
    {
        struct my_driver_viommu *my_viommu =
                iommufd_viommu_alloc(ictx, my_driver_viommu, core,
                                     my_driver_viommu_ops);
        /* Init my_viommu and related HW feature */
        ....
        return &my_viommu->core;
    }

    static struct iommu_domain_ops my_driver_domain_ops = {
        ....
        .viommu_alloc = my_driver_viommu_alloc,
    };

Link: https://patch.msgid.link/r/64685e2b79dea0f1dc56f6ede04809b72d578935.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6b22d56
linux)
Signed-off-by: Koba Ko <[email protected]>
To support driver-allocated vIOMMU objects, it's required for IOMMU driver
to call the provided iommufd_viommu_alloc helper to embed the core struct.
However, there is no guarantee that every driver will call it and allocate
objects properly.

Make the iommufd_object_finalize/abort functions more robust to verify if
the xarray slot indexed by the input obj->id is having an XA_ZERO_ENTRY,
which is the reserved value stored by xa_alloc via iommufd_object_alloc.

Link: https://patch.msgid.link/r/334bd4dde8e0a88eb30fa67eeef61827cdb546f9.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d56d1e8
linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new ioctl for user space to do a vIOMMU allocation. It must be based
on a nesting parent HWPT, so take its refcount.

IOMMU driver wanting to support vIOMMUs must define its IOMMU_VIOMMU_TYPE_
in the uAPI header and implement a viommu_alloc op in its iommu_ops.

Link: https://patch.msgid.link/r/dc2b8ba9ac935007beff07c1761c31cd097ed780.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 4db97c2
linux)
Signed-off-by: Koba Ko <[email protected]>
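
For reference, a minimal userspace sketch of the allocation flow; struct field names are assumptions based on the description above (see include/uapi/linux/iommufd.h in this tree):

    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Allocate a vIOMMU on top of a nesting parent HWPT. */
    static int viommu_alloc(int iommufd, __u32 dev_id,
                            __u32 nesting_parent_hwpt_id, __u32 type,
                            __u32 *out_viommu_id)
    {
            struct iommu_viommu_alloc cmd = {
                    .size = sizeof(cmd),
                    .type = type,           /* a driver-defined IOMMU_VIOMMU_TYPE_* */
                    .dev_id = dev_id,
                    .hwpt_id = nesting_parent_hwpt_id,
            };
            int rc = ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd);

            if (!rc)
                    *out_viommu_id = cmd.out_viommu_id;
            return rc;
    }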
Allow IOMMU driver to use a vIOMMU object that holds a nesting parent
hwpt/domain to allocate a nested domain.

Link: https://patch.msgid.link/r/2dcdb5e405dc0deb68230564530d989d285d959c.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 69d2689
linux)
Signed-off-by: Koba Ko <[email protected]>
Now a vIOMMU holds a shareable nesting parent HWPT. So, it can act like
that nesting parent HWPT to allocate a nested HWPT.

Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.

Also, add an iommufd_viommu_alloc_hwpt_nested helper to allocate a nested
HWPT for a vIOMMU object. Since a vIOMMU object holds the parent hwpt's
refcount already, increase the refcount of the vIOMMU only.

Link: https://patch.msgid.link/r/a0f24f32bfada8b448d17587adcaedeeb50a67ed.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 13a7501
linux)
Signed-off-by: Koba Ko <[email protected]>
Use these inline helpers to shorten those container_of lines.

Note that one of them goes back and forth between iommu_domain and
mock_iommu_domain, which isn't necessary. So drop its container_of.

Link: https://patch.msgid.link/r/518ec64dae2e814eb29fd9f170f58a3aad56c81c.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit fd6b853
linux)
Signed-off-by: Koba Ko <[email protected]>
A nested domain now can be allocated for a parent domain or for a vIOMMU
object. Rework the existing allocators to prepare for the latter case.

Link: https://patch.msgid.link/r/f62894ad8ccae28a8a616845947fe4b76135d79b.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 18f8199
linux)
Signed-off-by: Koba Ko <[email protected]>
For an iommu_dev that can unplug (so far only this selftest does so), the
viommu->iommu_dev pointer has no guarantee of its life cycle after it is
copied from the idev->dev->iommu->iommu_dev.

Track the user count of the iommu_dev. Postpone the exit routine using a
completion, if refcount is unbalanced. The refcount inc/dec will be added
in the following patch.

Link: https://patch.msgid.link/r/33f28d64841b497eebef11b49a571e03103c5d24.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 8607056
linux)
Signed-off-by: Koba Ko <[email protected]>
Implement the viommu alloc/free functions to increase/reduce refcount of
its dependent mock iommu device. User space can verify this loop via the
IOMMU_VIOMMU_TYPE_SELFTEST.

Link: https://patch.msgid.link/r/9d755a215a3007d4d8d1c2513846830332db62aa.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit db70827
linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new iommufd_viommu FIXTURE and set it up with a vIOMMU object.

Any new vIOMMU feature will be added as a TEST_F under that.

Link: https://patch.msgid.link/r/abe267c9d004b29cb1712ceba2f378209d4b7e01.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7156cd9
linux)
Signed-off-by: Koba Ko <[email protected]>
With the introduction of the new object and its infrastructure, update the
doc to reflect that and add a new graph.

Link: https://patch.msgid.link/r/7e4302064e0d02137c1b1e139342affc0485ed3f.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 87210b1
linux)
Signed-off-by: Koba Ko <[email protected]>
Introduce a new IOMMUFD_OBJ_VDEVICE to represent a physical device (struct
device) against a vIOMMU (struct iommufd_viommu) object in a VM.

This vDEVICE object (and its structure) holds all the infos and attributes
in the VM, regarding the device related to the vIOMMU.

As an initial patch, add a per-vIOMMU virtual ID. This can be:
 - Virtual StreamID on a nested ARM SMMUv3, an index to a Stream Table
 - Virtual DeviceID on a nested AMD IOMMU, an index to a Device Table
 - Virtual RID on a nested Intel VT-D IOMMU, an index to a Context Table
Potentially, this vDEVICE structure would hold some vData for Confidential
Compute Architecture (CCA). Use this virtual ID to index an "vdevs" xarray
that belongs to a vIOMMU object.

Add a new ioctl for vDEVICE allocations. Since a vDEVICE is a connection
of a device object and an iommufd_viommu object, take two refcounts in the
ioctl handler.

Link: https://patch.msgid.link/r/cda8fd2263166e61b8191a3b3207e0d2b08545bf.1730836308.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0ce5c24
linux)
Signed-off-by: Koba Ko <[email protected]>
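
For reference, a minimal userspace sketch of a vDEVICE allocation against an existing vIOMMU; struct field names are assumptions based on the description above:

    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Bind a device to a vIOMMU under a guest-visible virtual ID. */
    static int vdevice_alloc(int iommufd, __u32 viommu_id, __u32 dev_id,
                             __u64 virt_id, __u32 *out_vdevice_id)
    {
            struct iommu_vdevice_alloc cmd = {
                    .size = sizeof(cmd),
                    .viommu_id = viommu_id,
                    .dev_id = dev_id,
                    .virt_id = virt_id,     /* e.g. the virtual StreamID on SMMUv3 */
            };
            int rc = ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &cmd);

            if (!rc)
                    *out_vdevice_id = cmd.out_vdevice_id;
            return rc;
    }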
Add a vdevice_alloc op to the viommu mock_viommu_ops for the coverage of
IOMMU_VIOMMU_TYPE_SELFTEST allocations. Then, add a vdevice_alloc TEST_F
to cover the IOMMU_VDEVICE_ALLOC ioctl.

Link: https://patch.msgid.link/r/4b9607e5b86726c8baa7b89bd48123fb44104a23.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 5778c75
linux)
Signed-off-by: Koba Ko <[email protected]>
This per-vIOMMU cache_invalidate op is like the cache_invalidate_user op
in struct iommu_domain_ops, but wider, supporting device cache (e.g. PCI
ATC invalidations).

Link: https://patch.msgid.link/r/90138505850fa6b165135e78a87b4cc7022869a4.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 67db79d
linux)
Signed-off-by: Koba Ko <[email protected]>
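
For reference, a skeleton of how a driver might wire up the new op; the op signature here is an assumption taken from the description above, not from this patch:

    #include <linux/iommufd.h>

    static int my_viommu_cache_invalidate(struct iommufd_viommu *viommu,
                                          struct iommu_user_data_array *array)
    {
            /* Walk the user-supplied entries and issue the matching IOTLB
             * and device (e.g. ATC) invalidations to the hardware. */
            return 0;
    }

    static const struct iommufd_viommu_ops my_viommu_ops = {
            .cache_invalidate = my_viommu_cache_invalidate,
    };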
With a vIOMMU object, user space can flush any IOMMU related cache that can
be directed via a vIOMMU object. It is similar to the IOMMU_HWPT_INVALIDATE
uAPI, but can cover a wider range than IOTLB, e.g. device/descriptor cache.

Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id,
and reuse the IOMMU_HWPT_INVALIDATE uAPI for vIOMMU invalidations. Drivers
can define different structures for vIOMMU invalidations v.s. HWPT ones.

Since both the HWPT-based and vIOMMU-based invalidation pathways check own
cache invalidation op, remove the WARN_ON_ONCE in the allocator.

Update the uAPI, kdoc, and selftest case accordingly.

Link: https://patch.msgid.link/r/b411e2245e303b8a964f39f49453a5dff280968f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 54ce69e
linux)
Signed-off-by: Koba Ko <[email protected]>
ankita-nv and others added 13 commits January 10, 2025 07:08
Signed-off-by: Ankit Agrawal <[email protected]>
(cherry picked from commit d4223d6db2896ec510bfc57cf018010d07ff3659 nvidia/kstable/dev/nic/iommufd_vsmmu-12122024)
Signed-off-by: Koba Ko <[email protected]>
This is used for GPU memory mapping. The solution is a WAR while waiting
for the upstream solution that would use dmabuf to map the entire range
in a single sequence.

Related topics:
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Ankit Agrawal <[email protected]>
(cherry picked from commit d3d7b64f1a3274e5df04dee1a8062f54a3fa1116 nvidia/kstable/dev/nic/iommufd_vsmmu-12122024)
Signed-off-by: Koba Ko <[email protected]>
Fix typos/spellos in kernel-doc comments for readability.

Fixes: aad37e7 ("iommufd: IOCTLs for the io_pagetable")
Fixes: b7a0855 ("iommu: Add new flag to explictly request PASID capable domain")
Fixes: d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object")
Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7937a1b
linux)
Signed-off-by: Koba Ko <[email protected]>
Commit 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
started using _iommufd_object_alloc() without importing the IOMMUFD
module namespace, resulting in a modpost warning:

  WARNING: modpost: module arm_smmu_v3 uses symbol _iommufd_object_alloc from namespace IOMMUFD, but does not import it.

Commit d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE
using a VIOMMU object") added another warning by using
iommufd_viommu_find_dev():

  WARNING: modpost: module arm_smmu_v3 uses symbol iommufd_viommu_find_dev from namespace IOMMUFD, but does not import it.

Import the IOMMUFD module namespace to resolve the warnings.

Fixes: 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
Link: https://patch.msgid.link/r/20241114-arm-smmu-v3-import-iommufd-module-ns-v1-1-c551e7b972e9@kernel.org
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6d026e6
linux)
Signed-off-by: Koba Ko <[email protected]>
Replace comma between expressions with semicolons.

Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.

Found by inspection.
No functional change intended.
Compile tested only.

Fixes: e3b1be2 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Chen Ni <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7de7d35
linux)
Signed-off-by: Koba Ko <[email protected]>
The function arm_smmu_init_strtab_2lvl uses the expression

((1 << smmu->sid_bits) - 1)

to calculate the largest StreamID value. However, this fails for the
maximum allowed value of SMMU_IDR1.SIDSIZE which is 32. The C standard
states:

"If the value of the right operand is negative or is greater than or
equal to the width of the promoted left operand, the behavior is
undefined."

With smmu->sid_bits being 32, the prerequisites for undefined behavior
are met.  We observed that the value of (1 << 32) is 1 and not 0 as we
initially expected.

Similar bit shift operations in arm_smmu_init_strtab_linear seem to not
be affected, because it appears to be unlikely for an SMMU to have
SMMU_IDR1.SIDSIZE set to 32 but then not support 2-level Stream tables

This issue was found by Ryan Huang <[email protected]> on our team.

Fixes: ce41041 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
Signed-off-by: Daniel Mentz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f63237f
linux)
Signed-off-by: Koba Ko <[email protected]>
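
For reference, a standalone illustration of the undefined shift described above and the usual way to keep it well defined (generic C, not the driver's actual fix):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned int sid_bits = 32;

            /* (1 << sid_bits) is undefined when sid_bits == 32 because the
             * shift count equals the width of int; in practice it was seen
             * to evaluate to 1, giving a largest StreamID of 0. */

            /* Promoting to a 64-bit type keeps the shift well defined. */
            uint64_t max_sid = (1ULL << sid_bits) - 1;

            printf("max StreamID = 0x%llx\n", (unsigned long long)max_sid);
            return 0;
    }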
During boot some of the calls to tegra241_cmdqv_get_cmdq() will happen
in preemptible context. As this function calls smp_processor_id(), if
CONFIG_DEBUG_PREEMPT is enabled, these calls will trigger a series of
"BUG: using smp_processor_id() in preemptible" backtraces.

As tegra241_cmdqv_get_cmdq() only calls smp_processor_id() to use the
CPU number as a factor to balance out traffic on cmdq usage, it is safe
to use raw_smp_processor_id() here.

Cc: <[email protected]>
Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 1f80621
linux)
Signed-off-by: Koba Ko <[email protected]>
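
For reference, a generic illustration of the pattern (not the driver's code): when the CPU number is only a load-balancing hint, the raw variant avoids the CONFIG_DEBUG_PREEMPT splat in preemptible context:

    #include <linux/smp.h>

    static unsigned int pick_cmdq(unsigned int nr_cmdqs)
    {
            /* smp_processor_id() would warn here if preemption is enabled;
             * a possibly-stale CPU number is fine for balancing traffic. */
            return raw_smp_processor_id() % nr_cmdqs;
    }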
When configuring a kernel with PAGE_SIZE=4KB, depending on its setting of
CONFIG_CMA_ALIGNMENT, VCMDQ_LOG2SIZE_MAX=19 could fail the alignment test
and trigger a WARN_ON:
    WARNING: at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3646
    Call trace:
     arm_smmu_init_one_queue+0x15c/0x210
     tegra241_cmdqv_init_structures+0x114/0x338
     arm_smmu_device_probe+0xb48/0x1d90

Fix it by capping max_n_shift to CMDQ_MAX_SZ_SHIFT as SMMUv3 CMDQ does.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit a379971
linux)
Signed-off-by: Koba Ko <[email protected]>
Fix a sparse warning.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 89edbe8
linux)
Signed-off-by: Koba Ko <[email protected]>
…herent

It's observed that, when the first 4GB of system memory was reserved, all
VCMDQ allocations failed (even with the smallest qsz in the last attempt):
    arm-smmu-v3: found companion CMDQV device: NVDA200C:00
    arm-smmu-v3: option mask 0x10
    arm-smmu-v3: failed to allocate queue (0x8000 bytes) for vcmdq0
    acpi NVDA200C:00: tegra241_cmdqv: Falling back to standard SMMU CMDQ
    arm-smmu-v3: ias 48-bit, oas 48-bit (features 0x001e1fbf)
    arm-smmu-v3: allocated 524288 entries for cmdq
    arm-smmu-v3: allocated 524288 entries for evtq
    arm-smmu-v3: allocated 524288 entries for priq

This is because the 4GB reserved memory shifted the entire DMA zone from a
lower 32-bit range (on a system without the 4GB carveout) to higher range,
while the dev->coherent_dma_mask was set to DMA_BIT_MASK(32) by default.

The dma_set_mask_and_coherent() call is done in arm_smmu_device_hw_probe()
of the SMMU driver. So any DMA allocation from tegra241_cmdqv_probe() must
wait until the coherent_dma_mask is correctly set.

Move the vintf/vcmdq structure initialization routine into a different op,
"init_structures". Call it at the end of arm_smmu_init_structures(), where
standard SMMU queues get allocated.

Most of the impl_ops aren't ready until vintf/vcmdq structure are init-ed.
So replace the full impl_ops with an init_ops in __tegra241_cmdqv_probe().

And switch to tegra241_cmdqv_impl_ops later in arm_smmu_init_structures().
Note that tegra241_cmdqv_impl_ops does not link to the new init_structures
op after this switch, since there is no point in having it once it's done.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: Matt Ochs <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/530993c3aafa1b0fc3d879b8119e13c629d12e2b.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 483e0bd
linux)
Signed-off-by: Koba Ko <[email protected]>
This is likely a typo. Drop it.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/13fd3accb5b7ed6ec11cc6b7435f79f84af9f45f.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 2408b81
linux)
Signed-off-by: Koba Ko <[email protected]>
The ioremap() function doesn't return error pointers, it returns NULL
on error so update the error handling.  Also just return directly
instead of calling iounmap() on the NULL pointer.  Calling
iounmap(NULL) doesn't cause a problem on ARM but on other architectures
it can trigger a warning, so it's a bad habit.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 086a3c4
linux)
Signed-off-by: Koba Ko <[email protected]>
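
For reference, a generic illustration of the corrected pattern (not the driver's code): ioremap() returns NULL on failure, so test for NULL and return directly instead of unmapping a failed mapping:

    #include <linux/io.h>
    #include <linux/errno.h>

    struct my_regs { void __iomem *base; };    /* hypothetical holder */

    static int map_regs(struct my_regs *r, phys_addr_t start, size_t size)
    {
            r->base = ioremap(start, size);
            if (!r->base)
                    return -ENOMEM;    /* not IS_ERR(); no iounmap() needed */
            return 0;
    }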
…r_header

Kernel test robot reported a few truncation warnings at the snprintf:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:
	In function ‘tegra241_vintf_free_lvcmdq’:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:56:
	warning: ‘%u’ directive output may be truncated writing between 1 and
	5 bytes into a region of size between 3 and 11 [-Wformat-truncation=]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                                        ^~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:32: note: directive argument
	in the range [0, 65535]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:9: note: ‘snprintf’ output
	between 25 and 37 bytes into a destination of size 32
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  240 |                  vcmdq->vintf->idx, vcmdq->idx, vcmdq->lidx);

Fix by bumping up the size of the header to hold more characters.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit db184a1
linux)
Signed-off-by: Koba Ko <[email protected]>
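
For reference, a standalone illustration of the fix idea (generic C, not the driver's code): size the header buffer for the worst case so the snprintf output can never be truncated:

    #include <stdio.h>

    int main(void)
    {
            /* Worst case "VINTF65535: VCMDQ65535/LVCMDQ65535: " needs 37
             * bytes including the NUL, so 32 was too small; 64 is ample. */
            char header[64];

            snprintf(header, sizeof(header), "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
                     65535u, 65535u, 65535u);
            puts(header);
            return 0;
    }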
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from bb34a26 to 14f468e Compare January 10, 2025 07:39
@KobaKoNvidia (Collaborator, Author) commented Jan 10, 2025

  • Going to need Blackwell support and might as well also pick up a GH devid that Ankit posted a month back so that all GH SKUs have support.

Done

  • For your SOB, wanted to point out that you’re using “koba” and “Kobak” instead of “Koba Ko”. Not sure if Canonical cares about consistency or using your full name like upstream does (I have no opinion on it).

Fixed

  • Please remove CONFIG_IOMMUFD_TEST / CONFIG_FAULT_INJECTION (and any derivatives) as I don’ t think those are needed/desired for a production kernel.

Are these configs also not needed?

                CONFIG_VFIO_CONTAINER=n
                CONFIG_FAILSLAB=n
                CONFIG_FAIL_FUTEX=n
                CONFIG_FAIL_IO_TIMEOUT=n
                CONFIG_FAIL_MAKE_REQUEST=n
                CONFIG_FAIL_PAGE_ALLOC=n
                CONFIG_FAULT_INJECTION_CONFIGFS=n
                CONFIG_FAULT_INJECTION_DEBUG_FS=n
                CONFIG_FAULT_INJECTION_USERCOPY=n
                CONFIG_SCSI_UFS_FAULT_INJECTION=n
  • Please try with IOMMUFD=m instead of =y.

Tried it; it works well and has been pushed.

  • All of the OOT patches should have “NVIDIA: SAUCE” in the title, e.g. NVIDIA: SAUCE:

Done

  • For these OOT patches, instead of exposing our internal URL in the cherry-pick lineage, can you instead do something like this?

    (cherry picked from commit aa5b07b29d395195d83d39de0b73e347fbb595c7 nvidia/kstable/dev/nic/wip/smmuv3_nesting-v4-1105202)

  • Or maybe even better, pick them from public tech preview

58a6044 WAR: iommufd/pages: Bypass PFNMAP

Keep.

5423c6f WAR: Expose PCI PASID capability to userspace

Didn't find it in 24.04_linux-nvidia-adv-6.8.

973c582 KVM: arm64: determine memory type from VMA
0369ccf iommu/dma: Support MSIs through nested domains

Fixed.

a5ba867 iommu/arm-smmu-v3: Implement arm_smmu_get_msi_mapping_domain

24.04_linux-nvidia-adv-6.8 differs significantly here,
so I kept it.

// When applied in arm-smmu-v3.c, it uses a different struct:
+       return &nested_domain->s2_parent->domain;

// Nic's branch, arm-smmu-v3-iommufd.c:
+       return &nested_domain->vsmmu->s2_parent->domain;


> 
> * Can these be picked from upstream instead of Nic’s branch / linux-next?
> 
> > 243c87075b6b iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object 
> > db0577cd9969 iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED 
> > 0b8ee2b03952 iommu/arm-smmu-v3: Use S2FWB for NESTED domains 
> > 1b61efdacdb9 iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED 
> > bfeae856b93f iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC 

Fixed
All of these have landed in upstream Linux, and their SHAs match upstream linux/linux-next.

> > 23cb9b8d7fb7 iommu/arm-smmu-v3: Expose the arm_smmu_attach interface 
> > 5350a7a19745 iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT 
> > 69924f3b3ddd iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info 
> > 381dc3d2887f iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS 
> > d1625022562b ACPI/IORT: Support CANWBS memory access flag 
> > 6524f01cc464 ACPICA: IORT: Update for revision E.f
> > 58f8da6151e9 vfio: Remove VFIO_TYPE1_NESTING_IOMMU

Fixed
All of these are cherry-picked from upstream Linux.

> 
> * Were these needed for arm-smmu-v3 dependencies?
> 
> > d49d328627c1 iommu/tegra241-cmdqv: Limit CMDs for VCMDQs of a guest owned VINTF
> > e4ea0a7fe746 iommu/arm-smmu-v3: Start a new batch if new command is not supported
> > 0fc3b19e9086 iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV
> > ab5730394cc7 iommu/arm-smmu-v3: Add struct arm_smmu_impl_ops
> > 01806359e8bb iommu/arm-smmu-v3: Add acpi_smmu_iort_probe_model for impl
> > c069e9d843f0 iommu/arm-smmu-v3: Add ARM_SMMU_OPT_TEGRA241_CMDQV
> > b92579d70cc5 iommu/arm-smmu-v3: Make symbols public for CONFIG_TEGRA241_CMDQV
> > 14e0041466fe iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_init
> > c5b95a42b728 iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_build_sync_cmd
> > b73624126893 iommu/arm-smmu-v3: Issue a batch of commands to the same cmdq

Yes, these are necessary for clean cherry-picks.

> 
> * Can these be picked from upstream instead of Nic’s branch?
> 
> > 32e298a41eb1 Documentation: userspace-api: iommufd: Update vDEVICE
> > 83f599a5429b iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
> > c57efbde8611 iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
> > b318b38462f0 iommufd/selftest: Add mock_viommu_cache_invalidate
> > 65185cd166f2 iommufd/viommu: Add iommufd_viommu_find_dev helper
> > e9dd79a84fca iommu: Add iommu_copy_struct_from_full_user_array helper
> > b0a41e535091 iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
> > ef73641b75db iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
> > 51b5ab923670 iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
> > e0fc645b97d2 iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
> > 09be9cfe9bf5 Documentation: userspace-api: iommufd: Update vIOMMU
> > eef0c2899324 iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
> > 6c29886454f3 iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
> > e7049a846731 iommufd/selftest: Add refcount to mock_iommu_device
> > c55b772af07f iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
> > 501a75218081 iommufd/selftest: Add container_of helpers
> > ecf6a549fe19 iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
> > 8fb1cd325aa1 iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
> > 48278842d59a iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
> > e9289153d093 iommufd: Verify object in iommufd_object_finalize/abort()
> > 32ec82860075 iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
> > f5d5212cea7c iommufd: Move _iommufd_object_alloc helper to a sharable file
> > 5485214bd221 iommufd: Move struct iommufd_object to public iommufd header

Fixed
All of these have landed in upstream Linux, and their SHAs match upstream.

> 
> * Were these needed for [IOMMU] dependencies?
> 
> > ceae8dbfc870 iommufd: Selftest coverage for IOMMU_IOAS_MAP_FILE
> > b701eae92c15 iommufd: File mappings for mdev
> > f3772c39affc iommufd: Add IOMMU_IOAS_MAP_FILE
> > c931828a56b7 iommufd: pfn_reader for file mappings
> > 72a849e0e181 iommufd: Folio subroutines
> > 8a3920c70678 iommufd: pfn_reader local variables
> > 8dc5690c05a7 iommufd: Generalize iopt_pages address
> > a3a0953bc987 iommufd: Rename uptr in iopt_alloc_iova()
> > 1e795841674c mm/gup: Add folio_add_pins()

Yes, these are necessary for clean cherry-picks.
I didn't keep the history for these; if you need it, I can provide it.
Thanks

NVIDIA is planning to productize a new Grace Hopper superchip
SKU with device ID 0x2348.

Add the SKU devid to nvgrace_gpu_vfio_pci_table.

Signed-off-by: Ankit Agrawal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
(cherry picked from commit 12cd88a
linux)
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from 14f468e to f92ae2b Compare January 10, 2025 09:08
@nvmochs (Collaborator) commented Jan 10, 2025

Couple of follow-ups...

  • Please remove CONFIG_IOMMUFD_TEST / CONFIG_FAULT_INJECTION (and any derivatives) as I don’ t think those are needed/desired for a production kernel.

Are these configs also not needed?

                CONFIG_VFIO_CONTAINER=n
                CONFIG_FAILSLAB=n
                CONFIG_FAIL_FUTEX=n
                CONFIG_FAIL_IO_TIMEOUT=n
                CONFIG_FAIL_MAKE_REQUEST=n
                CONFIG_FAIL_PAGE_ALLOC=n
                CONFIG_FAULT_INJECTION_CONFIGFS=n
                CONFIG_FAULT_INJECTION_DEBUG_FS=n
                CONFIG_FAULT_INJECTION_USERCOPY=n
                CONFIG_SCSI_UFS_FAULT_INJECTION=n

I think CONFIG_VFIO_CONTAINER=n is needed to meet the dependency logic for CONFIG_IOMMUFD_VFIO_CONTAINER=y

The others I think can be removed and were only added to satisfy CONFIG_FAULT_INJECTION.

In your latest version of the annotations commit, it looks like you are still setting CONFIG_IOMMUFD_TEST.


5423c6f WAR: Expose PCI PASID capability to userspace

Didn't find it in 24.04_linux-nvidia-adv-6.8.

This was because the commit title changed; in 24.04_linux-nvidia-adv-6.8 it is "WAR: vfio/pci: Report PASID capability".

Let's keep the one from Nic's tree since the title is more descriptive.


  • Were these needed for [IOMMU] dependencies?

ceae8db iommufd: Selftest coverage for IOMMU_IOAS_MAP_FILE
b701eae iommufd: File mappings for mdev
f3772c3 iommufd: Add IOMMU_IOAS_MAP_FILE
c931828 iommufd: pfn_reader for file mappings
72a849e iommufd: Folio subroutines
8a3920c iommufd: pfn_reader local variables
8dc5690 iommufd: Generalize iopt_pages address
a3a0953 iommufd: Rename uptr in iopt_alloc_iova()
1e79584 mm/gup: Add folio_add_pins()

Yes, for clean cherry-pick, these are necessary
I didn't keep these histories, if you need, i can provide it.
Thanks

Thanks for clarifying, I am fine if we keep these from linux-next.

@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from f92ae2b to a06b946 Compare January 13, 2025 15:25
@nvmochs (Collaborator) commented Jan 14, 2025

Two more comments after looking at the latest branch today:

  • These 3 Blackwell patches also need NVIDIA: SAUCE: to help distinguish them from upstream:

43afaa3 vfio/nvgrace-gpu: Check the HBM training and C2C link status
723fe16 vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
33158e8 vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem

  • When building with your latest branch on Noble + arm64, I encountered a config failure:
check-config: CONFIG_VFIO_IOMMU_TYPE1 changed from m to -: policy<{'amd64': 'm', 'arm64': 'm', 'armhf': 'm', 's390x': 'm'}>)
check-config: 1 config options have changed
make: *** [debian/rules.d/4-checks.mk:15: config-prepare-check-nvidia-64k] Error 1

Looks like this line will be needed in the debian.nvidia-6.11 annotations file:
CONFIG_VFIO_IOMMU_TYPE1 policy<{'amd64': 'm', 'arm64': '-'}>

…d for uncached resmem

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation with the Grace Hopper (GH) superchip that provides a
cache coherent access to CPU and GPU to each other's memory with
an internal proprietary chip-to-chip cache coherent interconnect.

There is a HW defect on GH systems to support the Multi-Instance
GPU (MIG) feature [1] that necessitated the presence of a 1G region
with uncached mapping carved out from the device memory. The 1G
region is shown as a fake BAR (comprising region 2 and 3) to
workaround the issue. This is fixed on the GB systems.

The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.

Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…to the VM

There is a HW defect on Grace Hopper (GH) to support the
Multi-Instance GPU (MIG) feature [1] that necessitated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3)
to workaround the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical
device BAR1 (region2 and 3) to the VM instead of the fake one. This
takes care of both the differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…status

In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.

The onus of checking whether the HBM training has completed thus falls
on the module.

The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace based systems. Thus poll
these registers and check for 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.

While the time is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence keeping it generalized to both the generations.

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from a06b946 to 5858605 Compare January 14, 2025 08:42
@KobaKoNvidia (Collaborator, Author) commented:

Two more comments after looking at the latest branch today:

  • These 3 Blackwell patches also need NVIDIA: SAUCE: to help distinguish them from upstream:

43afaa3 vfio/nvgrace-gpu: Check the HBM training and C2C link status 723fe16 vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM 33158e8 vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem

Fixed "NVIDIA: SUACE: " for these three patches.

  • When building with your latest branch on Noble + arm64, I encountered a config failure:
check-config: CONFIG_VFIO_IOMMU_TYPE1 changed from m to -: policy<{'amd64': 'm', 'arm64': 'm', 'armhf': 'm', 's390x': 'm'}>)
check-config: 1 config options have changed
make: *** [debian/rules.d/4-checks.mk:15: config-prepare-check-nvidia-64k] Error 1

Looks like this line will be needed in the debian.nvidia-6.11 annotations file: CONFIG_VFIO_IOMMU_TYPE1 policy<{'amd64': 'm', 'arm64': '-'}>

Fixed, I ran "debian/rules updateconfigs" to check it again.

For the default CMA size:
In cma_declare_contiguous_nid(), the requested size is aligned up to the alignment value.
The alignment is sanitized by taking the maximum of the specified alignment and CMA_MIN_ALIGNMENT_BYTES.
On the current platform, both the alignment and CMA_MIN_ALIGNMENT_BYTES are 512M, so simply doubling CONFIG_CMA_SIZE_MBYTES to 64 does not result in a larger CMA.
This is why I configured it as 1024.

@mm/cma.c, cma_declare_contiguous_nid()
/* Sanitise input arguments. */
alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES);
    ...
size = ALIGN(size, alignment);

Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid(size 67108864, base 0, limit 301e80000000 alignment 0)
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid, new size 536870912, alignment 536870912, CMA_MIN_ALIGNMENT_BYTES 536870912
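
A worked sketch of the arithmetic above, re-implementing the kernel's ALIGN() in plain C for illustration only (not kernel code):

    #include <stdio.h>

    #define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

    int main(void)
    {
            unsigned long long align = 512ULL << 20;   /* 512M on this platform */

            /* 32M and 64M both round up to a single 512M granule... */
            printf("%lluM\n", ALIGN_UP(64ULL << 20, align) >> 20);    /* 512 */
            /* ...so only a request above 512M (e.g. 1024M) grows the CMA. */
            printf("%lluM\n", ALIGN_UP(1024ULL << 20, align) >> 20);  /* 1024 */
            return 0;
    }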

@nvmochs (Collaborator) commented Jan 14, 2025

For the default CMA size:
In cma_declare_contiguous_nid(), the requested size is aligned up to the alignment value.
The alignment is sanitized by taking the maximum of the specified alignment and CMA_MIN_ALIGNMENT_BYTES.
On the current platform, both the alignment and CMA_MIN_ALIGNMENT_BYTES are 512M, so simply doubling CONFIG_CMA_SIZE_MBYTES to 64 does not result in a larger CMA.
This is why I configured it as 1024.

@mm/cma.c, cma_declare_contiguous_nid()
/* Sanitise input arguments. */
alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES)
...
size = ALIGN(size, alignment);
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid(size 67108864, base 0, limit 301e80000000 alignment 0)
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid, new size 536870912, alignment 536870912, CMA_MIN_ALIGNMENT_BYTES 536870912

This does not take into account the page size and will waste memory for the 4k kernel. So we need to set up the annotations to specify different values depending on the page size.

I tried booting the 1005 4k kernel on sj24 (since your kernel was 64k and I did not want to disturb your workflow) and found that it is using 126M of CMA memory. Part of that is to support vcmdq, which your kernel does not support. To find out how much of that is attributed to vcmdq, I rebooted with vcmdq disabled (arm-smmu-v3.disable_cmdqv=y) and found the system to be using 90M of CMA memory. Given that we will be integrating support for vcmdq into the 6.11 tech preview kernel and that 128M is the next pow2 beyond 90M, I think we should use 128M for the 4k kernel.

Therefore, I believe the annotation commit should be amended with this:
-CONFIG_CMA_SIZE_MBYTES policy<{'arm64': '1024'}>
+CONFIG_CMA_SIZE_MBYTES policy<{'arm64-generic-64k': '1024', 'arm64-generic': '128'}>

virtualization

This adds the following config options to annotations:

            CONFIG_ARM_SMMU_V3_IOMMUFD=y
            CONFIG_IOMMUFD_DRIVER_CORE=y
            CONFIG_IOMMUFD_VFIO_CONTAINER=y
            CONFIG_NVGRACE_GPU_VFIO_PCI=m
            CONFIG_VFIO_CONTAINER=n
            CONFIG_VFIO_IOMMU_TYPE1=-
            CONFIG_TEGRA241_CMDQV=n

For CMA size requirements, the 64K kernel configuration needs 640MB
in the worst-case scenario, while the 4K kernel configuration requires 40MB.
Due to the current CMA alignment requirement of 512MB on the 64k kernel and
128MB on the 4k kernel, use the following defaults:
            For 64k kernel, CONFIG_CMA_SIZE_MBYTES=1024
            For 4k kernel, CONFIG_CMA_SIZE_MBYTES=128

These config options have been defined in debian.master:
            CONFIG_IOMMUFD=m
            CONFIG_IOMMU_IOPF=y

Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(backported from commit 35a55f3 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from 5858605 to 3b27282 Compare January 15, 2025 03:35
@KobaKoNvidia (Collaborator, Author) commented:

For the default CMA size:
In cma_declare_contiguous_nid(), the requested size is aligned up to the alignment value.
The alignment is sanitized by taking the maximum of the specified alignment and CMA_MIN_ALIGNMENT_BYTES.
On the current platform, both the alignment and CMA_MIN_ALIGNMENT_BYTES are 512M, so simply doubling CONFIG_CMA_SIZE_MBYTES to 64 does not result in a larger CMA.
This is why I configured it as 1024.
@mm/cma.c, cma_declare_contiguous_nid()
/* Sanitise input arguments. */
alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES)
...
size = ALIGN(size, alignment);
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid(size 67108864, base 0, limit 301e80000000 alignment 0)
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid, new size 536870912, alignment 536870912, CMA_MIN_ALIGNMENT_BYTES 536870912

This does not take into account the page size and will waste memory for the 4k kernel. So we need to set up the annotations to specify different values depending on the page size.

I tried booting the 1005 4k kernel on sj24 (since your kernel was 64k and I did not want to disturb your workflow) and found that it is using 126M of CMA memory. Part of that is to support vcmdq, which your kernel does not support. To find out how much of that is attributed to vcmdq, I rebooted with vcmdq disabled (arm-smmu-v3.disable_cmdqv=y) and found the system to be using 90M of CMA memory. Given that we will be integrating support for vcmdq into the 6.11 tech preview kernel and that 128M is the next pow2 beyond 90M, I think we should use 128M for the 4k kernel.

Therefore, I believe the annotation commit should be amended with this: -CONFIG_CMA_SIZE_MBYTES policy<{'arm64': '1024'}> +CONFIG_CMA_SIZE_MBYTES policy<{'arm64-generic-64k': '1024', 'arm64-generic': '128'}>

Thanks, updated

@nvmochs (Collaborator) left a comment

No further comments from me.

Acked-by: Matthew R. Ochs [email protected]
