
[linux-nvidia-6.11][Backport] GPU passthrough cuda support #37

Open
KobaKoNvidia wants to merge 84 commits into base: 24.04_linux-nvidia-6.11 from dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport
Conversation

@KobaKoNvidia (Collaborator) commented Jan 6, 2025

[Description]
Backport patches from [0] [1] [2] to enable GPU passthrough for CUDA.

[0] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=55:57
[1] https://docs.google.com/spreadsheets/d/1oLDUSup7xtSJDVLUNKxpCRaXWp4AqTSZfcQdJ_aYAUc/edit?gid=297279919#gid=297279919&range=60:61
[2] https://git-master.nvidia.com/r/plugins/gitiles/linux-stable/+log/refs/heads/dev/nic/iommufd_vsmmu-12122024

[Test plan]

  1. Boot the host.
  2. Boot VMs on the host with 1, 2, 3, and 4 GPUs.
  3. Run the following basic checks [3]:
# Get the GPU devices
$ lspci | grep 3D
# Show GPU info
$ nvidia-smi
# The following tests must pass
$ /root/r570/tests/runtime/gflops/gflops
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t texture_simple
$ /root/r570/tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host
  4. Check the host's dmesg [4].

[Misc]

  1. Passed arm64 and amd64 builds in Noble. [5]
  2. There are errors in the host's dmesg. [4]
  3. There are errors in the VMs. [6]

[3], [5]: logs for the VMs, the host's dmesg, and build logs:
https://drive.google.com/drive/folders/1bJYyfSoIR_BmtW20BWp178WXX8tOXhHo?usp=sharing
[4]: These are also observed in 6.11.0-1002-nvidia-64k and 6.8.0-1005-nvidia-adv-64k.

  • Fixed by adding cma=1G to the kernel command line.
Jan 06 07:15:54 localhost kernel: arm-smmu-v3 arm-smmu-v3.36.auto: allocated 524288 entries for cmdq
Jan 06 07:15:54 localhost kernel: cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
Jan 06 07:15:54 localhost kernel: cma: number of available pages: => 0 free of 8192 total pages
  • As per Matt, this is a known issue.
Jan 06 07:15:54 localhost kernel: pci 0009:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
Jan 06 07:15:54 localhost kernel: pci 0009:01:00.0: DOE: [2c8] failed to create mailbox: -5

[6] As per Matt, these are not observed on two-socket platforms.

[    0.685329] arm-smmu-v3 arm-smmu-v3.18.auto: allocated 524288 entries for evtq
[    0.686323] genirq: Setting trigger mode 1 for irq 84 failed (gic_set_type+0x0/0x200)
[    0.686905] arm-smmu-v3 arm-smmu-v3.18.auto: failed to enable evtq irq
[    0.687289] genirq: Setting trigger mode 1 for irq 86 failed (gic_set_type+0x0/0x200)
[    0.687770] arm-smmu-v3 arm-smmu-v3.18.auto: failed to enable gerror irq
[    0.688240] arm-smmu-v3 arm-smmu-v3.19.auto: option mask 0x0

Steve Sistare added 9 commits January 3, 2025 23:19
Export a function that adds pins to an already-pinned huge-page folio.
This allows any range of small pages within the folio to be unpinned later.
For example, pages pinned via memfd_pin_folios and modified by
folio_add_pins could be unpinned via unpin_user_page(s).

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit a2ad1b8
linux-next)
Signed-off-by: Koba Ko <[email protected]>
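
For reference, a minimal sketch of the intended usage pattern on the pinning side; the helper names follow the commit text above, but the exact signatures and headers here are assumptions, not copied from the patch:

    /* Sketch only: pin the huge-page folio backing a memfd range once,
     * then add per-page pins so individual small pages can later be
     * released with unpin_user_page(s). Signatures are assumed. */
    #include <linux/memfd.h>
    #include <linux/mm.h>

    static long pin_range(struct file *memfd, loff_t start, loff_t end,
                          unsigned int npages)
    {
            struct folio *folio;
            pgoff_t offset;
            long nr;

            nr = memfd_pin_folios(memfd, start, end, &folio, 1, &offset);
            if (nr < 0)
                    return nr;

            /* One extra pin per small page we intend to hand out. */
            return folio_add_pins(folio, npages);
    }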
iopt_alloc_iova() takes a uptr argument but only checks for its alignment.
Generalize this to an unsigned address, which can be the offset from the
start of a file in a subsequent patch.  No functional change.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 32383c0
linux-next)
Signed-off-by: Koba Ko <[email protected]>
The starting address in iopt_pages is currently a __user *uptr.
Generalize to allow other types of addresses.  Refactor iopt_alloc_pages()
and iopt_map_user_pages() into address-type specific and common functions.

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Nicolin Chen <[email protected]>
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 99ff06d
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Add local variables for common sub-expressions needed by a subsequent
patch.  No functional change.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit c27f0a6
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Add subroutines for copying folios to a batch.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit ed9178f
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Extend pfn_reader_user() to pin file mappings, by calling
memfd_pin_folios().  Repin at small page granularity, and fill the batch
from folios.  Expand folios to upages for the iopt_pages_fill() path.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 92687c7
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
register memory by passing a memfd plus offset and length.  Implement it
using the memfd_pin_folios() kAPI.

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit f4986a7
linux-next)
Signed-off-by: Koba Ko <[email protected]>
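
For reference, a minimal userspace sketch of the new ioctl; the struct field and flag names below are assumptions based on the description above and should be checked against include/uapi/linux/iommufd.h in this tree:

    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Map [offset, offset+length) of a memfd into an IOAS. */
    static int ioas_map_memfd(int iommufd, __u32 ioas_id, int memfd,
                              __u64 offset, __u64 length, __u64 *iova)
    {
            struct iommu_ioas_map_file cmd = {
                    .size = sizeof(cmd),
                    .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                    .ioas_id = ioas_id,
                    .fd = memfd,
                    .start = offset,        /* offset into the memfd */
                    .length = length,
            };
            int rc = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &cmd);

            if (!rc)
                    *iova = cmd.iova;       /* kernel-chosen IOVA */
            return rc;
    }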
Support file mappings for mediated devices, aka mdevs.  Access is
initiated by the vfio_pin_pages() and vfio_dma_rw() kernel interfaces.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 976a40c
linux-next)
Signed-off-by: Koba Ko <[email protected]>
Add test cases to exercise IOMMU_IOAS_MAP_FILE.

Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0bcceb1
linux-next)
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from 2ff743b to bb34a26 Compare January 9, 2025 03:44
@KobaKoNvidia KobaKoNvidia changed the title [WIP][linux-nvidia-6.11][Backport] GPU passthrough cuda support [linux-nvidia-6.11][Backport] GPU passthrough cuda support Jan 9, 2025
@KobaKoNvidia KobaKoNvidia requested a review from nvmochs January 9, 2025 03:50
@nvmochs (Collaborator) left a comment

  • Going to need Blackwell support and might as well also pick up a GH devid that Ankit posted a month back so that all GH SKUs have support.

  • For your SOB, wanted to point out that you’re using “koba” and “Kobak” instead of “Koba Ko”. Not sure if Canonical cares about consistency or using your full name like upstream does (I have no opinion on it).

  • Please remove CONFIG_IOMMUFD_TEST / CONFIG_FAULT_INJECTION (and any derivatives) as I don't think those are needed/desired for a production kernel.

  • Please try with IOMMUFD=m instead of =y.

  • All of the OOT patches should have “NVIDIA: SAUCE” in the title, e.g. NVIDIA: SAUCE: <existing title>

  • For these OOT patches, instead of exposing our internal URL in the cherry-pick lineage, can you instead do something like this?

    (cherry picked from commit aa5b07b29d395195d83d39de0b73e347fbb595c7 nvidia/kstable/dev/nic/wip/smmuv3_nesting-v4-1105202)

  • Or maybe even better, pick them from public tech preview?

58a6044 WAR: iommufd/pages: Bypass PFNMAP
5423c6f WAR: Expose PCI PASID capability to userspace
973c582 KVM: arm64: determine memory type from VMA
0369ccf iommu/dma: Support MSIs through nested domains
a5ba867 iommu/arm-smmu-v3: Implement arm_smmu_get_msi_mapping_domain

  • Can these be picked from upstream instead of Nic’s branch / linux-next?

243c870 iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
db0577c iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
0b8ee2b iommu/arm-smmu-v3: Use S2FWB for NESTED domains
1b61efd iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
bfeae85 iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
23cb9b8 iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
5350a7a iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
69924f3 iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
381dc3d iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
d162502 ACPI/IORT: Support CANWBS memory access flag
6524f01 ACPICA: IORT: Update for revision E.f
58f8da6 vfio: Remove VFIO_TYPE1_NESTING_IOMMU

  • Were these needed for arm-smmu-v3 dependencies?

d49d328 iommu/tegra241-cmdqv: Limit CMDs for VCMDQs of a guest owned VINTF
e4ea0a7 iommu/arm-smmu-v3: Start a new batch if new command is not supported
0fc3b19 iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV
ab57303 iommu/arm-smmu-v3: Add struct arm_smmu_impl_ops
0180635 iommu/arm-smmu-v3: Add acpi_smmu_iort_probe_model for impl
c069e9d iommu/arm-smmu-v3: Add ARM_SMMU_OPT_TEGRA241_CMDQV
b92579d iommu/arm-smmu-v3: Make symbols public for CONFIG_TEGRA241_CMDQV
14e0041 iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_init
c5b95a4 iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_build_sync_cmd
b736241 iommu/arm-smmu-v3: Issue a batch of commands to the same cmdq

  • Can these be picked from upstream instead of Nic’s branch?

32e298a Documentation: userspace-api: iommufd: Update vDEVICE
83f599a iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
c57efbd iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
b318b38 iommufd/selftest: Add mock_viommu_cache_invalidate
65185cd iommufd/viommu: Add iommufd_viommu_find_dev helper
e9dd79a iommu: Add iommu_copy_struct_from_full_user_array helper
b0a41e5 iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
ef73641 iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
51b5ab9 iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
e0fc645 iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
09be9cf Documentation: userspace-api: iommufd: Update vIOMMU
eef0c28 iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
6c29886 iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
e7049a8 iommufd/selftest: Add refcount to mock_iommu_device
c55b772 iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
501a752 iommufd/selftest: Add container_of helpers
ecf6a54 iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
8fb1cd3 iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
4827884 iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
e928915 iommufd: Verify object in iommufd_object_finalize/abort()
32ec828 iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
f5d5212 iommufd: Move _iommufd_object_alloc helper to a sharable file
5485214 iommufd: Move struct iommufd_object to public iommufd header

  • Were these needed for [IOMMU] dependencies?

ceae8db iommufd: Selftest coverage for IOMMU_IOAS_MAP_FILE
b701eae iommufd: File mappings for mdev
f3772c3 iommufd: Add IOMMU_IOAS_MAP_FILE
c931828 iommufd: pfn_reader for file mappings
72a849e iommufd: Folio subroutines
8a3920c iommufd: pfn_reader local variables
8dc5690 iommufd: Generalize iopt_pages address
a3a0953 iommufd: Rename uptr in iopt_alloc_iova()
1e79584 mm/gup: Add folio_add_pins()

Prepare for an embedded structure design for driver-level iommufd_viommu
objects:
    // include/linux/iommufd.h
    struct iommufd_viommu {
        struct iommufd_object obj;
        ....
    };

    // Some IOMMU driver
    struct iommu_driver_viommu {
        struct iommufd_viommu core;
        ....
    };

It has to expose struct iommufd_object and enum iommufd_object_type from
the core-level private header to the public iommufd header.

Link: https://patch.msgid.link/r/54a43b0768089d690104530754f499ca05ce0074.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d1b3dad
linux)
Signed-off-by: Koba Ko <[email protected]>
The following patch will add a new vIOMMU allocator that will require this
_iommufd_object_alloc to be sharable with IOMMU drivers (and iommufd too).

Add a new driver.c file that will be built with CONFIG_IOMMUFD_DRIVER_CORE
selected by CONFIG_IOMMUFD, and put the CONFIG_DRIVER under that remaining
to be selectable for drivers to build the existing iova_bitmap.c file.

Link: https://patch.msgid.link/r/2f4f6e116dc49ffb67ff6c5e8a7a8e789ab9e98e.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7d4f46c
linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new IOMMUFD_OBJ_VIOMMU with an iommufd_viommu structure to represent
a slice of physical IOMMU device passed to or shared with a user space VM.
This slice, now a vIOMMU object, is a group of virtualization resources of
a physical IOMMU's, such as:
 - Security namespace for guest owned ID, e.g. guest-controlled cache tags
 - Non-device-affiliated event reporting, e.g. invalidation queue errors
 - Access to a sharable nesting parent pagetable across physical IOMMUs
 - Virtualization of various platform IDs, e.g. RIDs and others
 - Delivery of paravirtualized invalidation
 - Direct assigned invalidation queues
 - Direct assigned interrupts

Add a new viommu_alloc op in iommu_ops, for drivers to allocate their own
vIOMMU structures. And this allocation also needs a free(), so add struct
iommufd_viommu_ops.

To simplify a vIOMMU allocation, provide an iommufd_viommu_alloc() helper.
It's suggested that a driver should embed a core-level viommu structure in
its driver-level viommu struct and call the iommufd_viommu_alloc() helper,
meanwhile the driver can also implement a viommu ops:
    struct my_driver_viommu {
        struct iommufd_viommu core;
        /* driver-owned properties/features */
        ....
    };

    static const struct iommufd_viommu_ops my_driver_viommu_ops = {
        .free = my_driver_viommu_free,
        /* future ops for virtualization features */
        ....
    };

    static struct iommufd_viommu my_driver_viommu_alloc(...)
    {
        struct my_driver_viommu *my_viommu =
                iommufd_viommu_alloc(ictx, my_driver_viommu, core,
                                     my_driver_viommu_ops);
        /* Init my_viommu and related HW feature */
        ....
        return &my_viommu->core;
    }

    static struct iommu_domain_ops my_driver_domain_ops = {
        ....
        .viommu_alloc = my_driver_viommu_alloc,
    };

Link: https://patch.msgid.link/r/64685e2b79dea0f1dc56f6ede04809b72d578935.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6b22d56
linux)
Signed-off-by: Koba Ko <[email protected]>
To support driver-allocated vIOMMU objects, it's required for IOMMU driver
to call the provided iommufd_viommu_alloc helper to embed the core struct.
However, there is no guarantee that every driver will call it and allocate
objects properly.

Make the iommufd_object_finalize/abort functions more robust to verify if
the xarray slot indexed by the input obj->id is having an XA_ZERO_ENTRY,
which is the reserved value stored by xa_alloc via iommufd_object_alloc.

Link: https://patch.msgid.link/r/334bd4dde8e0a88eb30fa67eeef61827cdb546f9.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit d56d1e8
linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new ioctl for user space to do a vIOMMU allocation. It must be based
on a nesting parent HWPT, so take its refcount.

IOMMU driver wanting to support vIOMMUs must define its IOMMU_VIOMMU_TYPE_
in the uAPI header and implement a viommu_alloc op in its iommu_ops.

Link: https://patch.msgid.link/r/dc2b8ba9ac935007beff07c1761c31cd097ed780.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 4db97c2
linux)
Signed-off-by: Koba Ko <[email protected]>
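
For reference, a minimal userspace sketch of the allocation flow; struct field names are assumptions based on the description above (see include/uapi/linux/iommufd.h in this tree):

    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Allocate a vIOMMU on top of a nesting parent HWPT. */
    static int viommu_alloc(int iommufd, __u32 dev_id,
                            __u32 nesting_parent_hwpt_id, __u32 type,
                            __u32 *out_viommu_id)
    {
            struct iommu_viommu_alloc cmd = {
                    .size = sizeof(cmd),
                    .type = type,           /* a driver-defined IOMMU_VIOMMU_TYPE_* */
                    .dev_id = dev_id,
                    .hwpt_id = nesting_parent_hwpt_id,
            };
            int rc = ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd);

            if (!rc)
                    *out_viommu_id = cmd.out_viommu_id;
            return rc;
    }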
Allow IOMMU driver to use a vIOMMU object that holds a nesting parent
hwpt/domain to allocate a nested domain.

Link: https://patch.msgid.link/r/2dcdb5e405dc0deb68230564530d989d285d959c.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 69d2689
linux)
Signed-off-by: Koba Ko <[email protected]>
Now a vIOMMU holds a shareable nesting parent HWPT. So, it can act like
that nesting parent HWPT to allocate a nested HWPT.

Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.

Also, add an iommufd_viommu_alloc_hwpt_nested helper to allocate a nested
HWPT for a vIOMMU object. Since a vIOMMU object holds the parent hwpt's
refcount already, increase the refcount of the vIOMMU only.

Link: https://patch.msgid.link/r/a0f24f32bfada8b448d17587adcaedeeb50a67ed.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 13a7501
linux)
Signed-off-by: Koba Ko <[email protected]>
Use these inline helpers to shorten those container_of lines.

Note that one of them goes back and forth between iommu_domain and
mock_iommu_domain, which isn't necessary. So drop its container_of.

Link: https://patch.msgid.link/r/518ec64dae2e814eb29fd9f170f58a3aad56c81c.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit fd6b853
linux)
Signed-off-by: Koba Ko <[email protected]>
A nested domain now can be allocated for a parent domain or for a vIOMMU
object. Rework the existing allocators to prepare for the latter case.

Link: https://patch.msgid.link/r/f62894ad8ccae28a8a616845947fe4b76135d79b.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 18f8199
linux)
Signed-off-by: Koba Ko <[email protected]>
For an iommu_dev that can unplug (so far only this selftest does so), the
viommu->iommu_dev pointer has no guarantee of its life cycle after it is
copied from the idev->dev->iommu->iommu_dev.

Track the user count of the iommu_dev. Postpone the exit routine using a
completion, if refcount is unbalanced. The refcount inc/dec will be added
in the following patch.

Link: https://patch.msgid.link/r/33f28d64841b497eebef11b49a571e03103c5d24.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 8607056
linux)
Signed-off-by: Koba Ko <[email protected]>
Implement the viommu alloc/free functions to increase/reduce refcount of
its dependent mock iommu device. User space can verify this loop via the
IOMMU_VIOMMU_TYPE_SELFTEST.

Link: https://patch.msgid.link/r/9d755a215a3007d4d8d1c2513846830332db62aa.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit db70827
linux)
Signed-off-by: Koba Ko <[email protected]>
Add a new iommufd_viommu FIXTURE and set it up with a vIOMMU object.

Any new vIOMMU feature will be added as a TEST_F under that.

Link: https://patch.msgid.link/r/abe267c9d004b29cb1712ceba2f378209d4b7e01.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7156cd9
linux)
Signed-off-by: Koba Ko <[email protected]>
With the introduction of the new object and its infrastructure, update the
doc to reflect that and add a new graph.

Link: https://patch.msgid.link/r/7e4302064e0d02137c1b1e139342affc0485ed3f.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 87210b1
linux)
Signed-off-by: Koba Ko <[email protected]>
Introduce a new IOMMUFD_OBJ_VDEVICE to represent a physical device (struct
device) against a vIOMMU (struct iommufd_viommu) object in a VM.

This vDEVICE object (and its structure) holds all the infos and attributes
in the VM, regarding the device related to the vIOMMU.

As an initial patch, add a per-vIOMMU virtual ID. This can be:
 - Virtual StreamID on a nested ARM SMMUv3, an index to a Stream Table
 - Virtual DeviceID on a nested AMD IOMMU, an index to a Device Table
 - Virtual RID on a nested Intel VT-D IOMMU, an index to a Context Table
Potentially, this vDEVICE structure would hold some vData for Confidential
Compute Architecture (CCA). Use this virtual ID to index an "vdevs" xarray
that belongs to a vIOMMU object.

Add a new ioctl for vDEVICE allocations. Since a vDEVICE is a connection
of a device object and an iommufd_viommu object, take two refcounts in the
ioctl handler.

Link: https://patch.msgid.link/r/cda8fd2263166e61b8191a3b3207e0d2b08545bf.1730836308.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0ce5c24
linux)
Signed-off-by: Koba Ko <[email protected]>
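
For reference, a minimal userspace sketch of a vDEVICE allocation against an existing vIOMMU; struct field names are assumptions based on the description above:

    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Bind a device to a vIOMMU under a guest-visible virtual ID. */
    static int vdevice_alloc(int iommufd, __u32 viommu_id, __u32 dev_id,
                             __u64 virt_id, __u32 *out_vdevice_id)
    {
            struct iommu_vdevice_alloc cmd = {
                    .size = sizeof(cmd),
                    .viommu_id = viommu_id,
                    .dev_id = dev_id,
                    .virt_id = virt_id,     /* e.g. the virtual StreamID on SMMUv3 */
            };
            int rc = ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &cmd);

            if (!rc)
                    *out_vdevice_id = cmd.out_vdevice_id;
            return rc;
    }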
Add a vdevice_alloc op to the viommu mock_viommu_ops for the coverage of
IOMMU_VIOMMU_TYPE_SELFTEST allocations. Then, add a vdevice_alloc TEST_F
to cover the IOMMU_VDEVICE_ALLOC ioctl.

Link: https://patch.msgid.link/r/4b9607e5b86726c8baa7b89bd48123fb44104a23.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 5778c75
linux)
Signed-off-by: Koba Ko <[email protected]>
This per-vIOMMU cache_invalidate op is like the cache_invalidate_user op
in struct iommu_domain_ops, but wider, supporting device cache (e.g. PCI
ATC invalidations).

Link: https://patch.msgid.link/r/90138505850fa6b165135e78a87b4cc7022869a4.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 67db79d
linux)
Signed-off-by: Koba Ko <[email protected]>
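
For reference, a skeleton of how a driver might wire up the new op; the op signature here is an assumption taken from the description above, not from this patch:

    #include <linux/iommufd.h>

    static int my_viommu_cache_invalidate(struct iommufd_viommu *viommu,
                                          struct iommu_user_data_array *array)
    {
            /* Walk the user-supplied entries and issue the matching IOTLB
             * and device (e.g. ATC) invalidations to the hardware. */
            return 0;
    }

    static const struct iommufd_viommu_ops my_viommu_ops = {
            .cache_invalidate = my_viommu_cache_invalidate,
    };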
With a vIOMMU object, user space can flush any IOMMU related cache that can
be directed via a vIOMMU object. It is similar to the IOMMU_HWPT_INVALIDATE
uAPI, but can cover a wider range than IOTLB, e.g. device/descriptor cache.

Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id,
and reuse the IOMMU_HWPT_INVALIDATE uAPI for vIOMMU invalidations. Drivers
can define different structures for vIOMMU invalidations v.s. HWPT ones.

Since both the HWPT-based and vIOMMU-based invalidation pathways check own
cache invalidation op, remove the WARN_ON_ONCE in the allocator.

Update the uAPI, kdoc, and selftest case accordingly.

Link: https://patch.msgid.link/r/b411e2245e303b8a964f39f49453a5dff280968f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 54ce69e
linux)
Signed-off-by: Koba Ko <[email protected]>
ankita-nv and others added 13 commits January 10, 2025 07:08
Signed-off-by: Ankit Agrawal <[email protected]>
(cherry picked from commit d4223d6db2896ec510bfc57cf018010d07ff3659 nvidia/kstable/dev/nic/iommufd_vsmmu-12122024)
Signed-off-by: Koba Ko <[email protected]>
This is used for GPU memory mapping. The solution is a WAR while waiting
for the upstream solution that would use dmabuf to map the entire range
in a single sequence.

Related topics:
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Ankit Agrawal <[email protected]>
(cherry picked from commit d3d7b64f1a3274e5df04dee1a8062f54a3fa1116 nvidia/kstable/dev/nic/iommufd_vsmmu-12122024)
Signed-off-by: Koba Ko <[email protected]>
Fix typos/spellos in kernel-doc comments for readability.

Fixes: aad37e7 ("iommufd: IOCTLs for the io_pagetable")
Fixes: b7a0855 ("iommu: Add new flag to explictly request PASID capable domain")
Fixes: d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object")
Link: https://patch.msgid.link/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 7937a1b
linux)
Signed-off-by: Koba Ko <[email protected]>
Commit 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
started using _iommufd_object_alloc() without importing the IOMMUFD
module namespace, resulting in a modpost warning:

  WARNING: modpost: module arm_smmu_v3 uses symbol _iommufd_object_alloc from namespace IOMMUFD, but does not import it.

Commit d68beb2 ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE
using a VIOMMU object") added another warning by using
iommufd_viommu_find_dev():

  WARNING: modpost: module arm_smmu_v3 uses symbol iommufd_viommu_find_dev from namespace IOMMUFD, but does not import it.

Import the IOMMUFD module namespace to resolve the warnings.

Fixes: 69d9b31 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
Link: https://patch.msgid.link/r/20241114-arm-smmu-v3-import-iommufd-module-ns-v1-1-c551e7b972e9@kernel.org
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6d026e6
linux)
Signed-off-by: Koba Ko <[email protected]>
Replace comma between expressions with semicolons.

Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.

Found by inspection.
No functional change intended.
Compile tested only.

Fixes: e3b1be2 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Chen Ni <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7de7d35
linux)
Signed-off-by: Koba Ko <[email protected]>
The function arm_smmu_init_strtab_2lvl uses the expression

((1 << smmu->sid_bits) - 1)

to calculate the largest StreamID value. However, this fails for the
maximum allowed value of SMMU_IDR1.SIDSIZE which is 32. The C standard
states:

"If the value of the right operand is negative or is greater than or
equal to the width of the promoted left operand, the behavior is
undefined."

With smmu->sid_bits being 32, the prerequisites for undefined behavior
are met.  We observed that the value of (1 << 32) is 1 and not 0 as we
initially expected.

Similar bit shift operations in arm_smmu_init_strtab_linear seem to not
be affected, because it appears to be unlikely for an SMMU to have
SMMU_IDR1.SIDSIZE set to 32 but then not support 2-level Stream tables

This issue was found by Ryan Huang <[email protected]> on our team.

Fixes: ce41041 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
Signed-off-by: Daniel Mentz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f63237f
linux)
Signed-off-by: Koba Ko <[email protected]>
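
For reference, a standalone illustration of the undefined shift described above and the usual way to keep it well defined (generic C, not the driver's actual fix):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned int sid_bits = 32;

            /* (1 << sid_bits) is undefined when sid_bits == 32 because the
             * shift count equals the width of int; in practice it was seen
             * to evaluate to 1, giving a largest StreamID of 0. */

            /* Promoting to a 64-bit type keeps the shift well defined. */
            uint64_t max_sid = (1ULL << sid_bits) - 1;

            printf("max StreamID = 0x%llx\n", (unsigned long long)max_sid);
            return 0;
    }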
During boot some of the calls to tegra241_cmdqv_get_cmdq() will happen
in preemptible context. As this function calls smp_processor_id(), if
CONFIG_DEBUG_PREEMPT is enabled, these calls will trigger a series of
"BUG: using smp_processor_id() in preemptible" backtraces.

As tegra241_cmdqv_get_cmdq() only calls smp_processor_id() to use the
CPU number as a factor to balance out traffic on cmdq usage, it is safe
to use raw_smp_processor_id() here.

Cc: <[email protected]>
Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 1f80621
linux)
Signed-off-by: Koba Ko <[email protected]>
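
For reference, a generic illustration of the pattern (not the driver's code): when the CPU number is only a load-balancing hint, the raw variant avoids the CONFIG_DEBUG_PREEMPT splat in preemptible context:

    #include <linux/smp.h>

    static unsigned int pick_cmdq(unsigned int nr_cmdqs)
    {
            /* smp_processor_id() would warn here if preemption is enabled;
             * a possibly-stale CPU number is fine for balancing traffic. */
            return raw_smp_processor_id() % nr_cmdqs;
    }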
When configuring a kernel with PAGE_SIZE=4KB, depending on its setting of
CONFIG_CMA_ALIGNMENT, VCMDQ_LOG2SIZE_MAX=19 could fail the alignment test
and trigger a WARN_ON:
    WARNING: at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3646
    Call trace:
     arm_smmu_init_one_queue+0x15c/0x210
     tegra241_cmdqv_init_structures+0x114/0x338
     arm_smmu_device_probe+0xb48/0x1d90

Fix it by capping max_n_shift to CMDQ_MAX_SZ_SHIFT as SMMUv3 CMDQ does.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit a379971
linux)
Signed-off-by: Koba Ko <[email protected]>
Fix a sparse warning.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 89edbe8
linux)
Signed-off-by: Koba Ko <[email protected]>
…herent

It's observed that, when the first 4GB of system memory was reserved, all
VCMDQ allocations failed (even with the smallest qsz in the last attempt):
    arm-smmu-v3: found companion CMDQV device: NVDA200C:00
    arm-smmu-v3: option mask 0x10
    arm-smmu-v3: failed to allocate queue (0x8000 bytes) for vcmdq0
    acpi NVDA200C:00: tegra241_cmdqv: Falling back to standard SMMU CMDQ
    arm-smmu-v3: ias 48-bit, oas 48-bit (features 0x001e1fbf)
    arm-smmu-v3: allocated 524288 entries for cmdq
    arm-smmu-v3: allocated 524288 entries for evtq
    arm-smmu-v3: allocated 524288 entries for priq

This is because the 4GB reserved memory shifted the entire DMA zone from a
lower 32-bit range (on a system without the 4GB carveout) to higher range,
while the dev->coherent_dma_mask was set to DMA_BIT_MASK(32) by default.

The dma_set_mask_and_coherent() call is done in arm_smmu_device_hw_probe()
of the SMMU driver. So any DMA allocation from tegra241_cmdqv_probe() must
wait until the coherent_dma_mask is correctly set.

Move the vintf/vcmdq structure initialization routine into a different op,
"init_structures". Call it at the end of arm_smmu_init_structures(), where
standard SMMU queues get allocated.

Most of the impl_ops aren't ready until vintf/vcmdq structure are init-ed.
So replace the full impl_ops with an init_ops in __tegra241_cmdqv_probe().

And switch to tegra241_cmdqv_impl_ops later in arm_smmu_init_structures().
Note that tegra241_cmdqv_impl_ops does not link to the new init_structures
op after this switch, since there is no point in having it once it's done.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: Matt Ochs <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/530993c3aafa1b0fc3d879b8119e13c629d12e2b.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 483e0bd
linux)
Signed-off-by: Koba Ko <[email protected]>
This is likely a typo. Drop it.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/13fd3accb5b7ed6ec11cc6b7435f79f84af9f45f.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 2408b81
linux)
Signed-off-by: Koba Ko <[email protected]>
The ioremap() function doesn't return error pointers, it returns NULL
on error so update the error handling.  Also just return directly
instead of calling iounmap() on the NULL pointer.  Calling
iounmap(NULL) doesn't cause a problem on ARM but on other architectures
it can trigger a warning, so it's a bad habit.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 086a3c4
linux)
Signed-off-by: Koba Ko <[email protected]>
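
For reference, a generic illustration of the corrected pattern (not the driver's code): ioremap() returns NULL on failure, so test for NULL and return directly instead of unmapping a failed mapping:

    #include <linux/io.h>
    #include <linux/errno.h>

    struct my_regs { void __iomem *base; };    /* hypothetical holder */

    static int map_regs(struct my_regs *r, phys_addr_t start, size_t size)
    {
            r->base = ioremap(start, size);
            if (!r->base)
                    return -ENOMEM;    /* not IS_ERR(); no iounmap() needed */
            return 0;
    }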
…r_header

Kernel test robot reported a few truncation warnings at the snprintf:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:
	In function ‘tegra241_vintf_free_lvcmdq’:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:56:
	warning: ‘%u’ directive output may be truncated writing between 1 and
	5 bytes into a region of size between 3 and 11 [-Wformat-truncation=]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                                        ^~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:32: note: directive argument
	in the range [0, 65535]
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:9: note: ‘snprintf’ output
	between 25 and 37 bytes into a destination of size 32
  239 |         snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  240 |                  vcmdq->vintf->idx, vcmdq->idx, vcmdq->lidx);

Fix by bumping up the size of the header to hold more characters.

Fixes: 918eb5c ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit db184a1
linux)
Signed-off-by: Koba Ko <[email protected]>
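
For reference, a standalone illustration of the fix idea (generic C, not the driver's code): size the header buffer for the worst case so the snprintf output can never be truncated:

    #include <stdio.h>

    int main(void)
    {
            /* Worst case "VINTF65535: VCMDQ65535/LVCMDQ65535: " needs 37
             * bytes including the NUL, so 32 was too small; 64 is ample. */
            char header[64];

            snprintf(header, sizeof(header), "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
                     65535u, 65535u, 65535u);
            puts(header);
            return 0;
    }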
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from bb34a26 to 14f468e Compare January 10, 2025 07:39
@KobaKoNvidia (Collaborator, Author) commented Jan 10, 2025

  • Going to need Blackwell support and might as well also pick up a GH devid that Ankit posted a month back so that all GH SKUs have support.

Done

  • For your SOB, wanted to point out that you’re using “koba” and “Kobak” instead of “Koba Ko”. Not sure if Canonical cares about consistency or using your full name like upstream does (I have no opinion on it).

Fixed

  • Please remove CONFIG_IOMMUFD_TEST / CONFIG_FAULT_INJECTION (and any derivatives) as I don’ t think those are needed/desired for a production kernel.

Are these configs also not needed?

                CONFIG_VFIO_CONTAINER=n
                CONFIG_FAILSLAB=n
                CONFIG_FAIL_FUTEX=n
                CONFIG_FAIL_IO_TIMEOUT=n
                CONFIG_FAIL_MAKE_REQUEST=n
                CONFIG_FAIL_PAGE_ALLOC=n
                CONFIG_FAULT_INJECTION_CONFIGFS=n
                CONFIG_FAULT_INJECTION_DEBUG_FS=n
                CONFIG_FAULT_INJECTION_USERCOPY=n
                CONFIG_SCSI_UFS_FAULT_INJECTION=n
  • Please try with IOMMUFD=m instead of =y.

Tried it; it works well and has been pushed.

  • All of the OOT patches should have “NVIDIA: SAUCE” in the title, e.g. NVIDIA: SAUCE:

Done

  • For these OOT patches, instead of exposing our internal URL in the cherry-pick lineage, can you instead do something like this?

    (cherry picked from commit aa5b07b29d395195d83d39de0b73e347fbb595c7 nvidia/kstable/dev/nic/wip/smmuv3_nesting-v4-1105202)

  • Or maybe even better, pick them from public tech preview

58a6044 WAR: iommufd/pages: Bypass PFNMAP

Keep.

5423c6f WAR: Expose PCI PASID capability to userspace

Didn't find it in 24.04_linux-nvidia-adv-6.8.

973c582 KVM: arm64: determine memory type from VMA
0369ccf iommu/dma: Support MSIs through nested domains

Fixed.

a5ba867 iommu/arm-smmu-v3: Implement arm_smmu_get_msi_mapping_domain

24.04_linux-nvidia-adv-6.8 differs significantly here,
so I kept it.

// When applied in arm-smmu-v3.c, it uses a different struct:
+       return &nested_domain->s2_parent->domain;

// Nic's branch, arm-smmu-v3-iommufd.c:
+       return &nested_domain->vsmmu->s2_parent->domain;


> 
> * Can these be picked from upstream instead of Nic’s branch / linux-next?
> 
> > 243c87075b6b iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object 
> > db0577cd9969 iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED 
> > 0b8ee2b03952 iommu/arm-smmu-v3: Use S2FWB for NESTED domains 
> > 1b61efdacdb9 iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED 
> > bfeae856b93f iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC 

Fixed
All of these have landed in upstream Linux, and their SHAs match upstream linux/linux-next.

> > 23cb9b8d7fb7 iommu/arm-smmu-v3: Expose the arm_smmu_attach interface 
> > 5350a7a19745 iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT 
> > 69924f3b3ddd iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info 
> > 381dc3d2887f iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS 
> > d1625022562b ACPI/IORT: Support CANWBS memory access flag 
> > 6524f01cc464 ACPICA: IORT: Update for revision E.f
> > 58f8da6151e9 vfio: Remove VFIO_TYPE1_NESTING_IOMMU

Fixed
All of these are cherry-picked from upstream Linux.

> 
> * Were these needed for arm-smmu-v3 dependencies?
> 
> > d49d328627c1 iommu/tegra241-cmdqv: Limit CMDs for VCMDQs of a guest owned VINTF
> > e4ea0a7fe746 iommu/arm-smmu-v3: Start a new batch if new command is not supported
> > 0fc3b19e9086 iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV
> > ab5730394cc7 iommu/arm-smmu-v3: Add struct arm_smmu_impl_ops
> > 01806359e8bb iommu/arm-smmu-v3: Add acpi_smmu_iort_probe_model for impl
> > c069e9d843f0 iommu/arm-smmu-v3: Add ARM_SMMU_OPT_TEGRA241_CMDQV
> > b92579d70cc5 iommu/arm-smmu-v3: Make symbols public for CONFIG_TEGRA241_CMDQV
> > 14e0041466fe iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_init
> > c5b95a42b728 iommu/arm-smmu-v3: Pass in cmdq pointer to arm_smmu_cmdq_build_sync_cmd
> > b73624126893 iommu/arm-smmu-v3: Issue a batch of commands to the same cmdq

Yes, these are necessary for clean cherry-picks.

> 
> * Can these be picked from upstream instead of Nic’s branch?
> 
> > 32e298a41eb1 Documentation: userspace-api: iommufd: Update vDEVICE
> > 83f599a5429b iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
> > c57efbde8611 iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
> > b318b38462f0 iommufd/selftest: Add mock_viommu_cache_invalidate
> > 65185cd166f2 iommufd/viommu: Add iommufd_viommu_find_dev helper
> > e9dd79a84fca iommu: Add iommu_copy_struct_from_full_user_array helper
> > b0a41e535091 iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
> > ef73641b75db iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
> > 51b5ab923670 iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
> > e0fc645b97d2 iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
> > 09be9cfe9bf5 Documentation: userspace-api: iommufd: Update vIOMMU
> > eef0c2899324 iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
> > 6c29886454f3 iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
> > e7049a846731 iommufd/selftest: Add refcount to mock_iommu_device
> > c55b772af07f iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
> > 501a75218081 iommufd/selftest: Add container_of helpers
> > ecf6a549fe19 iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
> > 8fb1cd325aa1 iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
> > 48278842d59a iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
> > e9289153d093 iommufd: Verify object in iommufd_object_finalize/abort()
> > 32ec82860075 iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
> > f5d5212cea7c iommufd: Move _iommufd_object_alloc helper to a sharable file
> > 5485214bd221 iommufd: Move struct iommufd_object to public iommufd header

Fixed
All of these have landed in upstream Linux, and their SHAs match upstream.

> 
> * Were these needed for [IOMMU] dependencies?
> 
> > ceae8dbfc870 iommufd: Selftest coverage for IOMMU_IOAS_MAP_FILE
> > b701eae92c15 iommufd: File mappings for mdev
> > f3772c39affc iommufd: Add IOMMU_IOAS_MAP_FILE
> > c931828a56b7 iommufd: pfn_reader for file mappings
> > 72a849e0e181 iommufd: Folio subroutines
> > 8a3920c70678 iommufd: pfn_reader local variables
> > 8dc5690c05a7 iommufd: Generalize iopt_pages address
> > a3a0953bc987 iommufd: Rename uptr in iopt_alloc_iova()
> > 1e795841674c mm/gup: Add folio_add_pins()

Yes, these are necessary for clean cherry-picks.
I didn't keep the history for these; if you need it, I can provide it.
Thanks

NVIDIA is planning to productize a new Grace Hopper superchip
SKU with device ID 0x2348.

Add the SKU devid to nvgrace_gpu_vfio_pci_table.

Signed-off-by: Ankit Agrawal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
(cherry picked from commit 12cd88a
linux)
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from 14f468e to f92ae2b Compare January 10, 2025 09:08
@nvmochs (Collaborator) commented Jan 10, 2025

Couple of follow-ups...

  • Please remove CONFIG_IOMMUFD_TEST / CONFIG_FAULT_INJECTION (and any derivatives) as I don’ t think those are needed/desired for a production kernel.

Are these configs also not needed?

                CONFIG_VFIO_CONTAINER=n
                CONFIG_FAILSLAB=n
                CONFIG_FAIL_FUTEX=n
                CONFIG_FAIL_IO_TIMEOUT=n
                CONFIG_FAIL_MAKE_REQUEST=n
                CONFIG_FAIL_PAGE_ALLOC=n
                CONFIG_FAULT_INJECTION_CONFIGFS=n
                CONFIG_FAULT_INJECTION_DEBUG_FS=n
                CONFIG_FAULT_INJECTION_USERCOPY=n
                CONFIG_SCSI_UFS_FAULT_INJECTION=n

I think CONFIG_VFIO_CONTAINER=n is needed to meet the dependency logic for CONFIG_IOMMUFD_VFIO_CONTAINER=y

The others I think can be removed and were only added to satisfy CONFIG_FAULT_INJECTION.

In your latest version of the annotations commit, it looks like you are still setting CONFIG_IOMMUFD_TEST.


5423c6f WAR: Expose PCI PASID capability to userspace

Didn't find it in 24.04_linux-nvidia-adv-6.8.

This was because the commit title changed; in 24.04_linux-nvidia-adv-6.8 it is "WAR: vfio/pci: Report PASID capability".

Let's keep the one from Nic's tree since the title is more descriptive.


  • Were these needed for [IOMMU] dependencies?

ceae8db iommufd: Selftest coverage for IOMMU_IOAS_MAP_FILE
b701eae iommufd: File mappings for mdev
f3772c3 iommufd: Add IOMMU_IOAS_MAP_FILE
c931828 iommufd: pfn_reader for file mappings
72a849e iommufd: Folio subroutines
8a3920c iommufd: pfn_reader local variables
8dc5690 iommufd: Generalize iopt_pages address
a3a0953 iommufd: Rename uptr in iopt_alloc_iova()
1e79584 mm/gup: Add folio_add_pins()

Yes, for clean cherry-pick, these are necessary
I didn't keep these histories, if you need, i can provide it.
Thanks

Thanks for clarifying, I am fine if we keep these from linux-next.

@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from f92ae2b to a06b946 Compare January 13, 2025 15:25
@nvmochs (Collaborator) commented Jan 14, 2025

Two more comments after looking at the latest branch today:

  • These 3 Blackwell patches also need NVIDIA: SAUCE: to help distinguish them from upstream:

43afaa3 vfio/nvgrace-gpu: Check the HBM training and C2C link status
723fe16 vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
33158e8 vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem

  • When building with your latest branch on Noble + arm64, I encountered a config failure:
check-config: CONFIG_VFIO_IOMMU_TYPE1 changed from m to -: policy<{'amd64': 'm', 'arm64': 'm', 'armhf': 'm', 's390x': 'm'}>)
check-config: 1 config options have changed
make: *** [debian/rules.d/4-checks.mk:15: config-prepare-check-nvidia-64k] Error 1

Looks like this line will be needed in the debian.nvidia-6.11 annotations file:
CONFIG_VFIO_IOMMU_TYPE1 policy<{'amd64': 'm', 'arm64': '-'}>

…d for uncached resmem

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation with the Grace Hopper (GH) superchip that provides a
cache coherent access to CPU and GPU to each other's memory with
an internal proprietary chip-to-chip cache coherent interconnect.

There is a HW defect on GH systems to support the Multi-Instance
GPU (MIG) feature [1] that necessitated the presence of a 1G region
with uncached mapping carved out from the device memory. The 1G
region is shown as a fake BAR (comprising region 2 and 3) to
workaround the issue. This is fixed on the GB systems.

The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.

Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…to the VM

There is a HW defect on Grace Hopper (GH) to support the
Multi-Instance GPU (MIG) feature [1] that necessitated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3)
to workaround the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical
device BAR1 (region2 and 3) to the VM instead of the fake one. This
takes care of both the differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
…status

In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.

The onus of checking whether the HBM training has completed thus falls
on the module.

The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace based systems. Thus poll
these registers and check for 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.

While the time is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence keeping it generalized to both the generations.

Signed-off-by: Ankit Agrawal <[email protected]>
Ref: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from a06b946 to 5858605 Compare January 14, 2025 08:42
@KobaKoNvidia (Collaborator, Author) commented:

Two more comments after looking at the latest branch today:

  • These 3 Blackwell patches also need NVIDIA: SAUCE: to help distinguish them from upstream:

43afaa3 vfio/nvgrace-gpu: Check the HBM training and C2C link status 723fe16 vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM 33158e8 vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem

Fixed "NVIDIA: SUACE: " for these three patches.

  • When building with your latest branch on Noble + arm64, I encountered a config failure:
check-config: CONFIG_VFIO_IOMMU_TYPE1 changed from m to -: policy<{'amd64': 'm', 'arm64': 'm', 'armhf': 'm', 's390x': 'm'}>)
check-config: 1 config options have changed
make: *** [debian/rules.d/4-checks.mk:15: config-prepare-check-nvidia-64k] Error 1

Looks like this line will be needed in the debian.nvidia-6.11 annotations file: CONFIG_VFIO_IOMMU_TYPE1 policy<{'amd64': 'm', 'arm64': '-'}>

Fixed, I ran "debian/rules updateconfigs" to check it again.

For the default CMA size:
In cma_declare_contiguous_nid(), the requested size is aligned up to the alignment value.
The alignment is sanitized by taking the maximum of the specified alignment and CMA_MIN_ALIGNMENT_BYTES.
On the current platform, both the alignment and CMA_MIN_ALIGNMENT_BYTES are 512M, so simply doubling CONFIG_CMA_SIZE_MBYTES to 64 does not result in a larger CMA.
This is why I configured it as 1024.

@mm/cma.c, cma_declare_contiguous_nid()
/* Sanitise input arguments. */
alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES);
    ...
size = ALIGN(size, alignment);

Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid(size 67108864, base 0, limit 301e80000000 alignment 0)
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid, new size 536870912, alignment 536870912, CMA_MIN_ALIGNMENT_BYTES 536870912
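
A worked sketch of the arithmetic above, re-implementing the kernel's ALIGN() in plain C for illustration only (not kernel code):

    #include <stdio.h>

    #define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

    int main(void)
    {
            unsigned long long align = 512ULL << 20;   /* 512M on this platform */

            /* 32M and 64M both round up to a single 512M granule... */
            printf("%lluM\n", ALIGN_UP(64ULL << 20, align) >> 20);    /* 512 */
            /* ...so only a request above 512M (e.g. 1024M) grows the CMA. */
            printf("%lluM\n", ALIGN_UP(1024ULL << 20, align) >> 20);  /* 1024 */
            return 0;
    }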

@nvmochs (Collaborator) commented Jan 14, 2025

For the default CMA size:
In cma_declare_contiguous_nid(), the requested size is aligned up to the alignment value.
The alignment is sanitized by taking the maximum of the specified alignment and CMA_MIN_ALIGNMENT_BYTES.
On the current platform, both the alignment and CMA_MIN_ALIGNMENT_BYTES are 512M, so simply doubling CONFIG_CMA_SIZE_MBYTES to 64 does not result in a larger CMA.
This is why I configured it as 1024.

@mm/cma.c, cma_declare_contiguous_nid()
/* Sanitise input arguments. */
alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES)
...
size = ALIGN(size, alignment);
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid(size 67108864, base 0, limit 301e80000000 alignment 0)
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid, new size 536870912, alignment 536870912, CMA_MIN_ALIGNMENT_BYTES 536870912

This does not take into account the page size and will waste memory for the 4k kernel. So we need to set up the annotations to specify different values depending on the page size.

I tried booting the 1005 4k kernel on sj24 (since your kernel was 64k and I did not want to disturb your workflow) and found that it is using 126M of CMA memory. Part of that is to support vcmdq, which your kernel does not support. To find out how much of that is attributed to vcmdq, I rebooted with vcmdq disabled (arm-smmu-v3.disable_cmdqv=y) and found the system to be using 90M of CMA memory. Given that we will be integrating support for vcmdq into the 6.11 tech preview kernel and that 128M is the next pow2 beyond 90M, I think we should use 128M for the 4k kernel.

Therefore, I believe the annotation commit should be amended with this:
-CONFIG_CMA_SIZE_MBYTES policy<{'arm64': '1024'}>
+CONFIG_CMA_SIZE_MBYTES policy<{'arm64-generic-64k': '1024', 'arm64-generic': '128'}>

virtualization

This adds the following config options to annotations:

            CONFIG_ARM_SMMU_V3_IOMMUFD=y
            CONFIG_IOMMUFD_DRIVER_CORE=y
            CONFIG_IOMMUFD_VFIO_CONTAINER=y
            CONFIG_NVGRACE_GPU_VFIO_PCI=m
            CONFIG_VFIO_CONTAINER=n
            CONFIG_VFIO_IOMMU_TYPE1=-
            CONFIG_TEGRA241_CMDQV=n

For CMA size requirements, the 64K kernel configuration needs 640MB
in the worst-case scenario, while the 4K kernel configuration requires 40MB.
Due to the current CMA alignment requirement of 512MB on the 64k kernel and
128MB on the 4k kernel, use the following defaults:
            For 64k kernel, CONFIG_CMA_SIZE_MBYTES=1024
            For 4k kernel, CONFIG_CMA_SIZE_MBYTES=128

These config options have been defined in debian.master:
            CONFIG_IOMMUFD=m
            CONFIG_IOMMU_IOPF=y

Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(backported from commit 35a55f3 24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
@KobaKoNvidia KobaKoNvidia force-pushed the dgx10901_24.04_linux-nvidia-6.11_gpuPassthroughCudaSupport branch from 5858605 to 3b27282 Compare January 15, 2025 03:35
@KobaKoNvidia (Collaborator, Author) commented:

For the default CMA size:
In cma_declare_contiguous_nid(), the requested size is aligned up to the alignment value.
The alignment is sanitized by taking the maximum of the specified alignment and CMA_MIN_ALIGNMENT_BYTES.
On the current platform, both the alignment and CMA_MIN_ALIGNMENT_BYTES are 512M, so simply doubling CONFIG_CMA_SIZE_MBYTES to 64 does not result in a larger CMA.
This is why I configured it as 1024.
@mm/cma.c, cma_declare_contiguous_nid()
/* Sanitise input arguments. */
alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES)
...
size = ALIGN(size, alignment);
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid(size 67108864, base 0, limit 301e80000000 alignment 0)
Jan 14 06:09:42 localhost kernel: cma: cma_declare_contiguous_nid, new size 536870912, alignment 536870912, CMA_MIN_ALIGNMENT_BYTES 536870912

This does not take into account the page size and will waste memory for the 4k kernel. So we need to set up the annotations to specify different values depending on the page size.

I tried booting the 1005 4k kernel on sj24 (since your kernel was 64k and I did not want to disturb your workflow) and found that it is using 126M of CMA memory. Part of that is to support vcmdq, which your kernel does not support. To find out how much of that is attributed to vcmdq, I rebooted with vcmdq disabled (arm-smmu-v3.disable_cmdqv=y) and found the system to be using 90M of CMA memory. Given that we will be integrating support for vcmdq into the 6.11 tech preview kernel and that 128M is the next pow2 beyond 90M, I think we should use 128M for the 4k kernel.

Therefore, I believe the annotation commit should be amended with this: -CONFIG_CMA_SIZE_MBYTES policy<{'arm64': '1024'}> +CONFIG_CMA_SIZE_MBYTES policy<{'arm64-generic-64k': '1024', 'arm64-generic': '128'}>

Thanks, updated

@nvmochs (Collaborator) left a comment

No further comments from me.

Acked-by: Matthew R. Ochs [email protected]
