Software Performance Optimizations for F2.48xlarge Instances

This guide outlines strategies for maximizing performance on f2.48xlarge instances through effective CPU-to-FPGA mapping and NUMA optimization. In dual-socket configurations, implementing NUMA-aware techniques is essential for minimizing latency and maximizing PCIe bandwidth between CPUs and FPGA accelerators. The optimizations within this document do not apply to f2.6xlarge and f2.12xlarge instances because all CPU resources, memory, NVMe devices, and FPGA devices are on a single NUMA node.

Quick Start Guide

To optimize your application's performance on an f2.48xlarge instance, refer to the Script to Construct an FPGA to NUMA Node and vCPU Mapping or follow the steps below (a combined example follows the list):

  1. Determine the FPGA slot numbers using fpga-describe-local-image
  2. Locate the optimal vCPUs for your slot in the mapping table
  3. Apply CPU pinning using either:
    • numactl --localalloc --physcpubind <vCPU list> <bash command> command
    • Application-specific CPU affinity settings
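
As a minimal combined sketch of the three steps, assuming the AFI of interest is loaded in slot 0 and where ./my_fpga_app is a placeholder for your own binary and arguments:

```bash
# Step 1: confirm the slot and its PCI address (BDF)
sudo fpga-describe-local-image -S 0 -H

# Step 2: slot 0 sits on NUMA node 1; its optimal vCPUs are 48-55,144-151
#         (see the mapping table later in this guide)

# Step 3: pin the application to those vCPUs and allocate memory locally
numactl --localalloc --physcpubind 48-55,144-151 ./my_fpga_app
```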

F2 Instance Overview

The f2.48xlarge instance consists of 2 AMD Milan CPUs in a dual-socket configuration with 192 vCPUs, 2,048 GiB (2 TiB) of memory, 7,600 GiB across 8 NVMe SSDs, and 8 FPGAs. Each socket directly connects to 96 vCPUs, 1 TiB of memory, 3,800 GiB across 4 NVMe SSDs, and 4 FPGAs. The networking interface directly connects to CPU socket 0.

Linux System Tools

The following Linux tools help verify system configuration:

| Tool | Purpose | Installation/Usage |
| --- | --- | --- |
| lspci -tv | View PCI topology | Built-in |
| lstopo | Visualize system topology | sudo apt install hwloc |
| lscpu -e | View CPU/NUMA mapping | Built-in |
| numactl -H | Show NUMA configuration | Built-in |

Note: For FPGA visibility, use lstopo --whole-io or lstopo-no-graphics --whole-io
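
As a quick sanity check (a sketch; it relies on the FPGA vendor/device ID 0x1d0f/0x9048 shown later in this guide by fpga-describe-local-image), these tools can be used as follows:

```bash
# List only the FPGA PCI functions (vendor 1d0f, device 9048)
lspci -d 1d0f:9048

# Per-vCPU NUMA node assignment
lscpu -e=CPU,NODE

# NUMA node summary: vCPUs, memory, and inter-node distances
numactl -H
```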

NUMA Best Practices for F2 Instances

NUMA (Non-Uniform Memory Access) is a computer memory design where memory access time depends on the memory location relative to a processor. In a NUMA system, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The f2.48xlarge instance has two NUMA nodes, each associated with a CPU socket and all colocated devices:

  • Each NUMA node contains:
    • 96 vCPUs
    • 1 TiB of local memory
    • 4 FPGAs
    • 4 NVMe drives
  • Memory access characteristics:
    • Local memory access (same NUMA node) results in the lowest latency
    • Remote memory access (different NUMA node) results in higher latency

Why NUMA Matters

NUMA awareness is crucial for performance because:

  1. Local memory access is significantly faster than remote access
  2. PCIe devices (like FPGAs) perform best when the controlling process runs on CPUs in the same NUMA node
  3. Memory bandwidth is higher for local access
  4. Improper NUMA alignment can cause significant performance degradation

This is why the vCPU to FPGA mapping table in this guide is important: it ensures your application uses the optimal CPU cores for each FPGA device.
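
To see the local-versus-remote difference in practice, a quick experiment (a sketch; ./membench stands in for any memory-bound benchmark of your choice) is to run the same workload with its memory bound first to the local node and then to the remote node and compare the results:

```bash
# Local: run on node 1 vCPUs with memory allocated on node 1
numactl --cpunodebind=1 --membind=1 ./membench

# Remote: run on node 1 vCPUs with memory allocated on node 0
# Expect higher latency and lower bandwidth
numactl --cpunodebind=1 --membind=0 ./membench
```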

Identifying the CPU <-> FPGA NUMA Mapping

FPGA devices are assigned bus:device:function (BDF) addresses in slot order: on an f2.48xlarge instance, the lowest BDF hex value corresponds to slot 0 and the highest to slot 7. The fpga-describe-local-image command displays this mapping:

```bash
$ sudo fpga-describe-local-image -S 0 -H
Type  FpgaImageSlot  FpgaImageId             StatusName    StatusCode   ErrorName    ErrorCode   ShVersion
AFI          0       No AFI                  cleared           1        ok               0       0x10162423
Type  FpgaImageSlot  VendorId    DeviceId    DBDF
AFIDEVICE    0       0x1d0f      0x9048      0000:9f:00.0
```

The NUMA node for this device can be found in the Linux PCI hierarchy:

```bash
$ cat /sys/bus/pci/devices/0000\:9f\:00.0/numa_node
1
```

The vCPU NUMA node mappings can be found with numactl -H. An f2.48xlarge instance would display the following:

```bash
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 1023962 MB
node 0 free: 1021988 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 1 size: 1023981 MB
node 1 free: 1022619 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10
```
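
Combining the two lookups above (a sketch; it matches on the vendor/device ID 0x1d0f/0x9048 shown earlier), the NUMA node of every FPGA can be listed in one pass:

```bash
# Print each FPGA's PCI address (BDF) and the NUMA node it attaches to
for dev in /sys/bus/pci/devices/*; do
    if [ "$(cat "$dev/vendor")" = "0x1d0f" ] && [ "$(cat "$dev/device")" = "0x9048" ]; then
        echo "$(basename "$dev") -> NUMA node $(cat "$dev/numa_node")"
    fi
done
```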

Ideal vCPU <-> FPGA Mapping for Optimal PCIe Performance

Processes can be pinned to particular vCPUs using Linux tools such as numactl. In the table below, "Optimal vCPUs" lists the 16 vCPUs (which share an L3 cache) that are optimal for each FPGA slot on an f2.48xlarge instance.

| FPGA Slot # | NUMA Node | Optimal vCPUs | Example of numactl command | Colocated vCPUs |
| --- | --- | --- | --- | --- |
| 0 | 1 | 48-55, 144-151 | numactl --localalloc --physcpubind 48-55,144-151 | 48-95, 144-191 |
| 1 | 1 | 56-63, 152-159 | numactl --localalloc --physcpubind 56-63,152-159 | 48-95, 144-191 |
| 2 | 1 | 64-71, 160-167 | numactl --localalloc --physcpubind 64-71,160-167 | 48-95, 144-191 |
| 3 | 1 | 72-79, 168-175 | numactl --localalloc --physcpubind 72-79,168-175 | 48-95, 144-191 |
| 4 | 0 | 0-7, 96-103 | numactl --localalloc --physcpubind 0-7,96-103 | 0-47, 96-143 |
| 5 | 0 | 8-15, 104-111 | numactl --localalloc --physcpubind 8-15,104-111 | 0-47, 96-143 |
| 6 | 0 | 16-23, 112-119 | numactl --localalloc --physcpubind 16-23,112-119 | 0-47, 96-143 |
| 7 | 0 | 24-31, 120-127 | numactl --localalloc --physcpubind 24-31,120-127 | 0-47, 96-143 |

NOTE: In place of the --physcpubind <vCPU list> argument, you can also pass --cpunodebind <NUMA node ID> to bind the process to all vCPUs on that NUMA node.
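
For example (a sketch; ./my_fpga_app is a placeholder for your own binary and arguments), a process driving the FPGA in slot 4 could be pinned either way:

```bash
# Tightest pinning: the 16 optimal vCPUs for slot 4 (they share an L3 cache)
numactl --localalloc --physcpubind 0-7,96-103 ./my_fpga_app

# Looser pinning: any vCPU on NUMA node 0, still with node-local memory
numactl --localalloc --cpunodebind 0 ./my_fpga_app
```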

Script to Construct an FPGA to NUMA Node and vCPU Mapping

Execute the following bash command to construct a table that maps the FPGA devices to their NUMA node and colocated vCPUs:

```bash
(
    printf "%-8s %-11s %-11s %-13s %-10s %s\n" "SLOT" "VENDOR_ID" "DEVICE_ID" "BDF" "NUMA_NODE" "vCPUs_(Physical,Virtual)"
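    # fpga-describe-local-image-slots prints one AFIDEVICE line per slot:
    # type, slot number, vendor ID, device ID, and DBDF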
    sudo fpga-describe-local-image-slots | while read -r dev slot vendor device bdf; do
        numa_node=$(sudo cat /sys/bus/pci/devices/$bdf/numa_node)
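        # Select the vCPUs whose NUMA node matches this device's node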
        vcpus=$(lscpu -p=CPU,NODE | grep "^[0-9]*,$numa_node$" | cut -d',' -f1)

        # Organize CPUs into ranges
        physical_cpus=$(echo "$vcpus" | awk -v ORS='' '
            function print_range(start, end) {
                if (start == end) return start;
                return start "-" end;
            }
            NR==1 {start=end=$1; prev=$1; next}
            {
                if ($1 != prev+1) {
                    printf "%s,", print_range(start, end);
                    start=$1;
                }
                end=$1;
                prev=$1;
            }
            END {printf "%s", print_range(start, end)}
        ')

        printf "%-8s %-11s %-11s %-13s %-10s %s\n" \
            "$slot" "$vendor" "$device" "$bdf" "$numa_node" "$physical_cpus"
    done
) | column -t
```

Sample output from an f2.48xlarge instance:

```
SLOT  VENDOR_ID  DEVICE_ID  BDF           NUMA_NODE  vCPUs_(Physical,Virtual)
0     0x1d0f     0x9048     0000:9f:00.0  1          48-95,144-191
1     0x1d0f     0x9048     0000:a1:00.0  1          48-95,144-191
2     0x1d0f     0x9048     0000:a3:00.0  1          48-95,144-191
3     0x1d0f     0x9048     0000:a5:00.0  1          48-95,144-191
4     0x1d0f     0x9048     0000:ae:00.0  0          0-47,96-143
5     0x1d0f     0x9048     0000:b0:00.0  0          0-47,96-143
6     0x1d0f     0x9048     0000:b2:00.0  0          0-47,96-143
7     0x1d0f     0x9048     0000:b4:00.0  0          0-47,96-143
```

Frequently Asked Questions (FAQ)

How can I investigate system performance issues?

To investigate low performance (high latency or decreased bandwidth)

  • Verify the process accessing the FPGA is being served by the expected vCPU with top -H -p <pid> (see the sketch after this list)
  • Check the NUMA alignment by hand with numactl --hardware
  • Monitor the PCIe traffic on AMD processors with tools such as amd-uprof
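
A minimal sketch of those checks, assuming the PID of the process accessing the FPGA is in the shell variable PID:

```bash
PID=12345   # placeholder: PID of the process accessing the FPGA

# PSR column = the vCPU each thread of the process last ran on
ps -L -o pid,lwp,psr,comm -p "$PID"

# CPU affinity list actually applied to the process
taskset -cp "$PID"

# Re-check vCPU and NUMA node boundaries and inter-node distances
numactl --hardware
```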

To investigate inconsistent performance (latency or bandwidth spikes)

  • Ensure no other processes are using the same vCPUs with tools such as htop
  • Monitor system resources with sar
  • Verify memory allocation with numastat (see the sketch after this list)
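
A minimal sketch of those checks, again with the PID of interest in the shell variable PID:

```bash
PID=12345   # placeholder: PID of the process accessing the FPGA

# Per-NUMA-node memory usage of the process; it should sit on the FPGA's node
numastat -p "$PID"

# System-wide NUMA hit/miss counters per node
numastat

# CPU utilization and memory activity, sampled every second for 10 samples
sar -u 1 10
sar -r 1 10
```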

Where can I reach out for additional help?

For any issues with the devkit documentation or code, please open a GitHub issue with all steps to reproduce.

For questions about F2 instances, please open a re:Post issue with the 'FPGA Development' tag.