feat: add gpu process job#102

Open
lzi-a11y wants to merge 27 commits into
scitix:mainfrom
lzi-a11y:fix/disable-gpuhang-alert
@lzi-a11y
Contributor

feat: add gpu process job

lzi and others added 27 commits April 23, 2026 16:25
In RoCE environments with multi-port NICs (e.g. 8x NICs x 4 ports),
the sequential mlxlink collection timed out before completing, resulting
in 'No transceiver info available'.

- Parallelize all mlxlink/ethtool calls in Collect() using goroutines
  instead of sequential loops, reducing wall time from ~48s to ~22s
- Record PCIe BDF for IB-device primary net interfaces during enumeration
- Skip Ethernet interfaces that share a BDF with an already-collected IB
  interface (filters eth_vf_rep_* duplicates that report the same module)
- Fix RestartSystemdService: use separate contexts for daemon-reload and
  restart, add --no-block to avoid blocking on Type=notify ready signal
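The RestartSystemdService fix in the last bullet can be sketched roughly as below. This is a minimal illustration, not the code from this PR: `runSystemctl`, `restartArgs`, `restartService`, and the unit name `sichek.service` are hypothetical. Each call builds its own timeout context, so a slow daemon-reload cannot consume the restart's deadline.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// restartArgs builds the systemctl arguments for a non-blocking restart.
// --no-block queues the job and returns without waiting for the unit to
// signal readiness, which matters for Type=notify services.
func restartArgs(unit string) []string {
	return []string{"restart", "--no-block", unit}
}

// runSystemctl runs one systemctl invocation under its own fresh context,
// so no two invocations share a deadline.
func runSystemctl(timeout time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, "systemctl", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("systemctl %v: %w (%s)", args, err, out)
	}
	return nil
}

// restartService reloads unit files, then queues the restart.
func restartService(unit string, timeout time.Duration) error {
	if err := runSystemctl(timeout, "daemon-reload"); err != nil {
		return err
	}
	return runSystemctl(timeout, restartArgs(unit)...)
}

func main() {
	// Only print the command here; calling restartService needs systemd.
	fmt.Println("systemctl", restartArgs("sichek.service"))
}
```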

SR-IOV VF Ethernet interfaces (eth_vf_r*) were being enumerated and passed
to mlxlink, where each one took 8-11s to fail; with 8 VFs at ~10s each,
that alone consumed the collection window. The physfn check already used
for IB VFs is now applied to Ethernet interfaces during enumeration as well.

Also cap concurrent mlxlink workers at 8 to reduce PCIe device lock
contention when running alongside the daemon.
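A bounded fan-out of this kind could look like the sketch below. It assumes a buffered channel used as a semaphore; `collectAll` and `probe` are hypothetical names, with `probe` standing in for the per-interface mlxlink/ethtool call. The real Collect() may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// maxWorkers mirrors the cap of 8 concurrent mlxlink workers described above.
const maxWorkers = 8

// collectAll calls probe once per interface with at most maxWorkers calls
// in flight. Interfaces whose probe fails are left out of the result.
func collectAll(ifaces []string, probe func(string) (string, error)) map[string]string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		sem = make(chan struct{}, maxWorkers) // counting semaphore
		out = make(map[string]string, len(ifaces))
	)
	for _, name := range ifaces {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done
			res, err := probe(name)
			if err != nil {
				return
			}
			mu.Lock()
			out[name] = res
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return out
}

func main() {
	// Fake probe for demonstration only; no hardware is touched.
	probe := func(name string) (string, error) { return "link up on " + name, nil }
	fmt.Println(collectAll([]string{"eth0", "eth1"}, probe))
}
```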

Replace hardcoded consts.Red in component PrintInfo with a new
LevelColor(level) helper that maps:
  - LevelWarning              -> Yellow
  - LevelCritical / LevelFatal -> Red
  - other                     -> Green

Applied across cpu, dmesg, ethernet, gpfs, gpuevents, infiniband,
nvidia, pcie/topotest, podlog, syslog, transceiver so warnings no
longer share the same red color as critical/fatal events.
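The mapping above amounts to a small switch. In the sketch below the level strings and ANSI escape values are assumptions made for illustration; the real constants live in sichek's consts package and may differ.

```go
package main

import "fmt"

// Hypothetical level names and ANSI color codes, defined locally so the
// example is self-contained.
const (
	LevelInfo     = "info"
	LevelWarning  = "warning"
	LevelCritical = "critical"
	LevelFatal    = "fatal"

	Green  = "\033[32m"
	Yellow = "\033[33m"
	Red    = "\033[31m"
	Reset  = "\033[0m"
)

// LevelColor maps a severity level to a terminal color:
// warning -> yellow, critical/fatal -> red, anything else -> green.
func LevelColor(level string) string {
	switch level {
	case LevelWarning:
		return Yellow
	case LevelCritical, LevelFatal:
		return Red
	default:
		return Green
	}
}

func main() {
	for _, lvl := range []string{LevelInfo, LevelWarning, LevelCritical, LevelFatal} {
		fmt.Println(LevelColor(lvl) + lvl + Reset)
	}
}
```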

Introduce a separate in-cluster collector application that receives the
latest snapshot.json from each node via HTTP POST and persists one
file per node. The node-side reporter lives as a module in the sichek
daemon; the analysis service (outside the cluster) fetches via SSH/rsync
or optional HTTP GET. Storage is latest-only (no archival), no HA, no DB.

Two plans for the sichek-collector design:

- 2026-04-23-sichek-collector-app.md: standalone new repo. Tasks cover
  scaffold, config, store interface + FS impl, middleware, POST/GET
  handlers, main entry, Docker, K8s manifests, E2E test.

- 2026-04-23-sichek-reporter-module.md: integrates a reporter goroutine
  into the existing sichek daemon. Tasks cover config loader, pushOnce
  with gzip + retry, ticker loop, node-name resolution, DaemonService
  wiring, YAML config update.

Both plans are self-contained with exact file paths and code, suitable
for subagent-driven execution.
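For orientation, the "pushOnce with gzip + retry" task in the reporter plan might reduce to something like the following. This is a hedged sketch only: the signature, the headers, the fixed backoff, and the example payload are assumptions, and the plan documents remain the source of truth.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"net/http"
	"time"
)

// gzipBytes compresses data into a complete gzip stream.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

// pushOnce POSTs the gzipped snapshot, retrying up to attempts times with a
// fixed pause between tries. Any 2xx response counts as success.
func pushOnce(client *http.Client, url string, snapshot []byte, attempts int, backoff time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	body, err := gzipBytes(snapshot)
	if err != nil {
		return err
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Content-Encoding", "gzip")
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode/100 == 2 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(backoff)
		}
	}
	return fmt.Errorf("push failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	// No network call here; just show the compression step on a toy payload.
	raw := []byte(`{"node":"example-node","gpu_devices":[]}`)
	zipped, err := gzipBytes(raw)
	fmt.Println(len(raw), len(zipped), err)
}
```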

GetPCIETreeMin walked the upstream PCIe path and tracked the minimum
width/speed value but threw away which bridge that minimum came from.
The checker then had to fall back to printing the IB device's own BDF
(e.g. mlx5_5(0000:aa:00.0)), which is misleading because the actual
bottleneck is at one of the upstream PCIe bridges.

Return the matching BDF alongside the value, store it on IBHardWareInfo
as PCIETreeWidthMinBDF / PCIETreeSpeedMinBDF, and surface it in the
checker detail/suggestion as "bottleneck@<upstream-bdf>".

Verified on bjg45 (positive: bottleneck@0000:a7:01.0 now shown) and on
lmg86/thg1/clnet36 (healthy baseline still PASS, no regression).
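The idea of returning the bottleneck's BDF together with the minimum can be shown with a small sketch. The `bridge` type, the integer widths, and the third BDF are made up for illustration; the real GetPCIETreeMin reads sysfs and its signature may differ.

```go
package main

import "fmt"

// bridge is one hop on the upstream PCIe path (illustrative type).
type bridge struct {
	BDF   string
	Width int // negotiated link width in lanes
}

// treeMinWidth returns the narrowest width on the path together with the
// BDF of the hop that owns it. ok is false for an empty path. On ties the
// first hop encountered wins.
func treeMinWidth(path []bridge) (width int, bdf string, ok bool) {
	for i, b := range path {
		if i == 0 || b.Width < width {
			width, bdf = b.Width, b.BDF
		}
	}
	return width, bdf, len(path) > 0
}

func main() {
	// Widths are invented; only the first two BDFs appear in the text above.
	path := []bridge{
		{BDF: "0000:aa:00.0", Width: 16}, // the IB device itself
		{BDF: "0000:a7:01.0", Width: 8},  // degraded upstream bridge
		{BDF: "0000:a6:00.0", Width: 16}, // hypothetical further hop
	}
	if w, bdf, ok := treeMinWidth(path); ok {
		fmt.Printf("x%d bottleneck@%s\n", w, bdf)
	}
}
```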

A degraded upstream PCIe path silently caps RDMA throughput well below
the HCA's rated speed/width and cannot safely be ignored, so the two
checks should surface as Critical (cordon-now) rather than Warning
(schedule-for-fix).

Why: zy3 (B300 NVL8 / CX8 RoCE) exposes 12 ports per IB device with the
data path on ports 3/6/9/12 (eth_rX_p0..p3); the legacy ports/1
hard-coding misreports all 8 cards as DOWN.

What:
- spec: add device_ports/default_ports + (*InfinibandSpec).PortsFor; both
  default empty so existing single-port clusters stay on port 1.
- collector: ib_hardware_info / ib_counters take a port arg; Collect()
  emits one record per (ibdev, port) keyed as "<ibdev>/p<port>".  Per-port
  netdev resolved via cached `rdma link` output.
- checkers: per-port reports use "ibdev/pN" labels; per-device checkers
  (fw, ofed, pcie_*, roce) dedupe on hwInfo.IBDev so multi-plane HCAs
  aren't reported four times.
- metrics: gauges gain a "port" label; series cleanup keyed by (dev, port).
- spec yaml: add `zy` cluster (8x roce_rX with [3,6,9,12], 4x mezz on
  port 1) + NVD0000000072 (CX8) and NVD0000000079 (CX7 mezz) HCA specs.
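A possible shape for the port lookup and the record key is sketched below. The struct layout, the yaml tags, the precedence of device_ports over default_ports, and the device name `roce_r0` are all assumptions; only the fallback to port 1 and the "<ibdev>/p<port>" key format come from the description above.

```go
package main

import "fmt"

// InfinibandSpec holds the port configuration (illustrative layout).
type InfinibandSpec struct {
	DevicePorts  map[string][]int `yaml:"device_ports"`
	DefaultPorts []int            `yaml:"default_ports"`
}

// PortsFor returns the ports to probe for ibdev. Assumed precedence: a
// per-device entry, then default_ports, then the legacy single port 1.
func (s *InfinibandSpec) PortsFor(ibdev string) []int {
	if s != nil {
		if ports, ok := s.DevicePorts[ibdev]; ok && len(ports) > 0 {
			return ports
		}
		if len(s.DefaultPorts) > 0 {
			return s.DefaultPorts
		}
	}
	return []int{1}
}

// recordKey builds the per-port key in the "<ibdev>/p<port>" form.
func recordKey(ibdev string, port int) string {
	return fmt.Sprintf("%s/p%d", ibdev, port)
}

func main() {
	spec := &InfinibandSpec{
		DevicePorts: map[string][]int{"roce_r0": {3, 6, 9, 12}},
	}
	for _, dev := range []string{"roce_r0", "mlx5_0"} {
		for _, p := range spec.PortsFor(dev) {
			fmt.Println(recordKey(dev, p))
		}
	}
}
```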

PCIETreeSpeedChecker compared the path-min speed read from sysfs against
hcaSpec.Hardware.PCIESpeed (device link speed) and PCIETreeWidthChecker
likewise used PCIEWidth.  This silently worked while every supported board
had link-speed == tree-speed, but breaks on CX8: the card itself links at
PCIe Gen6 (64 GT/s) while the upstream switch caps the tree at 32 GT/s,
so the checker reports a false positive even with a correct
pcie_tree_speed/pcie_tree_width entry in spec.

Switch the tree checkers to read the dedicated PCIETreeSpeedMin /
PCIETreeWidthMin fields, falling back to the device-level value when the
spec omits them (preserves behaviour for older boards).

Also accept loose number formatting between spec and sysfs ("32" vs
"32.0") via a numeric comparison helper, so spec authors don't have to
mirror sysfs's trailing-zero formatting verbatim.
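The spec fallback and the loose numeric comparison could be written along these lines. `sameNumber` and `expectedTreeValue` are hypothetical helper names, and the sketch assumes any unit suffix has already been stripped from both strings.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sameNumber reports whether two strings denote the same number, so that
// "32" and "32.0" compare equal. If either side does not parse as a number
// it falls back to an exact comparison of the trimmed strings.
func sameNumber(spec, actual string) bool {
	s, a := strings.TrimSpace(spec), strings.TrimSpace(actual)
	x, errX := strconv.ParseFloat(s, 64)
	y, errY := strconv.ParseFloat(a, 64)
	if errX != nil || errY != nil {
		return s == a
	}
	return x == y
}

// expectedTreeValue prefers the tree-level spec entry and falls back to the
// device-level value when the spec omits it.
func expectedTreeValue(treeMin, deviceLevel string) string {
	if strings.TrimSpace(treeMin) != "" {
		return treeMin
	}
	return deviceLevel
}

func main() {
	// CX8-like case: device links at 64, tree is capped at 32.
	fmt.Println(sameNumber(expectedTreeValue("32", "64"), "32.0"))
	// Older board: no tree entry, device-level value is used.
	fmt.Println(sameNumber(expectedTreeValue("", "32"), "32.0"))
}
```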

Update the NVD0000000072 (CX8) entry to reflect reality:
pcie_speed=64 GT/s, pcie_tree_speed=32.

Two regressions surfaced while running the multi-plane build through
field testing on a node with no IB hardware (cl-nctl01 / clnet35):

1. NewInfinibandComponent's user-config fallback path created a default
   cfg but skipped the cache allocation, so cacheSize stayed 0 and the
   first LastInfo() call indexed cacheInfo[-1] → panic. Allocate the
   buffers in that branch so the daemon can keep reporting initError
   instead of crashing.

2. PrintInfo asserted info.(*InfinibandInfo) unconditionally and printed
   the opaque "invalid data type" line on the init-error path (info is
   nil because no health check has cached anything yet). Print the
   captured CheckerResults instead so operators see why initialization
   failed (missing spec, no IB hardware, etc.) at a glance.
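Both failure modes can be reproduced in miniature. The sketch below is not the component's code: `ring`, `describe`, and the stand-in `InfinibandInfo` type are hypothetical. It shows a lookup that tolerates an unallocated cache and a comma-ok type assertion that falls back to the init error when info is nil.

```go
package main

import "fmt"

// InfinibandInfo stands in for the real collected-info type.
type InfinibandInfo struct{ Devices int }

// ring is a hypothetical fixed-size cache of the most recent results.
type ring struct {
	items []any
	next  int // slot the next put will write to
}

func newRing(size int) *ring { return &ring{items: make([]any, size)} }

// put stores v, overwriting the oldest entry once the ring is full.
func (r *ring) put(v any) {
	if len(r.items) == 0 {
		return
	}
	r.items[r.next] = v
	r.next = (r.next + 1) % len(r.items)
}

// last returns the newest entry, or nil when nothing has been cached or the
// buffer was never allocated, instead of indexing out of range.
func (r *ring) last() any {
	n := len(r.items)
	if n == 0 {
		return nil
	}
	return r.items[(r.next-1+n)%n]
}

// describe uses a comma-ok assertion so a nil or unexpected info value
// surfaces the init error rather than an opaque type message.
func describe(info any, initErr string) string {
	ib, ok := info.(*InfinibandInfo)
	if !ok || ib == nil {
		return "init failed: " + initErr
	}
	return fmt.Sprintf("%d IB devices", ib.Devices)
}

func main() {
	empty := newRing(0)
	fmt.Println(describe(empty.last(), "no IB hardware"))

	r := newRing(4)
	r.put(&InfinibandInfo{Devices: 8})
	fmt.Println(describe(r.last(), ""))
}
```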

The collector unconditionally skips IB devices whose name contains
"mezz" (`infiniband_info.go::Collect`), so listing them under
`zy.ib_devs` had no effect on the per-port hwInfo records produced
for the zy cluster. The mezz board id (NVD0000000079) is still
required in the top-level `hca:` map so that spec validation passes;
that part is unchanged.

# Conflicts:
#	components/infiniband/checker/pcie_tree_speed.go

GPUHang has multiple known false-positive sources (pviol thermal-vs-power
bug, rxpci/txpci delta semantics, strict 8/8 AND counter reset on any
indicator dip). Mute via ignored_checkers until the rule is reworked.
See docs/gpu-hang-detection-summary.md for the alignment notes.

device.GetComputeRunningProcesses was already called in DeviceInfo.Get
but only its length was kept (as NProcess). Capture the full list now:
PID, process name (/proc/<pid>/comm, silent empty on failure), and GPU
memory in MiB. NProcess is derived from len(Processes) so its meaning
is unchanged. Field appears in snapshot.json under
gpu_devices[].compute_processes and in the reporter payload automatically
since NvidiaInfo is JSON-marshaled raw. Field-tested on clnet36 (8x H20)
with vLLM workers visible as VLLM::Worker_TP at ~92 GiB each.
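The per-process record could be assembled roughly as follows. This is a sketch under assumptions: the `ComputeProcess` struct, its JSON field names, and the helper names are hypothetical, and the NVML call that supplies the PID and memory in bytes is left out.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// ComputeProcess is an illustrative shape for one entry under
// gpu_devices[].compute_processes; the JSON tags are assumptions.
type ComputeProcess struct {
	PID           uint32 `json:"pid"`
	Name          string `json:"name"`
	UsedMemoryMiB uint64 `json:"used_memory_mib"`
}

// processName reads /proc/<pid>/comm and returns an empty string on any
// failure (process already gone, non-Linux host, permission denied).
func processName(pid int) string {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", pid))
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

// bytesToMiB converts a byte count to whole MiB, rounding down.
func bytesToMiB(b uint64) uint64 { return b >> 20 }

// newComputeProcess combines a PID and a memory figure in bytes.
func newComputeProcess(pid uint32, usedBytes uint64) ComputeProcess {
	return ComputeProcess{
		PID:           pid,
		Name:          processName(int(pid)),
		UsedMemoryMiB: bytesToMiB(usedBytes),
	}
}

func main() {
	// Use this process as a stand-in for a GPU worker holding 92 GiB.
	fmt.Printf("%+v\n", newComputeProcess(uint32(os.Getpid()), 92<<30))
}
```

With a list of such records, NProcess is simply the length of the slice, which keeps its meaning unchanged.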