feat: add gpu process job by lzi-a11y · Pull Request #102 · scitix/sichek

lzi-a11y · 2026-05-21T09:35:45Z

feat: add gpu process job

In RoCE environments with multi-port NICs (e.g. 8x NICs x 4 ports), the sequential mlxlink collection timed out before completing, resulting in 'No transceiver info available'.
- Parallelize all mlxlink/ethtool calls in Collect() using goroutines instead of sequential loops, reducing wall time from ~48s to ~22s
- Record PCIe BDF for IB-device primary net interfaces during enumeration
- Skip Ethernet interfaces that share a BDF with an already-collected IB interface (filters eth_vf_rep_* duplicates that report the same module)
- Fix RestartSystemdService: use separate contexts for daemon-reload and restart, add --no-block to avoid blocking on Type=notify ready signal

eth_vf_r* SR-IOV VF ethernet interfaces were being enumerated and passed to mlxlink, which took 8-11s to fail per interface (8 VFs x ~10s blocked the collection window). The same physfn check already used for IB VFs is now applied to ethernet interfaces during enumeration. Also cap concurrent mlxlink workers at 8 to reduce PCIe device lock contention when running alongside the daemon.
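The bounded fan-out described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `collectAll` and the `collect` callback are hypothetical names, and the callback stands in for whatever wraps the real mlxlink/ethtool invocation. Only the cap of 8 workers comes from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// maxWorkers caps concurrent collectors; the description above uses 8 for mlxlink.
const maxWorkers = 8

// collectAll (hypothetical helper) calls collect once per interface with at
// most maxWorkers calls in flight and returns the successful results by name.
func collectAll(ifaces []string, collect func(string) (string, error)) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(ifaces))
		sem     = make(chan struct{}, maxWorkers) // counting semaphore
	)
	for _, name := range ifaces {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done
			out, err := collect(name)
			if err != nil {
				return // assumption: a failing interface is simply skipped
			}
			mu.Lock()
			results[name] = out
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	// fake stands in for the real per-interface tool invocation.
	fake := func(name string) (string, error) { return "info for " + name, nil }
	fmt.Println(len(collectAll([]string{"eth0", "eth1", "eth2"}, fake)))
}
```

A buffered channel used as a semaphore keeps the cap explicit without a dedicated worker pool, and the mutex is needed because Go maps are not safe for concurrent writes.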

Replace hardcoded consts.Red in component PrintInfo with a new LevelColor(level) helper that maps:
- LevelWarning -> Yellow
- LevelCritical / LevelFatal -> Red
- other -> Green
Applied across cpu, dmesg, ethernet, gpfs, gpuevents, infiniband, nvidia, pcie/topotest, podlog, syslog, transceiver so warnings no longer share the same red color as critical/fatal events.
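The mapping above is small enough to show in full. In this sketch the level strings and ANSI escape values are assumptions standing in for whatever the real consts package defines; only the level-to-color mapping is taken from the description.

```go
package main

import "fmt"

// Hypothetical stand-ins for the project's consts package.
const (
	LevelInfo     = "info"
	LevelWarning  = "warning"
	LevelCritical = "critical"
	LevelFatal    = "fatal"

	Red    = "\033[31m"
	Green  = "\033[32m"
	Yellow = "\033[33m"
	Reset  = "\033[0m"
)

// LevelColor maps an event level to a terminal color:
// warning -> yellow, critical/fatal -> red, anything else -> green.
func LevelColor(level string) string {
	switch level {
	case LevelWarning:
		return Yellow
	case LevelCritical, LevelFatal:
		return Red
	default:
		return Green
	}
}

func main() {
	for _, lvl := range []string{LevelInfo, LevelWarning, LevelCritical, LevelFatal} {
		fmt.Printf("%s%s%s\n", LevelColor(lvl), lvl, Reset)
	}
}
```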

Introduce a separate in-cluster collector application that receives latest snapshot.json from each node via HTTP POST and persists one file per node. Node-side reporter lives as a module in sichek daemon; analysis service (outside cluster) fetches via SSH/rsync or optional HTTP GET. Storage is latest-only (no archival), no HA, no DB.

Two plans for the sichek-collector design:
- 2026-04-23-sichek-collector-app.md: standalone new repo. Tasks cover scaffold, config, store interface + FS impl, middleware, POST/GET handlers, main entry, Docker, K8s manifests, E2E test.
- 2026-04-23-sichek-reporter-module.md: integrates a reporter goroutine into the existing sichek daemon. Tasks cover config loader, pushOnce with gzip + retry, ticker loop, node-name resolution, DaemonService wiring, YAML config update.
Both plans are self-contained with exact file paths and code, suitable for subagent-driven execution.
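To make the "pushOnce with gzip + retry" step concrete, here is one way it could look. Everything in this sketch is an assumption: the helper names, the fixed three attempts with a one-second delay, and the headers are illustrative and not taken from the plan documents.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"time"
)

// gzipBytes compresses data with gzip.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipBytes reverses gzipBytes; a collector would do this on receipt.
func gunzipBytes(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// retry calls fn up to attempts times, sleeping delay between failures.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

// pushOnce (hypothetical shape) POSTs a gzip-compressed snapshot.
func pushOnce(client *http.Client, url string, snapshot []byte) error {
	body, err := gzipBytes(snapshot)
	if err != nil {
		return err
	}
	return retry(3, time.Second, func() error {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Content-Encoding", "gzip")
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("collector returned %s", resp.Status)
		}
		return nil
	})
}

func main() {
	z, err := gzipBytes([]byte(`{"node":"node-1"}`))
	if err != nil {
		panic(err)
	}
	raw, err := gunzipBytes(z)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

Building a fresh request on every attempt matters here: a request body reader is consumed by the first send, so reusing the same request across retries would post an empty body.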

GetPCIETreeMin walked the upstream PCIe path and tracked the minimum width/speed value but threw away which bridge that minimum came from. The checker then had to fall back to printing the IB device's own BDF (e.g. mlx5_5(0000:aa:00.0)), which is misleading because the actual bottleneck is at one of the upstream PCIe bridges. Return the matching BDF alongside the value, store it on IBHardWareInfo as PCIETreeWidthMinBDF / PCIETreeSpeedMinBDF, and surface it in the checker detail/suggestion as "bottleneck@<upstream-bdf>". Verified on bjg45 (positive: bottleneck@0000:a7:01.0 now shown) and on lmg86/thg1/clnet36 (healthy baseline still PASS, no regression).
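The change amounts to returning the location of the minimum along with its value. A minimal sketch, assuming a hypothetical `bridge` type and an already-resolved upstream path (the real function reads sysfs, which is omitted here):

```go
package main

import "fmt"

// bridge is a hypothetical record for one hop on the upstream PCIe path.
type bridge struct {
	BDF   string
	Width int // negotiated link width, e.g. 16 for x16
}

// treeMinWidth returns the smallest width on the path and the BDF of the
// hop that reported it. An empty path yields (0, "").
func treeMinWidth(path []bridge) (minWidth int, bdf string) {
	for i, b := range path {
		if i == 0 || b.Width < minWidth {
			minWidth, bdf = b.Width, b.BDF
		}
	}
	return minWidth, bdf
}

func main() {
	path := []bridge{
		{BDF: "0000:aa:00.0", Width: 16}, // the IB device itself
		{BDF: "0000:a7:01.0", Width: 8},  // degraded upstream bridge
		{BDF: "0000:a6:00.0", Width: 16},
	}
	w, bdf := treeMinWidth(path)
	fmt.Printf("x%d bottleneck@%s\n", w, bdf)
}
```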

A degraded upstream PCIe path silently caps RDMA throughput well below the HCA's rated speed/width and is not safely ignorable, so the two checks should surface as Critical (cordon-now) rather than Warning (schedule-for-fix).

Why: zy3 (B300 NVL8 / CX8 RoCE) exposes 12 ports per IB device with the data path on ports 3/6/9/12 (eth_rX_p0..p3); the legacy ports/1 hard-coding misreports all 8 cards as DOWN.
What:
- spec: add device_ports/default_ports + (*InfinibandSpec).PortsFor; both default empty so existing single-port clusters stay on port 1.
- collector: ib_hardware_info / ib_counters take a port arg; Collect() emits one record per (ibdev, port) keyed as "<ibdev>/p<port>". Per-port netdev resolved via cached `rdma link` output.
- checkers: per-port reports use "ibdev/pN" labels; per-device checkers (fw, ofed, pcie_*, roce) dedupe on hwInfo.IBDev so multi-plane HCAs aren't reported four times.
- metrics: gauges gain a "port" label; series cleanup keyed by (dev, port).
- spec yaml: add `zy` cluster (8x roce_rX with [3,6,9,12], 4x mezz on port 1) + NVD0000000072 (CX8) and NVD0000000079 (CX7 mezz) HCA specs.
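The port lookup and record key can be illustrated as below. The struct layout and the precedence (per-device entry first, then default_ports, then port 1) are assumptions; the description only states that both fields default to empty and that empty means port 1.

```go
package main

import "fmt"

// InfinibandSpec here is a trimmed, hypothetical version of the real spec type.
type InfinibandSpec struct {
	DevicePorts  map[string][]int // device_ports: per-device override
	DefaultPorts []int            // default_ports: cluster-wide default
}

// PortsFor returns the ports to inspect for ibdev. With both fields empty it
// falls back to port 1, so single-port clusters behave as before.
func (s *InfinibandSpec) PortsFor(ibdev string) []int {
	if s != nil {
		if ports := s.DevicePorts[ibdev]; len(ports) > 0 {
			return ports
		}
		if len(s.DefaultPorts) > 0 {
			return s.DefaultPorts
		}
	}
	return []int{1}
}

// recordKey builds the "<ibdev>/p<port>" key used for per-port records.
func recordKey(ibdev string, port int) string {
	return fmt.Sprintf("%s/p%d", ibdev, port)
}

func main() {
	spec := &InfinibandSpec{DevicePorts: map[string][]int{"roce_r0": {3, 6, 9, 12}}}
	for _, dev := range []string{"roce_r0", "mlx5_0"} {
		for _, p := range spec.PortsFor(dev) {
			fmt.Println(recordKey(dev, p))
		}
	}
}
```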

PCIETreeSpeedChecker compared the path-min speed read from sysfs against hcaSpec.Hardware.PCIESpeed (device link speed) and PCIETreeWidthChecker likewise used PCIEWidth. This silently worked while every supported board had link-speed == tree-speed, but breaks on CX8: the card itself links at PCIe Gen6 (64 GT/s) while the upstream switch caps the tree at 32 GT/s, so the checker reports a false positive even with a correct pcie_tree_speed/pcie_tree_width entry in spec. Switch the tree checkers to read the dedicated PCIETreeSpeedMin / PCIETreeWidthMin fields, falling back to the device-level value when the spec omits them (preserves behaviour for older boards). Also accept loose number formatting between spec and sysfs ("32" vs "32.0") via a numeric comparison helper, so spec authors don't have to mirror sysfs's trailing-zero formatting verbatim. Update the NVD0000000072 (CX8) entry to reflect reality: pcie_speed=64 GT/s, pcie_tree_speed=32.
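The two behaviours described above, the fallback to the device-level value and the loose numeric comparison, can be sketched like this. The function names are hypothetical and the string-typed values are an assumption about how spec and sysfs values are carried.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sameNumber reports whether two values are numerically equal, so "32" and
// "32.0" match. If either side is not a number it falls back to comparing
// the trimmed strings.
func sameNumber(a, b string) bool {
	a, b = strings.TrimSpace(a), strings.TrimSpace(b)
	fa, errA := strconv.ParseFloat(a, 64)
	fb, errB := strconv.ParseFloat(b, 64)
	if errA != nil || errB != nil {
		return a == b
	}
	return fa == fb
}

// expectedTreeSpeed prefers the dedicated tree-level spec value and falls
// back to the device-level one when the spec omits it.
func expectedTreeSpeed(treeSpeedMin, deviceSpeed string) string {
	if treeSpeedMin != "" {
		return treeSpeedMin
	}
	return deviceSpeed
}

func main() {
	// CX8-like case: device links at 64 GT/s, tree is capped at 32 GT/s.
	want := expectedTreeSpeed("32", "64")
	fmt.Println(sameNumber(want, "32.0"))
}
```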

Two regressions surfaced while running the multi-plane build through field testing on a node with no IB hardware (cl-nctl01 / clnet35):
1. NewInfinibandComponent's user-config fallback path created a default cfg but skipped the cache allocation, so cacheSize stayed 0 and the first LastInfo() call indexed cacheInfo[-1] → panic. Allocate the buffers in that branch so the daemon can keep reporting initError instead of crashing.
2. PrintInfo asserted info.(*InfinibandInfo) unconditionally and printed the opaque "invalid data type" line on the init-error path (info is nil because no health check has cached anything yet). Print the captured CheckerResults instead so operators see why initialization failed (missing spec, no IB hardware, etc.) at a glance.

The collector unconditionally skips IB devices whose name contains "mezz" (`infiniband_info.go::Collect`), so listing them under `zy.ib_devs` had no effect on the per-port hwInfo records produced for the zy cluster. The mezz board id (NVD0000000079) is still required in the top-level `hca:` map so spec validation passes — that part is unchanged.


GPUHang has multiple known false-positive sources (pviol thermal-vs-power bug, rxpci/txpci delta semantics, strict 8/8 AND counter reset on any indicator dip). Mute via ignored_checkers until the rule is reworked. See docs/gpu-hang-detection-summary.md for the alignment notes.

device.GetComputeRunningProcesses was already called in DeviceInfo.Get but only its length was kept (as NProcess). Capture the full list now: PID, process name (/proc/<pid>/comm, silent empty on failure), and GPU memory in MiB. NProcess is derived from len(Processes) so its meaning is unchanged. Field appears in snapshot.json under gpu_devices[].compute_processes and in the reporter payload automatically since NvidiaInfo is JSON-marshaled raw. Field-tested on clnet36 (8x H20) with vLLM workers visible as VLLM::Worker_TP at ~92 GiB each.
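A sketch of the per-process record and the name lookup follows. The struct name, JSON tags, and the `procRoot` parameter are assumptions for illustration; the NVML call itself is left out, and only the /proc/<pid>/comm lookup with an empty result on failure and the MiB unit come from the description.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// ComputeProcess is a hypothetical shape for one entry in compute_processes.
type ComputeProcess struct {
	PID       uint32 `json:"pid"`
	Name      string `json:"name"`
	MemoryMiB uint64 `json:"memory_mib"`
}

// processName reads <procRoot>/<pid>/comm and returns "" on any failure,
// for example when the process has already exited.
func processName(procRoot string, pid uint32) string {
	path := filepath.Join(procRoot, strconv.FormatUint(uint64(pid), 10), "comm")
	data, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

// bytesToMiB converts a byte count, as reported by the driver, to whole MiB.
func bytesToMiB(b uint64) uint64 {
	return b / (1024 * 1024)
}

func main() {
	pid := uint32(os.Getpid())
	p := ComputeProcess{
		PID:       pid,
		Name:      processName("/proc", pid),
		MemoryMiB: bytesToMiB(92 << 30), // example value only
	}
	fmt.Printf("%+v\n", p)
}
```

Passing the proc root as a parameter is a choice made for this sketch so the lookup can be exercised against a temporary directory; the real code may hard-code /proc.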

lzi and others added 27 commits April 23, 2026 16:25

feat(service): reporter config loader with defaults

5fc3304

feat(service): Reporter.pushOnce with gzip + retry

eefd0e9

feat(service): Reporter.Run periodic loop with panic recover

604d4f8

feat(service): ResolveNodeName prefers NODE_NAME env

ac6bc83

feat(service): wire Reporter into DaemonService lifecycle

8eecbc1

feat(config): add reporter block (disabled by default)

1bfc9a7

Merge origin/feat/roce-multiplane into feat/roce-multiplane-bundle

2dc1f33

Merge origin/fix/ib-pcie-tree-show-upstream-bdf into feat/roce-multip…

621f965

…lane-bundle

Merge origin/feat/sichek-collector into feat/roce-multiplane-bundle

4b64bab

Merge origin/feat/alert-color-by-level into feat/roce-multiplane-bundle

c3fc1f3

Merge origin/fix/transceiver-roce-concurrent-collect into feat/roce-m…

4c931d0

…ultiplane-bundle

Merge origin/main (sichek config sync) into feat/roce-multiplane-bundle

d2a69fe

# Conflicts: # components/infiniband/checker/pcie_tree_speed.go

Merge branch 'scitix:main' into main

128c3f0

Merge branch 'scitix:main' into main

b39de21

feat: add gpu process job #102
lzi-a11y wants to merge 27 commits into scitix:main from lzi-a11y:fix/disable-gpuhang-alert
