refactor(discovery): add NodeDiscovery interface and consolidate node listing by ev-shindin · Pull Request #1013 · llm-d/llm-d-workload-variant-autoscaler

ev-shindin · 2026-04-15T09:12:40Z

Summary

Pure refactor of internal/discovery/k8s_with_gpu_operator.go — no behavior change for existing callers. Extracts a single internal node-listing helper, adds a new NodeDiscovery interface that exposes per-node labels, and re-implements the existing Discover and discoverNodeGPUTypes methods as projections over the new helper.

This unblocks upcoming label-aware features (namespace-scoped limiter, node-aware bin packing) that need node.Labels alongside accelerator capacity, without forcing a third near-identical vendor-loop node query.

Motivation

K8sWithGpuOperator currently has two near-identical vendor-loop node queries:

Discover() (lines 33-99) — builds map[nodeName]map[model]AcceleratorModelInfo
discoverNodeGPUTypes() (lines 147-188) — builds map[nodeName]string

Both loop over vendors, build the same selector, list nodes, iterate results, and differ only in the projection. Upcoming work (namespace-scoped limiter, epic limiter improvements) needs a third projection — per-node info including node.Labels. Adding a third independent vendor-loop would lock in the duplication pattern. This PR consolidates node discovery into a single primitive with multiple public projections.
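The "single primitive, multiple projections" shape described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: gpuNode, accelerator, inventory, and gpuTypes are hypothetical names, the listing helper returns fixed data instead of querying the cluster, and the real code works with corev1.Node and AcceleratorModelInfo.

```go
package main

import "fmt"

// accelerator is a hypothetical stand-in for one accelerator entry on a node.
type accelerator struct {
	Model string
	Count int
}

// gpuNode is a hypothetical canonical per-node view. Accelerators are kept
// in discovery order (vendor iteration order).
type gpuNode struct {
	Name         string
	Accelerators []accelerator
}

// listGPUNodes stands in for the single shared node query. It returns fixed
// sample data here; the real helper would list nodes from the cluster.
func listGPUNodes() []gpuNode {
	return []gpuNode{
		{Name: "node-a", Accelerators: []accelerator{{Model: "nvidia-model", Count: 8}}},
		{Name: "node-b", Accelerators: []accelerator{
			{Model: "amd-model", Count: 4},
			{Model: "intel-model", Count: 2},
		}},
	}
}

// inventory projects the node list into node name -> model -> count.
func inventory(nodes []gpuNode) map[string]map[string]int {
	out := make(map[string]map[string]int, len(nodes))
	for _, n := range nodes {
		models := make(map[string]int, len(n.Accelerators))
		for _, a := range n.Accelerators {
			models[a.Model] = a.Count
		}
		out[n.Name] = models
	}
	return out
}

// gpuTypes projects the same list into node name -> one model string.
// Later entries overwrite earlier ones, so the last accelerator wins.
func gpuTypes(nodes []gpuNode) map[string]string {
	out := make(map[string]string, len(nodes))
	for _, n := range nodes {
		for _, a := range n.Accelerators {
			out[n.Name] = a.Model
		}
	}
	return out
}

func main() {
	nodes := listGPUNodes() // one query, reused by every projection
	fmt.Println(inventory(nodes))
	fmt.Println(gpuTypes(nodes))
}
```

A third projection that also carries labels would then be one more small function over the same slice rather than another copy of the vendor loop.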

Behavior preservation

This is a pure refactor. Verified invariants:

Discover() output shape and contents unchanged — existing tests pass without modification
DiscoverUsage() unchanged — still uses discoverNodeGPUTypes internally
discoverNodeGPUTypes() multi-vendor tie-break preserved (intel > amd > nvidia if multiple labels present). Locked down by new TestDiscoverNodeGPUTypes_MultiVendorNode_LastWins.
WVA_NODE_SELECTOR handling unchanged; error on invalid selector is now explicitly tested
Vendor iteration order (nvidia.com, amd.com, intel.com) preserved
Multi-vendor nodes merged into a single NodeInfo entry with both accelerators
Nodes without Allocatable resource still included with Count=0
Non-GPU nodes excluded from DiscoverNodes output (same as Discover)
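Several of the invariants above (vendor iteration order, merging multi-vendor nodes into one entry, Count=0 without Allocatable, exclusion of non-GPU nodes) can be illustrated with a small merge loop. This is a sketch under stated assumptions: fakeNode stands in for corev1.Node, and the "<vendor>/gpu.product" label and "<vendor>/gpu" resource naming patterns are assumptions for illustration, not taken from the PR.

```go
package main

import "fmt"

// vendors lists vendor domains in the iteration order stated in the PR.
var vendors = []string{"nvidia.com", "amd.com", "intel.com"}

// fakeNode is a hypothetical stand-in for corev1.Node.
type fakeNode struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int // resource name -> allocatable count
}

type accelerator struct {
	Model string
	Count int
}

type nodeInfo struct {
	Name         string
	Accelerators []accelerator
}

// productLabel and gpuResource use assumed naming patterns.
func productLabel(vendor string) string { return vendor + "/gpu.product" }
func gpuResource(vendor string) string  { return vendor + "/gpu" }

// mergeByVendor walks vendors in order and folds every labeled node into a
// single entry, appending one accelerator per matching vendor.
func mergeByVendor(nodes []fakeNode) []nodeInfo {
	index := map[string]int{} // node name -> position in out
	var out []nodeInfo
	for _, vendor := range vendors {
		for _, n := range nodes {
			model, ok := n.Labels[productLabel(vendor)]
			if !ok {
				continue // no GPU label for this vendor; CPU-only nodes never match
			}
			i, seen := index[n.Name]
			if !seen {
				i = len(out)
				index[n.Name] = i
				out = append(out, nodeInfo{Name: n.Name})
			}
			// A missing Allocatable entry reads as the zero value, so Count is 0.
			count := n.Allocatable[gpuResource(vendor)]
			out[i].Accelerators = append(out[i].Accelerators, accelerator{Model: model, Count: count})
		}
	}
	return out
}

// sampleNodes returns one multi-vendor node, one labeled node without
// allocatable resources, and one CPU-only node.
func sampleNodes() []fakeNode {
	return []fakeNode{
		{
			Name:        "multi",
			Labels:      map[string]string{"nvidia.com/gpu.product": "nv", "intel.com/gpu.product": "in"},
			Allocatable: map[string]int{"nvidia.com/gpu": 2, "intel.com/gpu": 1},
		},
		{Name: "labeled-only", Labels: map[string]string{"amd.com/gpu.product": "am"}},
		{Name: "cpu-only"},
	}
}

func main() {
	for _, n := range mergeByVendor(sampleNodes()) {
		fmt.Printf("%s %+v\n", n.Name, n.Accelerators)
	}
}
```

Because accelerators are appended in vendor order, a projection that keeps only the last entry per node reproduces the intel > amd > nvidia tie-break.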

Tests

All existing discovery tests pass unchanged (behavior-equivalence check).

Nine new tests added (eight for DiscoverNodes, one regression test for discoverNodeGPUTypes):

TestDiscoverNodes_SingleVendor — basic labels + accelerators capture
TestDiscoverNodes_MultiVendorNode — merged single entry
TestDiscoverNodes_RespectsWVANodeSelector — sharding env var respected
TestDiscoverNodes_NodeWithGPULabelButNoAllocatable — Count=0 preserved
TestDiscoverNodes_EmptyCluster — empty input → empty output
TestDiscoverNodes_ExcludesCPUOnlyNodes — filters consistent with Discover
TestDiscoverNodes_InvalidWVANodeSelectorReturnsError — error path tested for both DiscoverNodes and Discover
TestDiscoverNodes_LabelsAreIndependentCopy — mutation-safety contract
TestDiscoverNodeGPUTypes_MultiVendorNode_LastWins — behavior-preservation regression test
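The mutation-safety contract checked by TestDiscoverNodes_LabelsAreIndependentCopy could be met by cloning the label map before handing it to callers. The sketch below uses maps.Clone from the standard library (Go 1.21+); copyLabels is a hypothetical helper name, and whether the PR copies labels this way is an assumption.

```go
package main

import (
	"fmt"
	"maps"
)

// copyLabels returns an independent copy of a node's label map, so callers
// can mutate the result without touching the source map.
// maps.Clone returns nil for a nil input.
func copyLabels(src map[string]string) map[string]string {
	return maps.Clone(src)
}

func main() {
	nodeLabels := map[string]string{"zone": "z1"}

	info := copyLabels(nodeLabels)
	info["zone"] = "changed" // mutate the copy only
	info["extra"] = "x"

	// The original map still holds its single, unchanged entry.
	fmt.Println(nodeLabels["zone"], len(nodeLabels))
}
```

A shallow clone is enough here because label keys and values are strings, which are immutable in Go.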

Verification

go build ./... — clean
go test ./internal/... — all packages pass (including envtest suites for actuator, controller, engines/saturation)
golangci-lint run ./internal/discovery/... ./internal/engines/pipeline/... — 0 issues

Non-goals

No new behavior for existing callers
DiscoverNodes not consumed anywhere yet; consumption happens in the follow-up namespace-scoped limiter PR
No changes outside internal/discovery/ and the pipeline test mock

Follow-ups (out of scope)

Namespace-scoped GPU inventory and limiter (issue) — will consume DiscoverNodes
Epic: Limiter improvements

Test plan

go build ./... clean
go test ./internal/discovery/... — 18/18 pass
go test ./internal/engines/pipeline/... — all pass (mock updated)
go test ./internal/engines/saturation/... (envtest) — all pass
golangci-lint run ./internal/discovery/... ./internal/engines/pipeline/... — 0 issues
CI pipelines green on PR

… listing

Extracts a single internal helper (listGPUNodes) that queries GPU-bearing nodes across all supported vendors (NVIDIA, AMD, Intel) and returns a canonical per-node view. The existing Discover() and discoverNodeGPUTypes() methods are re-implemented as thin projections over this helper, removing near-identical vendor-loop duplication.

Adds a new NodeInfo type (Name, Labels, Accelerators) and a new NodeDiscovery interface with a DiscoverNodes method that returns labeled per-node info. NodeDiscovery is included in FullDiscovery so K8sWithGpuOperator implements all three facets.

This is a pure refactor: no behavior change for existing callers. The new DiscoverNodes method is not yet consumed anywhere; it unblocks upcoming label-aware features (namespace-scoped limiter, node-aware bin packing).

- Preserves WVA_NODE_SELECTOR handling, vendor iteration order, and multi-vendor node merging.
- Labels in NodeInfo are an independent copy; mutation does not affect the underlying corev1.Node.
- Adds 7 new unit tests for DiscoverNodes and keeps existing 9 tests unchanged as equivalence checks.
- Updates mockFullDiscovery in pipeline tests to satisfy the new interface method.
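For orientation, here is one way the NodeInfo type and NodeDiscovery interface could look, together with a label-aware consumer of the kind the follow-up work describes. Only the names NodeInfo, NodeDiscovery, DiscoverNodes, and the three field names come from the commit message; the method signature, the field types, staticDiscovery, gpusWithLabel, and the "pool" label are assumptions made for this sketch.

```go
package main

import (
	"context"
	"fmt"
)

// NodeInfo mirrors the fields named in the commit message.
// The field types are assumptions; the real Accelerators value type differs.
type NodeInfo struct {
	Name         string
	Labels       map[string]string
	Accelerators map[string]int // assumed: model -> count
}

// NodeDiscovery exposes labeled per-node info. The signature is assumed.
type NodeDiscovery interface {
	DiscoverNodes(ctx context.Context) ([]NodeInfo, error)
}

// staticDiscovery is a hypothetical test double, similar in spirit to a mock
// that must gain the new method to keep satisfying the interface.
type staticDiscovery struct{ nodes []NodeInfo }

func (s staticDiscovery) DiscoverNodes(ctx context.Context) ([]NodeInfo, error) {
	return s.nodes, nil
}

// Compile-time check that the double implements the interface.
var _ NodeDiscovery = staticDiscovery{}

// gpusWithLabel sums accelerator counts on nodes carrying label key=value.
func gpusWithLabel(ctx context.Context, d NodeDiscovery, key, value string) (int, error) {
	nodes, err := d.DiscoverNodes(ctx)
	if err != nil {
		return 0, fmt.Errorf("discover nodes: %w", err)
	}
	total := 0
	for _, n := range nodes {
		if v, ok := n.Labels[key]; !ok || v != value {
			continue // label missing or different value
		}
		for _, count := range n.Accelerators {
			total += count
		}
	}
	return total, nil
}

// exampleTotal runs the consumer against fixed sample data.
func exampleTotal() (int, error) {
	d := staticDiscovery{nodes: []NodeInfo{
		{Name: "n1", Labels: map[string]string{"pool": "a"}, Accelerators: map[string]int{"x": 4}},
		{Name: "n2", Labels: map[string]string{"pool": "b"}, Accelerators: map[string]int{"x": 8}},
		{Name: "n3", Labels: map[string]string{"pool": "a"}, Accelerators: map[string]int{"y": 2}},
	}}
	return gpusWithLabel(context.Background(), d, "pool", "a")
}

func main() {
	total, err := exampleTotal()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("GPUs in pool a:", total)
}
```

Keeping the consumer dependent only on the small interface means it can be tested with a static double, without a cluster or envtest.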

ev-shindin · 2026-04-15T10:56:05Z

/ok-to-test

github-actions · 2026-04-15T10:56:16Z

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

github-actions · 2026-04-15T10:56:20Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

github-actions · 2026-04-15T14:42:12Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource	Total	Allocated	Available
GPUs	50	45	5

Cluster	Value
Nodes	16 (7 with GPUs)
Total CPU	993 cores
Total Memory	10383 Gi
GPUs required	4 (min) / 6 (recommended)

ev-shindin requested a review from lionelvillard April 15, 2026 09:13

ev-shindin self-assigned this Apr 15, 2026

ev-shindin linked an issue Apr 15, 2026 that may be closed by this pull request

Refactor K8sWithGpuOperator to expose per-node info via NodeDiscovery #1012

Open

ev-shindin wants to merge 1 commit into llm-d:main from ev-shindin:refactor/node-discovery
