refactor(discovery): add NodeDiscovery interface and consolidate node listing #1013

Open
ev-shindin wants to merge 1 commit into llm-d:main from ev-shindin:refactor/node-discovery

@ev-shindin
Collaborator

Summary

Pure refactor of internal/discovery/k8s_with_gpu_operator.go — no behavior change for existing callers. Extracts a single internal node-listing helper, adds a new NodeDiscovery interface that exposes per-node labels, and re-implements the existing Discover and discoverNodeGPUTypes methods as projections over the new helper.

This unblocks upcoming label-aware features (namespace-scoped limiter, node-aware bin packing) that need node.Labels alongside accelerator capacity, without forcing a third near-identical vendor-loop node query.

Motivation

K8sWithGpuOperator currently has two near-identical vendor-loop node queries:

  • Discover() (lines 33-99) — builds map[nodeName]map[model]AcceleratorModelInfo
  • discoverNodeGPUTypes() (lines 147-188) — builds map[nodeName]string

Both loop over vendors, build the same selector, list nodes, iterate results, and differ only in the projection. Upcoming work (namespace-scoped limiter, epic limiter improvements) needs a third projection — per-node info including node.Labels. Adding a third independent vendor-loop would lock in the duplication pattern. This PR consolidates node discovery into a single primitive with multiple public projections.
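A minimal sketch of the "one primitive, several projections" shape described above. It is not the code in this PR: `node` stands in for `corev1.Node` so the example needs no Kubernetes imports, `gpuNode` is a hypothetical canonical view, and the label key `<vendor>/gpu.product` and resource name `<vendor>/gpu` are assumptions made for illustration.

```go
package main

import "fmt"

// node stands in for corev1.Node so the sketch has no Kubernetes dependency.
type node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int
}

// gpuNode is a hypothetical canonical per-node view returned by the helper.
type gpuNode struct {
	Name   string
	Models []string       // GPU models in vendor iteration order
	Counts map[string]int // model -> allocatable count (0 if absent)
}

// Vendor order as listed in this PR.
var vendors = []string{"nvidia.com", "amd.com", "intel.com"}

// listGPUNodes runs the vendor loop once. The label key "<vendor>/gpu.product"
// and resource name "<vendor>/gpu" are assumptions made for this sketch.
func listGPUNodes(nodes []node) []*gpuNode {
	byName := map[string]*gpuNode{}
	var out []*gpuNode
	for _, vendor := range vendors {
		for _, n := range nodes {
			model, ok := n.Labels[vendor+"/gpu.product"]
			if !ok {
				continue // no label from this vendor on this node
			}
			g := byName[n.Name]
			if g == nil {
				g = &gpuNode{Name: n.Name, Counts: map[string]int{}}
				byName[n.Name] = g
				out = append(out, g)
			}
			g.Models = append(g.Models, model)
			g.Counts[model] = n.Allocatable[vendor+"/gpu"]
		}
	}
	return out
}

// inventory projects the canonical view into node -> model -> count.
func inventory(gs []*gpuNode) map[string]map[string]int {
	inv := make(map[string]map[string]int, len(gs))
	for _, g := range gs {
		inv[g.Name] = g.Counts
	}
	return inv
}

// gpuTypes projects the canonical view into node -> model; on a multi-vendor
// node the last vendor in iteration order wins.
func gpuTypes(gs []*gpuNode) map[string]string {
	types := make(map[string]string, len(gs))
	for _, g := range gs {
		types[g.Name] = g.Models[len(g.Models)-1]
	}
	return types
}

func main() {
	nodes := []node{
		{Name: "a", Labels: map[string]string{"nvidia.com/gpu.product": "A100"},
			Allocatable: map[string]int{"nvidia.com/gpu": 8}},
		{Name: "cpu", Labels: map[string]string{"role": "worker"}},
	}
	gs := listGPUNodes(nodes)
	fmt.Println(inventory(gs), gpuTypes(gs))
}
```

A third projection (for example one that also returns labels) would reuse `listGPUNodes` rather than repeat the vendor loop, which is the point of the consolidation.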

Behavior preservation

This is a pure refactor. Verified invariants:

  • Discover() output shape and contents unchanged — existing tests pass without modification
  • DiscoverUsage() unchanged — still uses discoverNodeGPUTypes internally
  • discoverNodeGPUTypes() multi-vendor tie-break preserved (intel > amd > nvidia if multiple labels present). Locked down by new TestDiscoverNodeGPUTypes_MultiVendorNode_LastWins.
  • WVA_NODE_SELECTOR handling unchanged; error on invalid selector is now explicitly tested
  • Vendor iteration order (nvidia.com, amd.com, intel.com) preserved
  • Multi-vendor nodes merged into a single NodeInfo entry with both accelerators
  • Nodes without Allocatable resource still included with Count=0
  • Non-GPU nodes excluded from DiscoverNodes output (same as Discover)
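The selector invariant can be illustrated with a simplified sketch. It assumes `WVA_NODE_SELECTOR` holds a comma-separated list of `key=value` terms; `parseSelector`, `matches`, and `filterNodes` are hypothetical helpers, and the equality-only parser is a stand-in for the Kubernetes label-selector parser rather than the logic in this PR.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// node is a stand-in for corev1.Node with only the fields the sketch needs.
type node struct {
	Name   string
	Labels map[string]string
}

// parseSelector handles only comma-separated key=value terms. It is a
// simplified stand-in for the Kubernetes label-selector parser.
func parseSelector(raw string) (map[string]string, error) {
	sel := map[string]string{}
	if strings.TrimSpace(raw) == "" {
		return sel, nil // empty selector matches every node
	}
	for _, term := range strings.Split(raw, ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(term), "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid selector term %q", term)
		}
		sel[key] = value
	}
	return sel, nil
}

// matches reports whether labels satisfy every term of the selector.
func matches(labels, sel map[string]string) bool {
	for k, want := range sel {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// filterNodes returns the names of matching nodes, or an error when the
// selector cannot be parsed (the error path the new test locks down).
func filterNodes(raw string, nodes []node) ([]string, error) {
	sel, err := parseSelector(raw)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, n := range nodes {
		if matches(n.Labels, sel) {
			names = append(names, n.Name)
		}
	}
	return names, nil
}

func main() {
	nodes := []node{
		{Name: "n1", Labels: map[string]string{"pool": "a"}},
		{Name: "n2", Labels: map[string]string{"pool": "b"}},
	}
	names, err := filterNodes(os.Getenv("WVA_NODE_SELECTOR"), nodes)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```

Returning the parse error instead of silently listing all nodes keeps a mistyped selector from widening a shard to the whole cluster.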

Tests

All existing discovery tests pass unchanged (behavior-equivalence check).

Nine new tests added:

  • TestDiscoverNodes_SingleVendor — basic labels + accelerators capture
  • TestDiscoverNodes_MultiVendorNode — merged single entry
  • TestDiscoverNodes_RespectsWVANodeSelector — sharding env var respected
  • TestDiscoverNodes_NodeWithGPULabelButNoAllocatable — Count=0 preserved
  • TestDiscoverNodes_EmptyCluster — empty input → empty output
  • TestDiscoverNodes_ExcludesCPUOnlyNodes — filters consistent with Discover
  • TestDiscoverNodes_InvalidWVANodeSelectorReturnsError — error path tested for both DiscoverNodes and Discover
  • TestDiscoverNodes_LabelsAreIndependentCopy — mutation-safety contract
  • TestDiscoverNodeGPUTypes_MultiVendorNode_LastWins — behavior-preservation regression test
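The mutation-safety contract checked by TestDiscoverNodes_LabelsAreIndependentCopy can be sketched as follows. This is an illustration, not the test from the PR: `newNodeInfo` is a hypothetical constructor, and `NodeInfo` is reduced to the two fields the check needs.

```go
package main

import "fmt"

// NodeInfo is reduced to the fields relevant to the copy contract.
type NodeInfo struct {
	Name   string
	Labels map[string]string
}

// newNodeInfo copies the label map entry by entry so the result does not
// alias the caller's (or the API object's) map.
func newNodeInfo(name string, labels map[string]string) NodeInfo {
	copied := make(map[string]string, len(labels))
	for k, v := range labels {
		copied[k] = v
	}
	return NodeInfo{Name: name, Labels: copied}
}

func main() {
	source := map[string]string{"zone": "a"}
	info := newNodeInfo("node-1", source)

	// Mutate the returned labels, then confirm the source map is untouched.
	info.Labels["zone"] = "mutated"
	info.Labels["extra"] = "x"
	fmt.Println(source["zone"], len(source))
}
```

Assigning the map directly (`Labels: labels`) would share storage with the source, so a consumer editing `info.Labels` would silently change the source object as well.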

Verification

  • go build ./... — clean
  • go test ./internal/... — all packages pass (including envtest suites for actuator, controller, engines/saturation)
  • golangci-lint run ./internal/discovery/... ./internal/engines/pipeline/... — 0 issues

Non-goals

  • No new behavior for existing callers
  • DiscoverNodes not consumed anywhere yet; consumption happens in the follow-up namespace-scoped limiter PR
  • No changes outside internal/discovery/ and the pipeline test mock

Follow-ups (out of scope)

Test plan

  • go build ./... clean
  • go test ./internal/discovery/... — 18/18 pass
  • go test ./internal/engines/pipeline/... — all pass (mock updated)
  • go test ./internal/engines/saturation/... (envtest) — all pass
  • golangci-lint run ./internal/discovery/... ./internal/engines/pipeline/... — 0 issues
  • CI pipelines green on PR

… listing

Extracts a single internal helper (listGPUNodes) that queries GPU-bearing
nodes across all supported vendors (NVIDIA, AMD, Intel) and returns a
canonical per-node view. The existing Discover() and discoverNodeGPUTypes()
methods are re-implemented as thin projections over this helper, removing
near-identical vendor-loop duplication.

Adds a new NodeInfo type (Name, Labels, Accelerators) and a new
NodeDiscovery interface with a DiscoverNodes method that returns labeled
per-node info. NodeDiscovery is included in FullDiscovery so
K8sWithGpuOperator implements all three facets.
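One way the types described above could fit together, as a hedged sketch. Only `NodeInfo` (Name, Labels, Accelerators), `NodeDiscovery`, `DiscoverNodes`, `FullDiscovery`, `Discover`, and `DiscoverUsage` are names from this PR; the method signatures, the `AcceleratorInfo` element type, the facet names `CapacityDiscovery` and `UsageDiscovery`, and the `fakeDiscovery` double are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// AcceleratorInfo is a hypothetical element type for NodeInfo.Accelerators.
type AcceleratorInfo struct {
	Model string
	Count int
}

// NodeInfo carries the three fields named in the commit message.
type NodeInfo struct {
	Name         string
	Labels       map[string]string
	Accelerators []AcceleratorInfo
}

// NodeDiscovery exposes labeled per-node info. The signature is assumed.
type NodeDiscovery interface {
	DiscoverNodes(ctx context.Context) ([]NodeInfo, error)
}

// CapacityDiscovery and UsageDiscovery are hypothetical names for the two
// existing facets; their return types are simplified for the sketch.
type CapacityDiscovery interface {
	Discover(ctx context.Context) (map[string]map[string]int, error)
}

type UsageDiscovery interface {
	DiscoverUsage(ctx context.Context) (map[string]int, error)
}

// FullDiscovery composes all three facets by embedding.
type FullDiscovery interface {
	CapacityDiscovery
	UsageDiscovery
	NodeDiscovery
}

// fakeDiscovery is a test double backed by a fixed node list.
type fakeDiscovery struct{ nodes []NodeInfo }

// Compile-time check that the double satisfies every facet.
var _ FullDiscovery = (*fakeDiscovery)(nil)

func (f *fakeDiscovery) DiscoverNodes(ctx context.Context) ([]NodeInfo, error) {
	return f.nodes, nil
}

func (f *fakeDiscovery) Discover(ctx context.Context) (map[string]map[string]int, error) {
	out := make(map[string]map[string]int, len(f.nodes))
	for _, n := range f.nodes {
		out[n.Name] = map[string]int{}
		for _, a := range n.Accelerators {
			out[n.Name][a.Model] = a.Count
		}
	}
	return out, nil
}

func (f *fakeDiscovery) DiscoverUsage(ctx context.Context) (map[string]int, error) {
	return map[string]int{}, nil
}

// nodesWithLabel shows a label-aware consumer that needs only NodeDiscovery.
func nodesWithLabel(d NodeDiscovery, key, value string) ([]string, error) {
	nodes, err := d.DiscoverNodes(context.Background())
	if err != nil {
		return nil, err
	}
	var names []string
	for _, n := range nodes {
		if n.Labels[key] == value {
			names = append(names, n.Name)
		}
	}
	return names, nil
}

func main() {
	d := &fakeDiscovery{nodes: []NodeInfo{
		{Name: "n1", Labels: map[string]string{"team": "ml"},
			Accelerators: []AcceleratorInfo{{Model: "A100", Count: 4}}},
		{Name: "n2", Labels: map[string]string{"team": "web"}},
	}}
	names, err := nodesWithLabel(d, "team", "ml")
	fmt.Println(names, err)
}
```

Because `FullDiscovery` embeds the new facet, any existing implementation or mock must gain a `DiscoverNodes` method to keep compiling, while consumers that need only labels can depend on the narrower `NodeDiscovery`.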

This is a pure refactor: no behavior change for existing callers. The
new DiscoverNodes method is not yet consumed anywhere; it unblocks
upcoming label-aware features (namespace-scoped limiter, node-aware
bin packing).

- Preserves WVA_NODE_SELECTOR handling, vendor iteration order, and
  multi-vendor node merging.
- Labels in NodeInfo are an independent copy; mutation does not affect
  the underlying corev1.Node.
- Adds 7 new unit tests for DiscoverNodes and keeps existing 9 tests
  unchanged as equivalence checks.
- Updates mockFullDiscovery in pipeline tests to satisfy the new
  interface method.
@ev-shindin self-assigned this Apr 15, 2026
@ev-shindin
Collaborator Author

/ok-to-test

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

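| Resource | Total | Allocated | Available |
|----------|-------|-----------|-----------|
| GPUs     | 50    | 45        | 5         |

| Cluster       | Value                     |
|---------------|---------------------------|
| Nodes         | 16 (7 with GPUs)          |
| Total CPU     | 993 cores                 |
| Total Memory  | 10383 Gi                  |
| GPUs required | 4 (min) / 6 (recommended) |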

Development

Successfully merging this pull request may close these issues.

Refactor K8sWithGpuOperator to expose per-node info via NodeDiscovery

1 participant