refactor(discovery): add NodeDiscovery interface and consolidate node listing#1013
Open
ev-shindin wants to merge 1 commit intollm-d:mainfrom
Open
refactor(discovery): add NodeDiscovery interface and consolidate node listing#1013ev-shindin wants to merge 1 commit intollm-d:mainfrom
ev-shindin wants to merge 1 commit intollm-d:mainfrom
Conversation
… listing Extracts a single internal helper (listGPUNodes) that queries GPU-bearing nodes across all supported vendors (NVIDIA, AMD, Intel) and returns a canonical per-node view. The existing Discover() and discoverNodeGPUTypes() methods are re-implemented as thin projections over this helper, removing near-identical vendor-loop duplication. Adds a new NodeInfo type (Name, Labels, Accelerators) and a new NodeDiscovery interface with a DiscoverNodes method that returns labeled per-node info. NodeDiscovery is included in FullDiscovery so K8sWithGpuOperator implements all three facets. This is a pure refactor: no behavior change for existing callers. The new DiscoverNodes method is not yet consumed anywhere; it unblocks upcoming label-aware features (namespace-scoped limiter, node-aware bin packing). - Preserves WVA_NODE_SELECTOR handling, vendor iteration order, and multi-vendor node merging. - Labels in NodeInfo are an independent copy; mutation does not affect the underlying corev1.Node. - Adds 7 new unit tests for DiscoverNodes and keeps existing 9 tests unchanged as equivalence checks. - Updates mockFullDiscovery in pipeline tests to satisfy the new interface method.
Collaborator
Author
|
/ok-to-test |
Contributor
|
🚀 Kind E2E (full) triggered by |
Contributor
|
🚀 OpenShift E2E — approve and run ( |
Contributor
GPU Pre-flight Check ✅GPUs are available for e2e-openshift tests. Proceeding with deployment.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pure refactor of
internal/discovery/k8s_with_gpu_operator.go— no behavior change for existing callers. Extracts a single internal node-listing helper, adds a newNodeDiscoveryinterface that exposes per-node labels, and re-implements the existingDiscoveranddiscoverNodeGPUTypesmethods as projections over the new helper.This unblocks upcoming label-aware features (namespace-scoped limiter, node-aware bin packing) that need
node.Labelsalongside accelerator capacity, without forcing a third near-identical vendor-loop node query.Motivation
K8sWithGpuOperatorcurrently has two near-identical vendor-loop node queries:Discover()(lines 33-99) — buildsmap[nodeName]map[model]AcceleratorModelInfodiscoverNodeGPUTypes()(lines 147-188) — buildsmap[nodeName]stringBoth loop over
vendors, build the same selector, list nodes, iterate results, and differ only in the projection. Upcoming work (namespace-scoped limiter, epic limiter improvements) needs a third projection — per-node info includingnode.Labels. Adding a third independent vendor-loop would lock in the duplication pattern. This PR consolidates node discovery into a single primitive with multiple public projections.Behavior preservation
This is a pure refactor. Verified invariants:
Discover()output shape and contents unchanged — existing tests pass without modificationDiscoverUsage()unchanged — still usesdiscoverNodeGPUTypesinternallydiscoverNodeGPUTypes()multi-vendor tie-break preserved (intel > amd > nvidia if multiple labels present). Locked down by newTestDiscoverNodeGPUTypes_MultiVendorNode_LastWins.WVA_NODE_SELECTORhandling unchanged; error on invalid selector is now explicitly testednvidia.com,amd.com,intel.com) preservedNodeInfoentry with both acceleratorsAllocatableresource still included withCount=0DiscoverNodesoutput (same asDiscover)Tests
All existing discovery tests pass unchanged (behavior-equivalence check).
Eight new tests added:
TestDiscoverNodes_SingleVendor— basic labels + accelerators captureTestDiscoverNodes_MultiVendorNode— merged single entryTestDiscoverNodes_RespectsWVANodeSelector— sharding env var respectedTestDiscoverNodes_NodeWithGPULabelButNoAllocatable—Count=0preservedTestDiscoverNodes_EmptyCluster— empty input → empty outputTestDiscoverNodes_ExcludesCPUOnlyNodes— filters consistent withDiscoverTestDiscoverNodes_InvalidWVANodeSelectorReturnsError— error path tested for bothDiscoverNodesandDiscoverTestDiscoverNodes_LabelsAreIndependentCopy— mutation-safety contractTestDiscoverNodeGPUTypes_MultiVendorNode_LastWins— behavior-preservation regression testVerification
go build ./...— cleango test ./internal/...— all packages pass (including envtest suites for actuator, controller, engines/saturation)golangci-lint run ./internal/discovery/... ./internal/engines/pipeline/...—0 issuesNon-goals
DiscoverNodesnot consumed anywhere yet; consumption happens in the follow-up namespace-scoped limiter PRinternal/discovery/and the pipeline test mockFollow-ups (out of scope)
DiscoverNodesTest plan
go build ./...cleango test ./internal/discovery/...— 18/18 passgo test ./internal/engines/pipeline/...— all pass (mock updated)go test ./internal/engines/saturation/...(envtest) — all passgolangci-lint run ./internal/discovery/... ./internal/engines/pipeline/...— 0 issues