Refactor K8sWithGpuOperator to expose per-node info via NodeDiscovery #1012

@ev-shindin

Background

internal/discovery/k8s_with_gpu_operator.go currently has two near-identical vendor-loop node queries:

  • Discover() (lines 33-99) — queries nodes per vendor, returns map[nodeName]map[model]AcceleratorModelInfo
  • discoverNodeGPUTypes() (lines 147-188) — same vendor-loop, returns map[nodeName]string (just the model name)

Both loop over vendors, build the same label selector, list nodes, and iterate over the results. They differ only in how they project the result. Additionally, neither captures node.Labels, which will be needed by upcoming features (see "Motivation" below).

Motivation

Upcoming work needs a third projection of the same cluster state — per-node info including labels, not just accelerator models. Concrete use cases:

Adding a third independent vendor-loop for this would lock in the duplication pattern. This PR consolidates all per-node discovery into a single internal helper with multiple public projections.
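
To make the "one helper, several projections" idea concrete, here is a minimal sketch. The field names on NodeInfo, the Accelerator type, the stand-in AcceleratorModelInfo fields, and the function names modelsByNode and gpuTypeByNode are all assumptions for illustration and are not taken from the existing code. The first-accelerator-wins rule in gpuTypeByNode is likewise an assumption about current behavior and has to be verified against the existing loop.

```go
package main

// Accelerator describes one GPU model found on a node (hypothetical shape).
type Accelerator struct {
	Model  string
	Count  int
	Memory string
}

// NodeInfo is a stand-in for the proposed per-node record; field names are assumptions.
type NodeInfo struct {
	Name         string
	Labels       map[string]string
	Accelerators []Accelerator // kept in vendor iteration order
}

// AcceleratorModelInfo is a simplified stand-in for the existing type of the same name.
type AcceleratorModelInfo struct {
	Count  int
	Memory string
}

// modelsByNode projects the node list into the shape Discover() returns:
// map[nodeName]map[model]AcceleratorModelInfo.
func modelsByNode(nodes []NodeInfo) map[string]map[string]AcceleratorModelInfo {
	out := make(map[string]map[string]AcceleratorModelInfo, len(nodes))
	for _, n := range nodes {
		models := make(map[string]AcceleratorModelInfo, len(n.Accelerators))
		for _, a := range n.Accelerators {
			models[a.Model] = AcceleratorModelInfo{Count: a.Count, Memory: a.Memory}
		}
		out[n.Name] = models
	}
	return out
}

// gpuTypeByNode projects the node list into the shape discoverNodeGPUTypes()
// returns: map[nodeName]modelName. Assumption: the first accelerator in
// vendor order wins when a node carries several.
func gpuTypeByNode(nodes []NodeInfo) map[string]string {
	out := make(map[string]string, len(nodes))
	for _, n := range nodes {
		if len(n.Accelerators) > 0 {
			out[n.Name] = n.Accelerators[0].Model
		}
	}
	return out
}
```

With this split, a labels-aware consumer reads NodeInfo directly, while the two existing methods become thin wrappers over the same node list.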

Non-goals

  • No changes to CapacityDiscovery or UsageDiscovery interface signatures. Existing callers continue to work unchanged.
  • No new behavior. This is a pure refactor — same inputs produce the same outputs for existing public methods.
  • Not consuming DiscoverNodes anywhere yet. That happens in the namespace-scoped-limiter PR.

Behavior preservation checklist

The refactor must preserve:

  • WVA_NODE_SELECTOR environment variable handling
  • Vendor iteration order (nvidia.com, amd.com, intel.com)
  • Multi-vendor node handling (a node with both nvidia.com/gpu.product and amd.com/gpu.product-name labels should appear once in results with both accelerators)
  • Count taken from node.Status.Allocatable for the <vendor>/gpu resource
  • Memory taken from the <vendor>/gpu.memory label
  • Empty GPU count if Allocatable is missing the resource
  • discoverNodeGPUTypes tie-breaking if a node has multiple vendor labels (preserve existing behavior; likely first-vendor-wins in the current loop order)
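
The per-node part of this checklist can be sketched as a single extraction function. This is an illustration under stated assumptions, not the existing code: the node type is a simplified stand-in for corev1.Node (real Allocatable values are resource quantities, not integers), vendorKeys and acceleratorsOf are hypothetical names, and "empty count" is modeled as zero. Only the nvidia.com and amd.com product label keys come from this issue; the caller supplies the vendor list, so no other label key is guessed here.

```go
package main

// vendorKeys names the label read for one vendor (hypothetical helper type).
type vendorKeys struct {
	Domain       string // vendor domain, e.g. "nvidia.com"
	ProductLabel string // label key that holds the model name
}

// Accelerator is a hypothetical per-vendor result for one node.
type Accelerator struct {
	Vendor string
	Model  string
	Count  int64
	Memory string
}

// node is a simplified stand-in for corev1.Node.
type node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64 // resource name -> allocatable amount
}

// acceleratorsOf walks the vendors in the given order and returns one entry
// per vendor whose product label is set on the node.
func acceleratorsOf(n node, vendors []vendorKeys) []Accelerator {
	var out []Accelerator
	for _, v := range vendors {
		model, ok := n.Labels[v.ProductLabel]
		if !ok || model == "" {
			continue // node has no accelerator from this vendor
		}
		out = append(out, Accelerator{
			Vendor: v.Domain,
			Model:  model,
			// A missing map key yields zero, which models the
			// "Allocatable missing the resource" case (assumption).
			Count:  n.Allocatable[v.Domain+"/gpu"],
			Memory: n.Labels[v.Domain+"/gpu.memory"],
		})
	}
	return out
}
```

Because the output order follows the vendor slice, a first-vendor-wins rule for discoverNodeGPUTypes reduces to taking element zero, provided that really is what the current loop does.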

Acceptance criteria

  • New NodeInfo type in internal/discovery/types.go
  • New NodeDiscovery interface in internal/discovery/interface.go; added to FullDiscovery
  • listGPUNodes internal helper extracted; Discover and discoverNodeGPUTypes reimplemented as projections
  • New DiscoverNodes method on K8sWithGpuOperator
  • All existing tests in internal/discovery/*_test.go pass unchanged
  • New unit tests for DiscoverNodes covering:
    • Single-vendor node (NVIDIA)
    • Multi-vendor node (both NVIDIA and AMD labels) — one entry in result with both accelerators
    • Node labels captured correctly in NodeInfo.Labels
    • WVA_NODE_SELECTOR filters the node set
    • Node without GPU resources (labels present but Allocatable empty) handled gracefully
  • go build ./... clean, no linter issues
  • No changes outside internal/discovery/
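
As a rough sketch of the extracted helper and the multi-vendor test case, the vendor loop could merge per-vendor query results by node name. Everything below other than the name listGPUNodes is an assumption: the signature, the injected listFunc (which stands in for the Kubernetes node list call and takes the vendor's product label key as its selector), the node stand-in type, and the MatchedLabels field. WVA_NODE_SELECTOR handling is left out of this sketch.

```go
package main

// node is a simplified stand-in for corev1.Node.
type node struct {
	Name   string
	Labels map[string]string
}

// NodeInfo is a hypothetical per-node record with labels captured.
type NodeInfo struct {
	Name          string
	Labels        map[string]string
	MatchedLabels []string // product label keys that selected this node, in vendor order
}

// listFunc stands in for the node list call; it returns the nodes that
// carry the given label key.
type listFunc func(labelKey string) ([]node, error)

// listGPUNodes runs one query per vendor product label and merges the
// results so that each node appears exactly once, in first-seen order.
func listGPUNodes(list listFunc, productLabels []string) ([]NodeInfo, error) {
	index := make(map[string]int) // node name -> position in out
	var out []NodeInfo
	for _, key := range productLabels {
		nodes, err := list(key)
		if err != nil {
			return nil, err
		}
		for _, n := range nodes {
			i, seen := index[n.Name]
			if !seen {
				i = len(out)
				index[n.Name] = i
				out = append(out, NodeInfo{Name: n.Name, Labels: n.Labels})
			}
			out[i].MatchedLabels = append(out[i].MatchedLabels, key)
		}
	}
	return out, nil
}
```

Injecting the list function keeps the merge logic testable without a cluster: a unit test can pass a closure over a fixed node slice and assert that a node with both NVIDIA and AMD labels yields a single entry.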
