diff --git a/keps/prod-readiness/sig-node/5224.yaml b/keps/prod-readiness/sig-node/5224.yaml new file mode 100644 index 00000000000..b8892ed68b1 --- /dev/null +++ b/keps/prod-readiness/sig-node/5224.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 5224 +alpha: + approver: "johnbelamaric" diff --git a/keps/sig-node/5224-node-resource-discovery/README.md b/keps/sig-node/5224-node-resource-discovery/README.md new file mode 100644 index 00000000000..6083892aa93 --- /dev/null +++ b/keps/sig-node/5224-node-resource-discovery/README.md @@ -0,0 +1,1211 @@ + +# [KEP-5224](https://github.com/kubernetes/enhancements/issues/5224): Node Resource Discovery + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Story 4](#story-4) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [CRI API](#cri-api) + - [kubelet](#kubelet) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring 
Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [CRI: non-zone-based API](#cri-non-zone-based-api) + - [Non-CRI: New plugin interface to Kubelet](#non-cri-new-plugin-interface-to-kubelet) + - [Non-CRI: Config file](#non-cri-config-file) + - [Non-CRI: Kubernetes API object](#non-cri-kubernetes-api-object) + - [Non-CRI: Some combination of the two above (config file and API object)](#non-cri-some-combination-of-the-two-above-config-file-and-api-object) + - [Non-CRI: Extend cAdvisor](#non-cri-extend-cadvisor) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] 
"Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This proposal seeks to a new method how the cpu and memory resources of a +Kubernetes node are discovered. Currently, the kubelet relies on +[cAdvisor][cadvisor] to detect the cpu and memory resources and their topology. +However, [cAdvisor] has limitations that prevent if from meeting the evolving +user requirements –– especially in complex and/or dynamic environments. + +The proposal aims to replace [cAdvisor][cadvisor]-based resource discovery with +a more flexible approach that covers the foreseeable usage scenarios, from +simple static node resources to external controllers managing the node +capacity. + +## Motivation + +The kubelet has a limited and non-dynamic view on the available resources on +the node. Determining the resource capacity available for Kubernetes is often +more involved than just listing all available devices enumerated by the +operating system kernel. For example, when partitioning the system between +Kubernetes and non-Kubernetes-managed workloads. + +The kubelet currently uses [cAdvisor][cadvisor] to discover the node capacity +and resource topology of native compute resources (cpu, memory and hugepages) +at kubelet startup. This setup has multiple shortcomings: + +- configurability: cAdvisor advertises all resources enumerated by the OS + kernel, it is not possible to partition the system, e.g. 
only give a subset +of CPUs (and/or memory) to Kubernetes +- incomplete topology information: the cAdvisor data structure cannot + correctly represent a variety of modern CPU architectures +- incorrect representation of available compute: in the case of hot addition or + removal of compute resources, cAdvisor is incapable of determining the + updated capacity, leading to under/overutilization of resources. + +This proposal establishes a new mechanism for dynamically discovering and +updating the available resource capacity, to help mitigate the above +shortcomings of cAdvisor-based resource discovery. + +This is closely related to Node Resource Hot Plug ([KEP-3953][kep-3953]), which +opens a completely new territory of usage scenarios, from green computing (with +fast scaling) to dynamically attached resources like CXL memory (Compute +Express Link). Most widely available hypervisors support dynamic addition of compute resources such as +[CPU][linux-cpu-hotplug] and [Memory][linux-memory-hotplug]. One major +motivation for the proposal is to make it possible to implement custom +controllers to discover exotic hardware and/or dynamically adjust the node +resources. + +### Goals + +- Ability for the kubelet to get node resources (capacity) from the CRI runtime +- Retain the current functionality of the cpu, memory and topology managers +- API that can support dynamic node capacity changes + +### Non-Goals + +- Implement dynamic updates of node capacity (covered by [KEP-3953][kep-3953]) +- Change existing behavior on node capacity changes (covered e.g. by [KEP-3953][kep-3953]) +- Ephemeral storage + +## Proposal + +This proposal suggests the CRI runtime as the source of truth for resource +capacity. It has visibility and control over all running containers and the +resources in the host. The runtime could also be used to serve other clients +(e.g. docker) in addition to Kubernetes. 
CRI runtimes (containerd and CRI-O) +have a plugin mechanism, NRI, which can be extended to provide an additional +service for reporting resources; external resource discovery can then be +deployed as an [NRI][nri] plugin. + +### User Stories (Optional) + +#### Story 1 + +As a system administrator, I want to partition my node between the system, +non-Kubernetes-managed workloads and Kubernetes-managed workloads. + +#### Story 2 + +As a cluster operator, I want to hide the true hardware details, e.g. the actual +number of logical CPU cores, from Kubernetes. I want to do this to optimize +application performance by taking advantage of detailed knowledge of how +the hardware works behind the scenes. + + +#### Story 3 + +As a system administrator, I want my cluster to be aware of the compute +available throughout its lifespan and improve scheduling based on the node’s +compute capabilities. + +#### Story 4 + +As a system administrator, I want the cluster to gain insight into the host’s +architecture-specific capabilities and leverage them for improved +performance. + +### Notes/Constraints/Caveats (Optional) + + + +The reported capacity might not fit the currently running pods. This is more of a +concern in the context of [KEP-3953][kep-3953]. In practice, the kubelet/Kubernetes cannot +prevent resources from changing (e.g. CPUs getting offlined). + +### Risks and Mitigations + + + +## Design Details + +### CRI API + +A new streaming RPC is added to get the node resources. + +The proposed API for exposing resources takes inspiration from the +[NodeResourceTopology API][noderesourcetopology-api]. It allows +representing the resources in a tree structure that reflects the hardware +topology of the system. 
+ +```proto +import "k8s.io/apimachinery/pkg/api/resource/generated.proto"; + +// GetDynamicRuntimeConfig is a streaming interface for receiving dynamically +// changing runtime and node configuration +rpc GetDynamicRuntimeConfig(DynamicRuntimeConfigRequest) returns (stream DynamicRuntimeConfigResponse) {} + +message DynamicRuntimeConfigRequest {} + +message DynamicRuntimeConfigResponse { + ResourceTopology resource_topology = 1; +} + +message ResourceTopology { + repeated ResourceTopologyZone zones = 1; +} + +// ResourceTopologyZone represents a node in the topology tree +message ResourceTopologyZone { + string name = 1; + string type = 2; + string parent = 3; + repeated ResourceTopologyCost costs = 4; + map<string, string> attributes = 5; + repeated ResourceTopologyResourceInfo resources = 6; +} + +message ResourceTopologyCost { + string name = 1; + uint32 value = 2; +} + +message ResourceTopologyResourceInfo { + string name = 1; + k8s.io.apimachinery.pkg.api.resource.Quantity capacity = 2; +} +``` + +The well-known zone types and attributes that the kubelet understands are defined +as consts: + +```go +const ( + ResourceTopologyZoneCore = "Core" + ResourceTopologyZoneCacheGroup = "CacheGroup" + ResourceTopologyZonePackage = "Package" + ResourceTopologyZoneNUMANode = "NUMANode" +) + +const ( + // Attribute name for the CPU IDs of a zone that contains CPU resources, + // in "cpuset" format. + ResourceTopologyAttributeCPUIDs = "cpu-ids" + + // Attribute names to identify a machine. Used to replace the + // MachineID, BootID and SystemUUID fields of the cAdvisor + // MachineInfo. 
+ ResourceTopologyAttributeMachineID = "machine-id" + ResourceTopologyAttributeBootID = "boot-id" + ResourceTopologyAttributeSystemUUID = "system-uuid" +) +``` + +Consider the following simple (and partial) example topology: + +```mermaid +flowchart TD + A["system"] + A --> B["Socket 1"] + A --> C["Socket 2"] + B --> D["Core 1"] + B --> E["Core 2"] +``` + +The above could be turned into the following resource topology tree, using the well-known "Package" and "Core" zone types: + +```yaml +zones: + - name: "socket-1" + type: "Package" + parent: "root" + - name: "core-1" + type: "Core" + parent: "socket-1" + resources: + - name: "cpu" + capacity: 1 + - name: "core-2" + type: "Core" + parent: "socket-1" + resources: + - name: "cpu" + capacity: 1 + - name: "socket-2" + type: "Package" + parent: "root" +... +``` + +### kubelet + +The kubelet is updated to query the node resources via the new +GetDynamicRuntimeConfig CRI endpoint. The [cAdvisor][cadvisor] MachineInfo is +fully replaced with the information received from the CRI. The node capacity is +queried at kubelet startup and the Node object on the apiserver is updated only at +this time. A separate enhancement proposal ([KEP-3953][kep-3953]) concentrates +on runtime dynamic updates of resources. + +The kubelet keeps the streaming channel open and logs dynamic changes. However, +the kubelet does not otherwise react to the dynamic changes after startup; +dynamic node resize will be covered by [KEP-3953][kep-3953]. Node events about +capacity changes will be generated when dynamic node resize +([KEP-3953][kep-3953]) is enabled. 
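The startup-time handling described above could be sketched roughly as follows. This is an illustrative stand-in, not the actual kubelet implementation: a Go channel plays the role of the gRPC stream, and the `resourceInfo`/`configResponse` types are simplified stand-ins for the CRI messages.

```go
package main

import "fmt"

// resourceInfo is a simplified stand-in for the CRI
// ResourceTopologyResourceInfo message.
type resourceInfo struct {
	name     string
	capacity int64
}

// configResponse is a simplified stand-in for DynamicRuntimeConfigResponse.
type configResponse struct {
	resources []resourceInfo
}

// capacityFrom turns the resource list of one response into a capacity map.
func capacityFrom(msg configResponse) map[string]int64 {
	capacity := map[string]int64{}
	for _, r := range msg.resources {
		capacity[r.name] = r.capacity
	}
	return capacity
}

func main() {
	// A buffered channel stands in for the gRPC stream.
	stream := make(chan configResponse, 2)
	stream <- configResponse{resources: []resourceInfo{{"cpu", 8}, {"memory", 16 << 30}}}
	stream <- configResponse{resources: []resourceInfo{{"cpu", 16}}} // later hot-plug event
	close(stream)

	// The first message initializes the node capacity at startup.
	capacity := capacityFrom(<-stream)
	fmt.Println("cpu capacity:", capacity["cpu"])

	// Subsequent messages are only logged; acting on them is left to
	// dynamic node resize (KEP-3953).
	for msg := range stream {
		fmt.Println("ignoring dynamic update with", len(msg.resources), "resources")
	}
}
```

The key design point illustrated here is that the stream stays open for the lifetime of the kubelet, but only the first message affects the Node object until dynamic resize is enabled.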
+ +The following diagram illustrates the kubelet resource discovery flow: + +```mermaid +sequenceDiagram + participant k as kubelet + participant r as runtime + actor c as system/controller + r->c: detect resources + k->>r: open streaming interface + r->>k: send available resources + k->>k: update the node object, finish startup + Note over k, r: at startup + c->>r: resources change + r->>k: send updated resources to the kubelet + opt NodeResourceHotPlug=true + k->>k: re-initialize managers, update the node object + end +``` + +The handling of system-reserved and kube-reserved resources is not changed by +this proposal. + +There is a fallback to cAdvisor-based resource discovery even when the feature +gate is enabled: if the CRI runtime does not support the new +GetDynamicRuntimeConfig RPC, the kubelet will resort to cAdvisor. This will +ensure compatibility with older versions of CRI runtimes. + +If the kubelet is not able to parse or consume the resource topology tree, it +will set the node into NotReady state, with the node status, events (and kubelet +logs) indicating the details. + +If the cpu manager static policy is active, the kubelet requires details of the +hardware topology. It recognizes the well-known zone types ("NUMANode", +"Package", "Core", "CacheGroup") to construct the hardware topology needed to initialize the +cpu manager, memory manager and topology manager. The cpu manager static policy +requires information about CPU IDs, and thus the kubelet requires the "cpu-ids" +attribute to be present for zones containing cpu resources. If the cpu manager +static policy is active and cpu-ids are not available, the node is set into +NotReady state. If the "none" policy is in use, cpu-ids are not required. + +The kubelet reads the MachineID, BootID and SystemUUID from the attributes of the +resource topology tree. If this information is not present, the kubelet uses the +cAdvisor MachineInfo as a fallback. 
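To illustrate how a consumer could reconstruct the hardware topology from the zone tree, the sketch below walks parent links to find the enclosing "Package" zone of a core and reads its "cpu-ids" attribute. The `zone` type and `packageOf` helper are hypothetical; the actual kubelet data structures may differ.

```go
package main

import "fmt"

// zone is a hypothetical flattened form of the CRI ResourceTopologyZone message.
type zone struct {
	ztype      string
	parent     string
	attributes map[string]string
}

// packageOf walks parent links from the named zone until it reaches a zone of
// the well-known "Package" type. A missing or empty parent ends the walk.
func packageOf(zones map[string]zone, name string) (string, bool) {
	for name != "" {
		z, ok := zones[name]
		if !ok {
			return "", false
		}
		if z.ztype == "Package" {
			return name, true
		}
		name = z.parent
	}
	return "", false
}

func main() {
	// A tiny tree mirroring the socket/core example above.
	zones := map[string]zone{
		"socket-1": {ztype: "Package", parent: "root"},
		"core-1": {ztype: "Core", parent: "socket-1",
			attributes: map[string]string{"cpu-ids": "0-1"}},
	}
	pkg, ok := packageOf(zones, "core-1")
	fmt.Println(pkg, ok) // the package enclosing core-1

	// The cpu-ids attribute is what the cpu manager static policy needs.
	fmt.Println(zones["core-1"].attributes["cpu-ids"])
}
```

A real consumer would additionally validate that every zone containing cpu resources carries the "cpu-ids" attribute before initializing the static policy, as described above.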
+ +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- `k8s.io/kubernetes/pkg/kubelet`: `2025-05-19` - `71.5%` +- `k8s.io/kubernetes/pkg/kubelet/cadvisor`: `2025-05-19` - `30.8%` + +##### Integration tests + + + + + +For alpha, no new integration tests are planned. The proposal is node-local and +only touches the kubelet. + +##### e2e tests + + + +For alpha, no new e2e tests are planned. + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind a feature flag +- Fallback behavior: fall back to cAdvisor-based resource discovery if the feature + gate is enabled but the CRI runtime does not support this feature +- Kubelet unit tests implemented and enabled + +#### Beta + +- Feature flag enabled by default +- Released versions of CRI runtime implementations (containerd and CRI-O) + support the feature +- Drop support for the automatic fallback to cAdvisor-based resource discovery + (only use cAdvisor if the feature gate is disabled) + +#### GA + +- No bugs reported in the previous cycle +- Feature gate removed, feature is always enabled +- Drop support for cAdvisor-based resource discovery (no fallback available) + +### Upgrade / Downgrade Strategy + + + +There are no specific requirements for Kubernetes other than enabling/disabling the +kubelet feature gate. Fallback behavior (in alpha) ensures compatibility +with older CRI runtimes that do not support the feature. + +### Version Skew Strategy + + + +No Kubernetes components other than the kubelet are affected by this feature, so +there is no version skew concern between them. + +If the feature is enabled (in alpha) and the container runtime does not support +it, the kubelet resorts to the existing behavior of discovering resources via +cAdvisor. 
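The alpha fallback decision could be sketched as follows. This is an illustrative sketch, not kubelet code: the sentinel `errUnimplemented` stands in for a gRPC status with code Unimplemented, which is what an older runtime would return for an RPC it does not implement, and the function names are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnimplemented stands in for a gRPC Unimplemented status, returned by an
// older CRI runtime that does not know the new RPC.
var errUnimplemented = errors.New("rpc unimplemented")

// discoverCPUs tries the CRI path first and falls back to the cAdvisor-based
// path only when the runtime lacks the new RPC. Any other error is fatal:
// the node would be set NotReady.
func discoverCPUs(viaCRI func() (int64, error), viaCadvisor func() int64) (int64, string, error) {
	cpus, err := viaCRI()
	switch {
	case err == nil:
		return cpus, "cri", nil
	case errors.Is(err, errUnimplemented):
		return viaCadvisor(), "cadvisor", nil
	default:
		return 0, "", err
	}
}

func main() {
	oldRuntime := func() (int64, error) { return 0, errUnimplemented }
	cadvisor := func() int64 { return 8 }
	cpus, source, _ := discoverCPUs(oldRuntime, cadvisor)
	fmt.Println(cpus, source) // an old runtime triggers the cAdvisor fallback
}
```

Note that only the "RPC not supported" case triggers the fallback; a runtime that supports the RPC but returns unusable data leads to a NotReady node, as described in the design details.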
+ +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: KubeletCRIResourceDiscovery + - Components depending on the feature gate: kubelet + +###### Does enabling the feature change any default behavior? + + + +No. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes. + +Disabling the feature may change the node capacity (cpu, memory, hugepages) as +the resources reported by cAdvisor may be different from what the CRI runtime +reports. If the capacity is decreased, existing pods may be evicted from the +node. + +###### What happens if we reenable the feature if it was previously rolled back? + +Similar to disabling the feature (above), reenabling it may cause the node +capacity to change. This, in turn, may cause existing pods to be evicted from +the node. + +###### Are there any tests for feature enablement/disablement? + + + +TBD. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +A rollout could fail e.g. because of a bug in the CRI runtime causing it to +return data that the kubelet cannot consume. In this case, the node will be +set into NotReady state. Existing workloads should not be affected, but new pods +cannot be scheduled on the node. + +There are no identified error paths in rollback (to cAdvisor-based resource discovery). + +###### What specific metrics should inform a rollback? + + + +Nodes getting into NotReady state. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +TBD. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +No. 
+ +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +TBD. The feature applies to an entire node, so essentially all pods running +on the node are under its influence. + +###### How can someone using this feature know that it is working for their instance? + + + +TBD. + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +TBD. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +TBD. + +- [ ] Metrics + - Metric name: `cri_dynamic_runtime_config_responses_total` + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +TBD. + +- `cri_dynamic_runtime_config_responses_total`: indicating "resize" events from the CRI runtime + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +Yes. This feature depends on a version of the CRI runtime that implements the feature. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +If Node Resource Hot Plug ([KEP-3953][kep-3953]) is enabled in tandem, the node +object (status.capacity) is updated if the node resources change. However, +these events should be infrequent and have negligible impact on the API server. + +CRI API: yes. There will be a new gRPC streaming API through which the kubelet +receives resources. + +###### Will enabling / using this feature result in introducing new API types? + + + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No. 
+ +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +No. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +No. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +Not directly. A catastrophic bug in the CRI runtime code that implements the +feature could have unforeseen effects. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +No effect. The feature is node-local, between the kubelet and the CRI runtime. + +###### What are other known failure modes? + + + +TBD. + +###### What steps should be taken if SLOs are not being met to determine the problem? + +TBD. + +## Implementation History + + + +## Drawbacks + + + +This enhancement puts the burden of resource discovery on the CRI runtime +codebase. This will be mitigated by sharing the resource discovery code between +the CRI runtimes. + +## Alternatives + + + +The sections below outline multiple alternative approaches: one outlines a +different approach to the CRI API design, and the others explore alternatives to +the CRI API. The data types inside the API could be virtually unchanged (stick with +the zone-based approach), but the mechanism for delivering the data would be +different. 
+ +### CRI: non-zone-based API + +```proto +message ResourceTopology { + repeated ResourceCpuInfo cpu_info = 1; + repeated ResourceNumaNodeInfo numa_node_info = 2; + ResourceSwapInfo swap_info = 3; +} + +// ResourceCpuInfo provides information about one logical CPU +message ResourceCpuInfo { + int64 id = 1; + int64 core_id = 2; + // cpu_group marks desired cpu grouping between socket (package) and core levels, + // an abstraction for the "uncore cache" level + int64 cpu_group_id = 3; + int64 socket_id = 4; +} + +message ResourceNumaNodeInfo { + int64 id = 1; + // resources is a list of resources that are local to this numa node, + // expected to contain only memory or hugepages, other resources are ignored + repeated ResourceTopologyResourceInfo resources = 2; + // distance is the distance vector, i.e. the distance from this node to all + // other nodes + repeated int64 distance = 3; + // cpu_ids is a list of logical CPU IDs that are local to this numa node + repeated int64 cpu_ids = 4; +} + +message ResourceSwapInfo { + k8s.io.apimachinery.pkg.api.resource.Quantity capacity = 1; +} + +message ResourceTopologyResourceInfo { + string name = 1; + k8s.io.apimachinery.pkg.api.resource.Quantity capacity = 2; +} +``` + +Pros: + +- simpler implementation +- simpler validation + +Cons: + +- less flexible +- new hardware features may require changes to the API + +### Non-CRI: New plugin interface to Kubelet + +One possible solution would be to add a new (gRPC) interface, following the +pattern of the device plugin and podresources APIs. The kubelet would expose a +separate socket for this API, to which the resource discovery plugin could +connect. The API could be virtually identical to the CRI one proposed above. 
+ +Pros: + +- controllers running natively on the host are easily supported + +Cons: + +- yet another socket, gRPC API and plugin type in the kubelet + +### Non-CRI: Config file + +(inspired by @tallclair) + +A separate configuration file on the host filesystem would be used to +communicate the node resources (and topology). The file would be created by the +node provisioner and read by the kubelet at startup. On top of this, the kubelet +could place a watch on the file and update node resources whenever the file is +updated. + +Pros: + +- simple + +Cons: + +- more susceptible to syntax errors, unintended tampering or other corruption +- risk of stale, out-of-date files +- bootstrapping is harder and more prone to problems (somebody needs to create the file) +- for dynamic updates + - care must be taken to prevent race conditions + - gRPC is already used elsewhere, so why use the filesystem here + +### Non-CRI: Kubernetes API object + +(inspired by @tallclair) + +Use the Node object (or some other API resource) to communicate the available +resources to the kubelet. A dynamic controller would update the API object and +the kubelet would respond to that change. + +Pros: + +- the "controller paradigm" + +Cons: + +- Bootstrapping: how to get started when no Node object exists; during cluster + bootstrapping even the apiserver is not yet available +- The controller (possibly running on the host OS/system directly) needs access + to the Kubernetes API +- With a new K8s API resource, increased implementation/maintenance cost and + version skew/upgrade/downgrade checks + + +### Non-CRI: Some combination of the two above (config file and API object) + +(inspired by @tallclair) + +E.g. use a (static) config file for bootstrap and/or startup and then the plugin +interface or the Kubernetes API for updates. 
+ +Pros: + +- bootstrapping is better defined than with the socket interface or the Kubernetes API alone + +Cons: + +- two APIs: "static" and "dynamic" +- the cons from the other alternatives above + +### Non-CRI: Extend cAdvisor + +cAdvisor could be made pluggable (in a similar fashion to the new plugin +interface or config file options for the kubelet above). However, this seems +needlessly complicated, with cAdvisor acting as an unnecessary middleman. +cAdvisor also has a lot of functionality that the kubelet does not need, +suggesting its removal in the future (totally outside of the scope of this +proposal). + + +An extreme hack would be to use cAdvisor (almost) as is, but point it to a fake +sysfs to control the node resources. This feels very complex and error-prone +when compared to the other alternatives. + + +Pros: + +- minimal changes in the kubelet + +Cons: + +- complicated, kludge'ish +- cAdvisor is used as a library; introducing a plugin mechanism inside it + might be a big surprise to other projects that consume cAdvisor. 
+ +## Infrastructure Needed (Optional) + + + + +[cadvisor]: https://github.com/google/cadvisor +[kep-3953]: https://github.com/kubernetes/enhancements/issues/3953 +[linux-cpu-hotplug]: https://www.kernel.org/doc/html/v6.14/core-api/cpu_hotplug.html +[linux-memory-hotplug]: https://docs.kernel.org/admin-guide/mm/memory-hotplug.html +[noderesourcetopology-api]: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api +[nri]: https://github.com/containerd/nri diff --git a/keps/sig-node/5224-node-resource-discovery/kep.yaml b/keps/sig-node/5224-node-resource-discovery/kep.yaml new file mode 100644 index 00000000000..b787ca3e544 --- /dev/null +++ b/keps/sig-node/5224-node-resource-discovery/kep.yaml @@ -0,0 +1,43 @@ +title: Node Resource Discovery +kep-number: 5224 +authors: + - "@marquiz" +owning-sig: sig-node +status: implementable +creation-date: 2025-05-06 +reviewers: + - TBD +approvers: + - TBD + +see-also: + - "/keps/sig-node/3953-node-resource-hot-plug" + +replaces: [] + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gate is added, it should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.34" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.34" + beta: "" + stable: "" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: KubeletCRIResourceDiscovery + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: []