diff --git a/keps/prod-readiness/sig-node/5322.yaml b/keps/prod-readiness/sig-node/5322.yaml
new file mode 100644
index 00000000000..2a00193b513
--- /dev/null
+++ b/keps/prod-readiness/sig-node/5322.yaml
@@ -0,0 +1,3 @@
+kep-number: 5322
+alpha:
+  approver: "@jpbetz"
diff --git a/keps/sig-node/5322-dra-driver-permanent-failure/README.md b/keps/sig-node/5322-dra-driver-permanent-failure/README.md
new file mode 100644
index 00000000000..78baf143ed9
--- /dev/null
+++ b/keps/sig-node/5322-dra-driver-permanent-failure/README.md
@@ -0,0 +1,1070 @@
+# KEP-5322: DRA: Handle permanent driver failures
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Efficiency](#efficiency)
+    - [Visibility of Errors](#visibility-of-errors)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [API](#api)
+  - [Pod Admission Failure](#pod-admission-failure)
+  - [Preventing Repeated Failures](#preventing-repeated-failures)
+  - [Diagnosing Permanent Errors](#diagnosing-permanent-errors)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+For Dynamic Resource Allocation (DRA), the kubelet interfaces via gRPC with a
+separate driver component that is responsible for attaching devices to
+containers based on scheduling outcomes. When a driver indicates to the kubelet
+that a failure occurred during that process, the kubelet tries again later.
+This strategy lets the kubelet ride out transient driver failures, but it is
+wasteful when the error is deterministic for unchanged inputs.
+
+This KEP proposes additions to the gRPC interface between the kubelet and DRA
+drivers that let drivers report permanent failures, and updates to the kubelet
+to stop retrying a DRA driver operation once it has been reported as
+permanently failed.
+
+## Motivation
+
+Several failure modes of the DRA driver's `NodePrepareResources` gRPC method
+cannot be resolved by retrying with the same input, which is how the kubelet
+currently handles all failures:
+
+- The opaque `config` associated with a request in a ResourceClaim may be
+  invalid.
+- A device allocated by the scheduler may have just been found by the driver to
+  be unusable.
+
+Pods with unfulfillable DRA allocations stay stuck in a seemingly healthy
+pending state until someone manually identifies the cause of the lack of
+progress and ultimately deletes and recreates the Pod. In the meantime, the
+kubelet wastes time retrying an operation that is known to fail the same way it
+did before. Surfacing the permanent nature of the error as soon as possible
+enables the quickest path to remediation.
+
+### Goals
+
+- Minimize the amount of unnecessary work done by the kubelet and DRA drivers.
+- Enable workloads to more responsively reschedule Pods in a permanent failure
+  state.
+
+### Non-Goals
+
+- Automatically attempt to mitigate permanent errors, such as by deleting or
+  rescheduling Pods. That should be left to higher-level controllers.
+
+## Proposal
+
+### User Stories
+
+#### Efficiency
+
+As a cluster administrator, I want to minimize the amount of unnecessary work
+done by critical components like the kubelet and DRA drivers to maximize their
+availability for more important work.
+
+#### Visibility of Errors
+
+As a workload administrator, I want to ensure that my workloads start up as
+quickly and reliably as possible by proactively rescheduling Pods whose
+allocated DRA resources cannot be fulfilled.
+
+### Risks and Mitigations
+
+By retrying in all error cases, the current implementation is resilient to
+transient errors. Adding a path that does not retry opens up the possibility
+that a miscategorized error causes a Pod to fail terminally even though a
+subsequent attempt might have allowed the Pod's startup to progress.
+
+## Design Details
+
+Broadly, two changes are required so that permanent errors skip being retried:
+
+1. Allow DRA drivers to annotate errors encountered during the
+   `NodePrepareResources` gRPC call as permanent.
+1. Update the kubelet to stop retrying resource preparation for a Pod once a
+   previous request for that Pod has failed permanently.
+
+### API
+
+The kubelet's DRA gRPC interface will be updated to allow drivers to
+classify errors with a new `error_type` field:
+
+```proto
+message NodePrepareResourceResponse {
+    // These are the additional devices that kubelet must
+    // make available via the container runtime. A claim
+    // may have zero or more requests and each request
+    // may have zero or more devices.
+    repeated Device devices = 1;
+    // If non-empty, preparing the ResourceClaim failed.
+    // Devices are ignored in that case.
+    string error = 2;
+    // When a non-empty error is returned, indicates the
+    // type of error that occurred.
+    NodePrepareResourceResponseErrorType error_type = 3;
+}
+
+enum NodePrepareResourceResponseErrorType {
+    // Unknown errors are unclassified.
+    Unknown = 0;
+    // Permanent errors are expected to fail consistently
+    // for a given request and should not be retried.
+    Permanent = 1;
+}
+```
+
+Permanent errors are those that occur consistently for a given input to the DRA
+driver's `NodePrepareResources` gRPC call. Errors not classified as permanent
+are expected to be transient and to eventually resolve. Permanent errors may be
+caused by faulty, irrecoverable devices or by invalid opaque `config` associated
+with a ResourceClaim's request.
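+
+For illustration, the following sketch shows how a driver might populate the
+per-claim response for a failed preparation attempt, assuming Go bindings
+generated from the proto above. The import path, the generated enum name, and
+the `isInvalidConfig` helper are assumptions made for the sake of the example,
+not part of this proposal:
+
+```go
+import (
+    drapb "k8s.io/kubelet/pkg/apis/dra/v1beta1" // assumed location of the generated bindings
+)
+
+// responseForError builds the per-claim NodePrepareResourceResponse for a
+// preparation attempt that failed with err.
+func responseForError(err error) *drapb.NodePrepareResourceResponse {
+    if isInvalidConfig(err) {
+        // Invalid opaque config fails deterministically for the same
+        // input, so report the error as permanent; the kubelet will
+        // stop retrying and fail the Pod.
+        return &drapb.NodePrepareResourceResponse{
+            Error:     err.Error(),
+            ErrorType: drapb.NodePrepareResourceResponseErrorType_Permanent,
+        }
+    }
+    // Anything else is left unclassified (Unknown, the zero value), and the
+    // kubelet keeps retrying. Drivers should classify conservatively: a
+    // transient error misreported as permanent fails the Pod unnecessarily.
+    return &drapb.NodePrepareResourceResponse{
+        Error: err.Error(),
+    }
+}
+
+// isInvalidConfig is a hypothetical stand-in for a driver's own validation
+// of the opaque config carried in a ResourceClaim's request.
+func isInvalidConfig(err error) bool {
+    return false // driver-specific logic elided
+}
+```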
+
+### Pod Admission Failure
+
+Once the kubelet receives a permanent error from a DRA driver for a given Pod,
+it must give up on that Pod in as final a way as possible. More concretely, the
+kubelet will set the Pod's `status.phase` to `Failed`. The kubelet already
+knows not to continue reconciling `Failed` Pods, and many other Pod controllers
+already handle `Failed` Pods.
+
+In some cases, the kubelet will re-reconcile Pods in the `Failed` phase. DRA
+drivers should not depend on requests that resulted in permanent errors never
+being made again. Similarly, DRA drivers should still expect the kubelet to
+invoke the `NodeUnprepareResources` method even after a prior
+`NodePrepareResources` call returned a permanent error.
+
+### Preventing Repeated Failures
+
+If a higher-level Pod controller replaces a Pod in the `Failed` phase with a
+new Pod because of a permanent failure from a DRA driver, steps should be taken
+to prevent the replacement Pod from being allocated the same device, which is
+expected to fail again in the same way.
+
+When [device taints and tolerations](https://kep.k8s.io/5055) are available, a
+DRA driver should consider adding a taint for the device with the `NoSchedule`
+effect if it determines that other requests for that device are also likely to
+fail. When device taints are not available, DRA drivers may instead remove the
+device from the ResourceSlice to take it out of the pool.
+
+When a permanent error is caused by invalid configuration in the
+ResourceClaim's request and not by a faulty device, drivers should not taint
+the device, since another request for the same device with different parameters
+is expected to succeed.
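+
+As a concrete illustration of the tainting approach, a driver that concludes a
+device is faulty could create a `DeviceTaintRule` using the alpha API from
+KEP-5055. This is a sketch only: the driver name, pool, device, and taint key
+below are placeholders, and the exact types depend on the shape of the KEP-5055
+alpha API:
+
+```go
+import (
+    "context"
+
+    resourcev1alpha3 "k8s.io/api/resource/v1alpha3"
+    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+    "k8s.io/client-go/kubernetes"
+    "k8s.io/utils/ptr"
+)
+
+// taintFaultyDevice publishes a NoSchedule taint for a single device after a
+// permanent, device-related preparation failure so that replacement Pods are
+// not allocated the same device.
+func taintFaultyDevice(ctx context.Context, client kubernetes.Interface, pool, device string) error {
+    rule := &resourcev1alpha3.DeviceTaintRule{
+        ObjectMeta: metav1.ObjectMeta{
+            GenerateName: "dra-example-unhealthy-",
+        },
+        Spec: resourcev1alpha3.DeviceTaintRuleSpec{
+            // Select exactly the failing device in this driver's pool.
+            DeviceSelector: &resourcev1alpha3.DeviceTaintSelector{
+                Driver: ptr.To("dra.example.com"),
+                Pool:   ptr.To(pool),
+                Device: ptr.To(device),
+            },
+            Taint: resourcev1alpha3.DeviceTaint{
+                Key:    "dra.example.com/unhealthy",
+                Effect: resourcev1alpha3.DeviceTaintEffectNoSchedule,
+            },
+        },
+    }
+    _, err := client.ResourceV1alpha3().DeviceTaintRules().Create(ctx, rule, metav1.CreateOptions{})
+    return err
+}
+```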
+
+### Diagnosing Permanent Errors
+
+The kubelet already logs errors returned in DRA drivers'
+`NodePrepareResources` responses and generates an Event referring to the
+affected Pods. These messages will be updated to indicate when the error is
+permanent.
+
+### Test Plan
+
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- `k8s.io/kubernetes/pkg/kubelet`: `2025-09-29` - `72.6%`
+- `k8s.io/kubernetes/pkg/kubelet/cm/dra`: `2025-09-29` - `81.7%`
+- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `2025-09-29` - `69.5%`
+
+##### Integration tests
+
+These changes are focused on interactions between third-party DRA drivers and
+the kubelet. Since integration tests generally focus on interactions with etcd
+and the API server, integration tests for this feature will not be added.
+
+##### e2e tests
+
+E2E tests will be added to verify that Pods referencing ResourceClaims that
+receive permanent errors from DRA drivers transition to the `Failed`
+`status.phase`.
+
+Likewise, Pods referencing ResourceClaims that cause drivers to fail
+transiently should remain in the `Pending` phase until the transient error is
+overcome and the Pod reaches the `Running` phase.
+
+Other potential E2E test scenarios include:
+- Ensuring a Pod referencing a ResourceClaim that encountered a permanent error
+  can be deleted when different DRA driver instances handle the
+  `NodePrepareResources` and `NodeUnprepareResources` calls.
+- Ensuring that Pods experiencing permanent DRA errors do not interfere with
+  the cleanup of CSI resources.
+- Ensuring the kubelet leaks no resources when many Pods encounter permanent
+  DRA errors.
+
+### Graduation Criteria
+
+#### Alpha
+
+- Feature implemented behind a feature flag
+- Initial e2e tests completed and enabled
+
+#### Beta
+
+- Gather feedback from developers and surveys
+- Additional tests are in Testgrid and linked in KEP
+
+#### GA
+
+- 3 examples of real-world usage
+- Allowing time for feedback
+- All issues and gaps identified as feedback during beta are resolved
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+When the kubelet is a newer version that implements this KEP and a DRA driver
+is an older version whose gRPC interface does not yet include the
+`error_type` field, the kubelet will treat all errors in the driver's
+`NodePrepareResources` responses as transient. This matches the behavior of a
+kubelet that does not implement this KEP.
+
+When the inverse is true and a DRA driver responds to `NodePrepareResources`
+with a permanent error to a kubelet that does not implement this KEP, the
+error will be treated as transient. DRA drivers should therefore expect that
+identical requests may be made even after a permanent error was returned in a
+prior response.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: DRAHandlePermanentDriverFailures
+  - Components depending on the feature gate:
+    - kubelet
+
+###### Does enabling the feature change any default behavior?
+
+No
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes, setting the feature gate to `false` will immediately cause all errors from
+DRA drivers to be handled by the kubelet the same way as they were previously.
+DRA drivers should not need to be aware of whether the kubelet implements or
+enables this feature in order to behave correctly.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+When the feature is reenabled, new errors that are classified as permanent will
+be handled the same way as if the feature had been enabled all along. This
+feature does not require any persisted state.
+
+###### Are there any tests for feature enablement/disablement?
+
+This will be covered with unit tests for the kubelet.
+
+### Rollout, Upgrade and Rollback Planning
+
+Will be considered for beta.
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+Will be considered for beta.
+
+###### What specific metrics should inform a rollback?
+
+Will be considered for beta.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+Will be considered for beta.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+Will be considered for beta.
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+Will be considered for beta.
+
+###### How can someone using this feature know that it is working for their instance?
+
+Will be considered for beta.
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+Will be considered for beta.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+Will be considered for beta.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+Will be considered for beta.
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+Will be considered for beta.
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+Depending on the exact implementation, permanent errors will result in at most
+one extra API call per Pod to PATCH pods/status.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No. Startup latency is not affected because Pods affected by this feature never
+finish starting up.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+No
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+Will be considered for beta.
+
+###### What are other known failure modes?
+
+Will be considered for beta.
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+Will be considered for beta.
+
+## Implementation History
+
+- 1.35: Initial proposal and implementation
+
+## Drawbacks
+
+While this feature makes it more convenient to determine when a Pod needs to be
+rescheduled, it does not unlock any fundamentally new use cases. It is already
+possible for a higher-level controller to determine that a Pod is taking too
+long to start up and act accordingly. The complexity of this change might
+outweigh the benefits.
+
+## Alternatives
+
diff --git a/keps/sig-node/5322-dra-driver-permanent-failure/kep.yaml b/keps/sig-node/5322-dra-driver-permanent-failure/kep.yaml
new file mode 100644
index 00000000000..466a00d3b59
--- /dev/null
+++ b/keps/sig-node/5322-dra-driver-permanent-failure/kep.yaml
@@ -0,0 +1,35 @@
+title: "DRA: Handle permanent driver failures"
+kep-number: 5322
+authors:
+  - "@nojnhuh"
+owning-sig: sig-node
+participating-sigs: []
+status: implementable
+creation-date: 2025-09-19
+reviewers:
+  - "@pohly"
+  - "@SergeyKanzhelev"
+approvers:
+  - "@mrunalp"
+
+see-also:
+  - "/keps/sig-scheduling/5055-dra-device-taints-and-tolerations"
+replaces: []
+
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.35"
+
+milestone:
+  alpha: "v1.35"
+
+feature-gates:
+  - name: DRAHandlePermanentDriverFailures
+    components:
+      - kubelet
+disable-supported: true
+
+metrics: []