diff --git a/keps/sig-network/4974-deprecate-endpoints/README.md b/keps/sig-network/4974-deprecate-endpoints/README.md
new file mode 100644
index 00000000000..56a4b783529
--- /dev/null
+++ b/keps/sig-network/4974-deprecate-endpoints/README.md
@@ -0,0 +1,532 @@
+# KEP-4974: Deprecate v1.Endpoints
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [API Server Warnings](#api-server-warnings)
+  - [Endpoints Cleanup?](#endpoints-cleanup)
+  - [E2E Test Updates](#e2e-test-updates)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+
+
+## Release Signoff Checklist
+
+
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+The `v1.Endpoints` API has been essentially deprecated since
+EndpointSlices became GA in 1.21. Several new Service features (such
+as dual-stack and topology, not to mention "services with more than
+1000 endpoints") are implemented only for EndpointSlice, not for
+Endpoints.
+Kube-proxy no longer uses Endpoints ever, for anything, and
+the Gateway API conformance tests also require implementations to use
+EndpointSlices.
+
+Despite this, kube-controller-manager still does all of the work of
+managing Endpoints objects for all Services, and a cluster cannot pass
+the conformance test suite unless the Endpoints and EndpointSlice
+Mirroring controllers are running, even though in many cases nothing
+will ever look at the output of the Endpoints controller (and the
+EndpointSlice Mirroring controller will never output anything).
+
+While Kubernetes's API guarantees make it essentially impossible to
+ever actually fully remove Endpoints, we should at least move toward a
+world where most users run Kubernetes with the Endpoints and
+EndpointSlice Mirroring controllers disabled.
+
+## Motivation
+
+### Goals
+
+- Ensure that all documentation reflects the fact that the Endpoints
+  API is deprecated and should not be used by new code, and that
+  Endpoints objects might not be created or mirrored in some clusters.
+
+- Add apiserver warnings when users (other than the Endpoints
+  controller) create Endpoints objects.
+
+- Update the e2e test suite to ensure that all tests that care about
+  endpoints only as a generic concept are using EndpointSlices rather
+  than Endpoints. (For example, the utility function
+  `framework.WaitForServiceEndpointsNum` should check that
+  EndpointSlices exist, not that Endpoints exist).
+
+- Adjust the conformance test suite to not require the Endpoints and
+  EndpointSlice Mirroring controllers to be running:
+
+  - Fix tests such as `Service endpoints latency should not be very
+    high` to check the latency of EndpointSlices rather than
+    Endpoints, since that is what is actually important in
+    real-world usage.
+
+  - Demote tests such as `EndpointSliceMirroring should mirror a
+    custom Endpoints resource through create update and delete` to
+    non-conformance.
+
+  - Split up tests such as `EndpointSlice should create Endpoints
+    and EndpointSlices for Pods matching a Service` into a
+    conformance test for EndpointSlice and a non-conformance test
+    for Endpoints.
+
+- Update the e2e test suite so that all remaining tests that depend on
+  the outputs of the Endpoints or EndpointSlice Mirroring controllers
+  are feature-tagged so they can be skipped in clusters that do not
+  run those controllers, and set up a periodic e2e job running in such
+  a configuration.
+
+- Explicitly document that disabling `endpoints-controller` and/or
+  `endpointslice-mirroring-controller` via kube-controller-manager's
+  `--controllers` flag is a supported and conforming configuration.
+
+  - Figure out what to do with stale Endpoints in such a cluster.
+
+```
+<<[UNRESOLVED kubernetes.default ]>>
+
+- MAYBE change kube-apiserver to optionally not generate Endpoints for
+  kubernetes.default, though this would require adding a
+  kube-apiserver configuration option, and the benefit from removing
+  just that 1 Endpoints object is small. Perhaps instead it could just
+  add an annotation to the object pointing out the fact that Endpoints
+  is deprecated.
+
+<<[/UNRESOLVED]>>
+```
+
+```
+<<[UNRESOLVED disable-by-default ]>>
+
+- MAYBE eventually move `endpoints-controller` and
+  `endpointslice-mirroring-controller` out of the "on-by-default"
+  controller list so that they become opt-in rather than opt-out.
+
+- MAYBE eventually switch our e2e test jobs to disable those
+  controllers by default, and only run a single periodic/optional job
+  to run the Endpoints-specific tests.
+
+<<[/UNRESOLVED]>>
+```
+
+### Non-Goals
+
+- Deleting or modifying the `v1.Endpoints` API.
+
+## Proposal
+
+Overall, this KEP is mostly just about documentation and tests; it is
+already possible to disable the Endpoints and EndpointSlice Mirroring
+controllers via kube-controller-manager's `--controllers` option, and
+we believe that this will have no ill effects in a vanilla Kubernetes
+cluster (though, currently, it will cause the e2e tests to fail).
+
+### API Server Warnings
+
+Whenever any Endpoints object is created or updated (except by the
+Endpoints controller), the apiserver will return the warning:
+
+```
+The Endpoints API is deprecated, and all users should use the EndpointSlice API instead.
+```
+
+### Endpoints Cleanup?
+
+We do not want to leave stale `Endpoints` objects around forever if
+the associated controllers are not running. (This is both a waste of
+disk space and a potential source of confusion since the Endpoints
+objects would quickly become out-of-date and incorrect.)
+
+One possibility would be to just document that administrators should
+delete all existing Endpoints themselves if they are going to disable
+the controllers.
+
+Another possibility would be to have kube-controller-manager do this
+automatically. (Though the idea of having it automatically take action
+because a controller _isn't_ enabled seems slightly magical? But
+perhaps there is precedent somewhere else?)
+
+Another possibility would be to create an `endpoints-cleanup`
+controller, that could be enabled explicitly, and document that admins
+should (probably) enable that controller if they disable the others.
+
+Or perhaps the EndpointSlice controller could delete Endpoints objects
+that were more than 24 hours out of date with respect to their
+EndpointSlices?
+
+In all cases, we should probably not automatically delete Endpoints
+that were not originally created by the Endpoints controller.
+
+### E2E Test Updates
+
+There are a surprising number of e2e tests that still make use of
+Endpoints, mostly because there was never any active effort to port
+old tests away. These will need to be updated. See the [e2e
+tests](#e2e-tests) section for more details.
+
+### Risks and Mitigations
+
+Obviously if a cluster contains components that read Endpoints
+objects, then disabling the Endpoints controllers would break those
+components. Given that the `v1.Endpoints` type would still exist in
+these clusters, the failure mode would not be that the component
+fails outright with errors; it would just think that the Endpoints it
+was looking for didn't exist ("yet"?).
+
+We would need to mitigate this by helping users to figure out if
+anything in their cluster depends on Endpoints.
+
+## Design Details
+
+### Test Plan
+
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+N/A
+
+##### Unit tests
+
+We will have unit tests for the API warning code.
+
+##### Integration tests
+
+TBD
+
+##### e2e tests
+
+We will want to add a new periodic e2e job that confirms that the e2e
+suite passes in a cluster with the Endpoints controllers disabled.
+This will require also adding a Feature tag to allow skipping the
+Endpoints-specific tests.
+
+There are quite a few places in the e2e tests that currently use
+`v1.Endpoints`:
+
+  - Many of the tests in `test/e2e/network/service.go` do various
+    checks on both Endpoints and EndpointSlices. These will need to be
+    split into separate tests, with the Endpoints tests
+    feature-tagged.
+
+  - Some of the tests in `test/e2e/network/endpointslice.go` will need
+    to be split up, to test "EndpointSlices are created correctly" and
+    "EndpointSlices match Endpoints" in separate tests, with the
+    Endpoints tests feature-tagged.
+
+  - The conformance tests in
+    `test/e2e/network/endpointslicemirroring.go` should be demoted
+    from conformance, and all of the tests should be feature-tagged,
+    but should otherwise be unchanged.
+
+  - Several tests in `test/e2e/network/dual_stack.go` check that the
+    Endpoints controller does the right thing but _do not_ check that
+    the EndpointSlice controller does the right thing (which means
+    that we do not actually have any proper e2e testing of dual-stack
+    Services). These should be updated to test EndpointSlices, with
+    the Endpoints tests split out into separate feature-tagged tests.
+
+  - `[It] [sig-network] Services should test the lifecycle of an
+    Endpoint [Conformance]`: This just tests that the API works and
+    does not test the behavior of the controllers, so it just needs to
+    be feature-tagged, but not changed.
+
+  - `[It] [sig-network] Service endpoints latency should not be very
+    high [Conformance]`: This tests the latency of the Endpoints
+    controller, and should be fixed to test the latency of the
+    EndpointSlice controller instead, since the latency of the
+    Endpoints controller has no impact on the functioning of a
+    cluster.
+
+### Graduation Criteria
+
+#### Alpha
+
+- All Endpoints and EndpointSlice Mirroring tests have been demoted
+  from conformance, and we document this (e.g. via a blog post).
+
+- All official Kubernetes documentation is updated to primarily
+  discuss EndpointSlices, and to mention Endpoints only as a
+  deprecated API. No examples use `kubectl get endpoints` or involve
+  creating an Endpoints object.
+
+#### Beta
+
+- (There is no requirement for any particular amount of time to have
+  passed between Alpha and Beta.)
+
+- We have a passing "no Endpoints" e2e job.
+
+- Warnings are emitted when users create Endpoints.
+
+- If we are going to add metrics or other "Am I using Endpoints?"
+  diagnostics, then they are added and have tests.
+
+- The documentation explains how to disable the Endpoints and
+  EndpointSlice Mirroring controllers, but warns that third-party
+  components may still depend on them.
+
+#### GA
+
+- Some amount of time has passed since Beta.
+
+- The documentation becomes slightly more bullish on the idea of
+  disabling the controllers.
+
+- (If we decided that we were going to eventually move the controllers
+  from the "on-by-default" group to the "off-by-default" group, then
+  we might decide to not call this KEP GA until that happens, which
+  could be quite a few releases away.)
+
+### Upgrade / Downgrade Strategy
+
+The KEP does not propose any automatic change to behavior; behavior
+would only be changed when the administrator chose to disable the
+controllers, which could happen at any time.
+
+### Version Skew Strategy
+
+There are no skew issues because there are no new or modified APIs,
+and because no (recent) versions of Kubernetes ever look at Endpoints.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+The KEP does not define a new feature, it just proposes that we
+document that users are allowed to disable the Endpoints and
+EndpointSlice Mirroring controllers.
+
+###### Does enabling the feature change any default behavior?
+
+The KEP does not define a new feature that can be enabled.
+
+(Obviously disabling the endpoints controllers changes default
+behavior, but this would be because the administrator chose to do
+that, not something that would happen automatically.)
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+The KEP does not define a new feature that can be enabled/disabled.
+
+(If an administrator chooses to disable the endpoints controllers, and
+then decides this was a bad idea, they can re-enable them, and even if
+something had previously deleted all of the autogenerated Endpoints
+objects, re-enabling the endpoints controller would regenerate them.)
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+N/A
+
+###### Are there any tests for feature enablement/disablement?
+
+N/A: The KEP does not define a new feature that can be enabled/disabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+Disabling the endpoints controllers in a cluster where some
+third-party components depend on them could have arbitrarily bad
+effects.
+
+###### What specific metrics should inform a rollback?
+
+There are currently none. There could perhaps be metrics counting
+Gets, Lists, and Watches of v1.Endpoints, which might suggest the
+existence of components that had not been updated to use
+EndpointSlices.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+N/A
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+N/A
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+The feature would be enabled by the operator.
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [X] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+There are no specific SLOs, but kube-controller-manager,
+kube-apiserver, and etcd should use less CPU, and etcd should use less
+disk space, if the controllers are disabled.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+N/A
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+(See above about metrics informing a rollback.)
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+No
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No; the goal of this KEP is to drastically reduce the overall size of
+the API database.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+No; the goal of this KEP is to drastically reduce the overall size of
+the API database, and to somewhat reduce the CPU usage of
+kube-controller-manager, kube-apiserver, and etcd.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+N/A
+
+###### What are other known failure modes?
+
+None
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+N/A
+
+## Implementation History
+
+- Initial proposal: 2024-11-21
diff --git a/keps/sig-network/4974-deprecate-endpoints/kep.yaml b/keps/sig-network/4974-deprecate-endpoints/kep.yaml
new file mode 100644
index 00000000000..6635f4964c5
--- /dev/null
+++ b/keps/sig-network/4974-deprecate-endpoints/kep.yaml
@@ -0,0 +1,37 @@
+title: Deprecate v1.Endpoints
+kep-number: 4974
+authors:
+  - "@danwinship"
+owning-sig: sig-network
+participating-sigs:
+status: provisional
+creation-date: 2024-11-21
+reviewers:
+  - "@aojea"
+  - "@robscott"
+approvers:
+  - "@aojea"
+  - "@thockin"
+see-also:
+  - "/keps/sig-network/0752-endpointslices"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.33"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.33"
+  beta: "v1.35"
+  stable: "v1.37"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+
+# The following PRR answers are required at beta release
+metrics: