diff --git a/keps/prod-readiness/sig-auth/740.yaml b/keps/prod-readiness/sig-auth/740.yaml
index ab1ee0b73b9..a661ea1a268 100644
--- a/keps/prod-readiness/sig-auth/740.yaml
+++ b/keps/prod-readiness/sig-auth/740.yaml
@@ -1,3 +1,5 @@
 kep-number: 740
 alpha:
-  approver: "@soltysh"
\ No newline at end of file
+  approver: "@soltysh"
+beta:
+  approver: "@soltysh"
diff --git a/keps/sig-auth/740-service-account-external-signing/README.md b/keps/sig-auth/740-service-account-external-signing/README.md
index c72ec95670e..ba18e6a49a0 100644
--- a/keps/sig-auth/740-service-account-external-signing/README.md
+++ b/keps/sig-auth/740-service-account-external-signing/README.md
@@ -24,7 +24,6 @@
       - [Prerequisite testing updates](#prerequisite-testing-updates)
       - [Unit tests](#unit-tests)
       - [Integration tests](#integration-tests)
-      - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
     - [Beta](#beta)
@@ -50,20 +49,20 @@ Items marked with (R) are required *prior to targeting to a milestone / release*.
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-  - [ ] e2e Tests for all Beta API Operations (endpoints)
-  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] ~~e2e Tests for all Beta API Operations (endpoints)~~ no API endpoints
+  - [ ] ~~(R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)~~ no API endpoints
+  - [ ] ~~(R) Minimum Two Week Window for GA e2e tests to prove flake free~~ no API endpoints
+- [x] (R) Graduation criteria is in place
+  - [ ] ~~(R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)~~ no API endpoints
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -142,9 +141,9 @@ A new versioned grpc API (ExternalJWTSigner) will be created under `k8s.io/kuber
 #### Support for Legacy Tokens
 Implementers will have following options for legacy token support:
-1. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
-2. Turn off the loop (don't support legacy tokens) if external signing is enabled.
-3. Create a custom external signer for legacy tokens using Controller loop from staging repo (This option will only be available if demanded by Community as part of feedback for Beta graduation).
+1. Turn off the loop (don't support legacy tokens) if external signing is enabled. (recommended to avoid non-expiring tokens)
+2. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
+3. Turn off the loop in kube-controller-manager and create a custom external signer for legacy tokens that obtains them via the external signer.
 ### Risks and Mitigations
@@ -280,13 +279,12 @@ to implement this enhancement.
 ##### Integration tests
 - Create a cluster with ExternalJWTSigner to configure an external signer and verify TokenRequest and TokenReview APIs work properly.
-
-##### e2e tests
-
-- Create a cluster with ExternalJWTSigner configured.
 - Request a token for a service account principal.
 - Use a token as bearer for making requests to kube-apiserver and ensure it succeeds.
+- [TestExternalJWTSigningAndAuth](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L46C6-L46C35): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
+- [TestDelayedStartForSigner](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L282): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
+
 ### Graduation Criteria
 #### Alpha
@@ -296,13 +294,15 @@ to implement this enhancement.
 #### Beta
-- E2E tests are completed.
-- We have at least one ExternalSigner implementation working with this change.
+- All tests are completed.
+- We have at least one ExternalSigner integration working with this change.
+  - GKE integration is complete
 - Decide whether to externalize legacy token controller code in a staging repo. Check [Support for Legacy Tokens](#support-for-legacy-tokens) for details.
+  - Decided not to externalize legacy token controller code
 #### GA
-- More than one ExternalSigner implementations are completed.
+- More than one ExternalSigner integration are completed.
 - Feature is tuned with feedback from distributions.
 ### Upgrade/Downgrade Strategy
@@ -425,10 +425,13 @@ No.
 The Feature would not be used by workload directly but will be used by kube-apiserver.
-The usage should be visible to the operator using Audit logs.
-
+The usage should be visible to the operator via these metrics:
+
+- apiserver_externaljwt_fetch_keys_data_timestamp
+- apiserver_externaljwt_fetch_keys_request_total
+- apiserver_externaljwt_fetch_keys_success_timestamp
+- apiserver_externaljwt_request_duration_seconds
+- apiserver_externaljwt_sign_request_total
 ###### How can someone using this feature know that it is working for their instance?
@@ -440,16 +443,25 @@ The usage should be visible to the operator using Audit logs.
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
-
+It is expected that external `Sign` request durations will be dominated by the external signer implementation.
+Instrumenting processing time of your external signer implementation is recommended.
+
+Experimentally, the gRPC overhead adds about 1ms to a TokenRequest, comparing the in-tree kube-apiserver
+service account token signer with a stub external signer still doing local signing.
+
+The `apiserver_externaljwt_request_duration_seconds{method=Sign,code=OK}` metrics
+are expected to be within 1-10ms of the external signer processing time.
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 - [x] Metrics
   - Metric name: `apiserver_request_total` and `apiserver_request_duration_seconds`
-    - Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
-    - Components exposing the metric: kube-apiserver
+    - Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
+    - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_externaljwt_sign_request_total`
+    - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_externaljwt_request_duration_seconds`
+    - Components exposing the metric: kube-apiserver
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -468,7 +480,6 @@ Needs Benchmarking on SLIs.
 - It is the integrator's responsibility to ensure that their ExternalJWTSigner implementation support signing tokens with 1 year validity i.e. if their clusters are relying on extended token lifetimes.
 - integrators can observe the `serviceaccount_stale_tokens_total` metric to confirm their cluster's reliance on `--service-account-extend-token-expiration`.
-
 ### Dependencies
 One new dependency will be introduced and it will only be required for clusters configured/opted-in via the `--service-account-signing-endpoint` flag.
@@ -550,16 +561,38 @@ not likely.
 ### Troubleshooting
-
+Symptom: kube-apiserver will not start with `--service-account-signing-endpoint` set
+
+- check the kube-apiserver log for details about why startup failed
+- ensure the socket `--service-account-signing-endpoint` points to is valid,
+  the kube-apiserver user has permissions to access it, and the external signer is running
+- ensure `--service-account-signing-key-file` and `--service-account-key-file` are not also set
+- ensure the external signer supports the version of the externaljwt gRPC API kube-apiserver is using
+- ensure the maximum supported token lifetime returned by the external signer does not conflict with any
+  `--service-account-max-token-expiration` flag (the flag may not be longer than the max expiration supported by the external signer)
+
+Symptom: token creation fails with `500` errors
+
+- check `apiserver_externaljwt_sign_request_total` metrics for codes other than `OK` to determine if signing failures are the cause
+- if signing requests are failing with `CANCELLED` or `DEADLINE_EXCEEDED` codes,
+  check `apiserver_externaljwt_request_duration_seconds` metrics for timing distribution
+  of external signing requests with `method=Sign`. If external signing is causing request timeouts,
+  investigate improving the performance of your external signer integration.
+- check the kube-apiserver log for details about other signing failures
+
+Symptom: token use fails with authentication errors
+
+- check the `apiserver_externaljwt_fetch_keys_request_total` metrics for codes other than `OK`
+  to determine if verifying keys are failing to be fetched
+- check the `apiserver_externaljwt_fetch_keys_success_timestamp` metric to determine the
+  last time public keys were successfully refreshed. If this exceeds the expected `refresh_hint_seconds`
+  value for your particular external signer integration, check `kube-apiserver` logs for details on why
+  the public key fetch is failing.
+- check the `apiserver_externaljwt_fetch_keys_data_timestamp` metric to determine the `data_timestamp`
+  reported by the external signer in the last successful fetch of public keys. Compare to the expected
+  value for your particular external signer integration to determine if `kube-apiserver` is using current
+  public keys. If this does not match, check your external signer for details on why it is not returning
+  the expected public keys to the `FetchKeys` method.
 ###### How does this feature react if the API server and/or etcd is unavailable?
@@ -567,18 +600,7 @@ feature is only accessible via kube-apiserver. JWT signing and authentication wi
 ###### What are other known failure modes?
-
+Covered above in the troubleshooting section.
 ###### What steps should be taken if SLOs are not being met to determine the problem?
@@ -590,6 +612,10 @@ Initial PRs:
 - kubernetes/kubernetes#73110
 - kubernetes/kubernetes#125177
+1.32: Alpha release
+
+1.34: Beta release
+
 ## Drawbacks
 Enabling the feature puts a remote service in the critical path of kube-apiserver. Thus, it can easily cause an outage. However, we have some relief in that it is an opt-in/configurable feature.
diff --git a/keps/sig-auth/740-service-account-external-signing/kep.yaml b/keps/sig-auth/740-service-account-external-signing/kep.yaml
index 87f6d013136..ca070a52198 100644
--- a/keps/sig-auth/740-service-account-external-signing/kep.yaml
+++ b/keps/sig-auth/740-service-account-external-signing/kep.yaml
@@ -17,10 +17,11 @@ approvers:
 stage: alpha
-latest-milestone: "v1.33"
+latest-milestone: "v1.34"
 milestone:
   alpha: "v1.32"
+  beta: "v1.34"
 feature-gates:
   - name: ExternalServiceAccountTokenSigner
@@ -34,4 +35,21 @@ metrics:
   - serviceaccount_valid_tokens_total
   - apiserver_request_duration_seconds
   - serviceaccount_stale_tokens_total
-
+  # Unix Timestamp in seconds of the last successful FetchKeys data_timestamp value returned by the external signer
+  # Type: Gauge
+  - apiserver_externaljwt_fetch_keys_data_timestamp
+  # Total attempts at syncing supported JWKs
+  # Type: Counter
+  # Labels:code
+  - apiserver_externaljwt_fetch_keys_request_total
+  # Unix Timestamp in seconds of the last successful FetchKeys request
+  # Type: Gauge
+  - apiserver_externaljwt_fetch_keys_success_timestamp
+  # Request duration and time for calls to external-jwt-signer
+  # Type: Histogram
+  # Labels:code,method
+  - apiserver_externaljwt_request_duration_seconds
+  # Total attempts at signing JWT
+  # Type: Counter
+  # Labels:code
+  - apiserver_externaljwt_sign_request_total