Commit 71af620

KEP-740: promote external token signing to beta
1 parent c1a01d2 commit 71af620

3 files changed

+105
-59
lines changed

keps/prod-readiness/sig-auth/740.yaml

Lines changed: 3 additions & 1 deletion
@@ -1,3 +1,5 @@
11
kep-number: 740
22
alpha:
3-
approver: "@soltysh"
3+
approver: "@soltysh"
4+
beta:
5+
approver: "@soltysh"

keps/sig-auth/740-service-account-external-signing/README.md

Lines changed: 82 additions & 56 deletions
@@ -24,7 +24,6 @@
2424
- [Prerequisite testing updates](#prerequisite-testing-updates)
2525
- [Unit tests](#unit-tests)
2626
- [Integration tests](#integration-tests)
27-
- [e2e tests](#e2e-tests)
2827
- [Graduation Criteria](#graduation-criteria)
2928
- [Alpha](#alpha)
3029
- [Beta](#beta)
@@ -50,20 +49,20 @@
5049

5150
Items marked with (R) are required *prior to targeting to a milestone / release*.
5251

53-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
54-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
52+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
53+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
5554
- [x] (R) Design details are appropriately documented
56-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
57-
- [ ] e2e Tests for all Beta API Operations (endpoints)
58-
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
59-
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
60-
- [ ] (R) Graduation criteria is in place
61-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
62-
- [ ] (R) Production readiness review completed
63-
- [ ] (R) Production readiness review approved
64-
- [ ] "Implementation History" section is up-to-date for milestone
65-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
66-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
55+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
56+
- [ ] ~~e2e Tests for all Beta API Operations (endpoints)~~ no API endpoints
57+
- [ ] ~~(R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)~~ no API endpoints
58+
- [ ] ~~(R) Minimum Two Week Window for GA e2e tests to prove flake free~~ no API endpoints
59+
- [x] (R) Graduation criteria is in place
60+
- [ ] ~~(R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)~~ no API endpoints
61+
- [x] (R) Production readiness review completed
62+
- [x] (R) Production readiness review approved
63+
- [x] "Implementation History" section is up-to-date for milestone
64+
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
65+
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
6766

6867
[kubernetes.io]: https://kubernetes.io/
6968
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -142,9 +141,9 @@ A new versioned grpc API (ExternalJWTSigner) will be created under `k8s.io/kuber
142141
#### Support for Legacy Tokens
143142

144143
Implementers will have the following options for legacy token support:
145-
1. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
146-
2. Turn off the loop (don't support legacy tokens) if external signing is enabled.
147-
3. Create a custom external signer for legacy tokens using Controller loop from staging repo (This option will only be available if demanded by Community as part of feedback for Beta graduation).
144+
1. Turn off the loop (don't support legacy tokens) if external signing is enabled. (recommended to avoid non-expiring tokens)
145+
2. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
146+
3. Turn off the loop in kube-controller-manager and create a custom external signer for legacy tokens that obtains them via the external signer.
148147

149148
### Risks and Mitigations
150149

@@ -280,13 +279,12 @@ to implement this enhancement.
280279
##### Integration tests
281280

282281
- Create a cluster with ExternalJWTSigner to configure an external signer and verify TokenRequest and TokenReview APIs work properly.
283-
284-
##### e2e tests
285-
286-
- Create a cluster with ExternalJWTSigner configured.
287282
- Request a token for a service account principal.
288283
- Use a token as bearer for making requests to kube-apiserver and ensure it succeeds.
289284

285+
- [TestExternalJWTSigningAndAuth](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L46C6-L46C35): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
286+
- [TestDelayedStartForSigner](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L282): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
287+
290288
### Graduation Criteria
291289

292290
#### Alpha
@@ -296,13 +294,15 @@ to implement this enhancement.
296294

297295
#### Beta
298296

299-
- E2E tests are completed.
300-
- We have at least one ExternalSigner implementation working with this change.
297+
- All tests are completed.
298+
- We have at least one ExternalSigner integration working with this change.
299+
- GKE integration is complete
301300
- Decide whether to externalize legacy token controller code in a staging repo. Check [Support for Legacy Tokens](#support-for-legacy-tokens) for details.
301+
- Decided not to externalize legacy token controller code
302302

303303
#### GA
304304

305-
- More than one ExternalSigner implementations are completed.
305+
- More than one ExternalSigner integration is completed.
306306
- Feature is tuned with feedback from distributions.
307307

308308
### Upgrade/Downgrade Strategy
@@ -425,10 +425,13 @@ No.
425425
426426
The Feature would not be used by workload directly but will be used by kube-apiserver.
427427
428-
The usage should be visible to the operator using Audit logs.
429-
<!-- TODO
430-
Add details on increasing audit log surface area for External signers
431-
-->
428+
The usage should be visible to the operator via these metrics:
429+
430+
- apiserver_externaljwt_fetch_keys_data_timestamp
431+
- apiserver_externaljwt_fetch_keys_request_total
432+
- apiserver_externaljwt_fetch_keys_success_timestamp
433+
- apiserver_externaljwt_request_duration_seconds
434+
- apiserver_externaljwt_sign_request_total
432435
433436
###### How can someone using this feature know that it is working for their instance?
434437
@@ -440,16 +443,25 @@ The usage should be visible to the operator using Audit logs.
440443
441444
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
442445
443-
<!-- TODO
444-
Needs Benchmarking on SLIs.
445-
-->
446+
It is expected that external `Sign` request durations will be dominated by the external signer implementation.
447+
Instrumenting processing time of your external signer implementation is recommended.
448+
449+
Experimentally, the gRPC overhead adds about 1ms to a TokenRequest, comparing the in-tree kube-apiserver
450+
service account token signer with a stub external signer still doing local signing.
451+
452+
The `apiserver_externaljwt_request_duration_seconds{method=Sign,code=OK}` metrics
453+
are expected to be within 1-10ms of the external signer processing time.
446454
447455
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
448456
449457
- [x] Metrics
450458
- Metric name: `apiserver_request_total` and `apiserver_request_duration_seconds`
451-
- Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
452-
- Components exposing the metric: kube-apiserver
459+
- Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
460+
- Components exposing the metric: kube-apiserver
461+
- Metric name: `apiserver_externaljwt_sign_request_total`
462+
- Components exposing the metric: kube-apiserver
463+
- Metric name: `apiserver_externaljwt_request_duration_seconds`
464+
- Components exposing the metric: kube-apiserver
453465
454466
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
455467
@@ -468,7 +480,6 @@ Needs Benchmarking on SLIs.
468480
- It is the integrator's responsibility to ensure that their ExternalJWTSigner implementation supports signing tokens with 1-year validity if their clusters rely on extended token lifetimes.
469481
- integrators can observe the `serviceaccount_stale_tokens_total` metric to confirm their cluster's reliance on `--service-account-extend-token-expiration`.
470482
471-
472483
### Dependencies
473484
474485
One new dependency will be introduced and it will only be required for clusters configured/opted-in via the `--service-account-signing-endpoint` flag.
@@ -550,35 +561,46 @@ not likely.
550561
551562
### Troubleshooting
552563
553-
<!-- TODO
554-
This section must be completed when targeting beta to a release.
555-
556-
For GA, this section is required: approvers should be able to confirm the
557-
previous answers based on experience in the field.
558-
559-
The Troubleshooting section currently serves the `Playbook` role. We may consider
560-
splitting it into a dedicated `Playbook` document (potentially with some monitoring
561-
details). For now, we leave it here.
562-
-->
564+
Symptom: kube-apiserver will not start with `--service-account-signing-endpoint` set
565+
566+
- check the kube-apiserver log for details about why startup failed
567+
- ensure the socket `--service-account-signing-endpoint` points to is valid,
568+
the kube-apiserver user has permissions to access it, and the external signer is running
569+
- ensure `--service-account-signing-key-file` and `--service-account-key-file` are not also set
570+
- ensure the external signer supports the version of the externaljwt gRPC API kube-apiserver is using
571+
- ensure the maximum supported token lifetime returned by the external signer does not conflict with any
572+
`--service-account-max-token-expiration` flag (the flag may not be longer than the max expiration supported by the external signer)
573+
574+
Symptom: token creation fails with `500` errors
575+
576+
- check `apiserver_externaljwt_sign_request_total` metrics for codes other than `OK` to determine if signing failures are the cause
577+
- if signing requests are failing with `CANCELLED` or `DEADLINE_EXCEEDED` codes,
578+
check `apiserver_externaljwt_request_duration_seconds` metrics for timing distribution
579+
of external signing requests with `method=Sign`. If external signing is causing request timeouts,
580+
investigate improving the performance of your external signer integration.
581+
- check the kube-apiserver log for details about other signing failures
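
The first check above can be scripted against a scrape of the kube-apiserver `/metrics` endpoint. The following is a minimal sketch that uses only the Go standard library and a deliberately simple parser (it does not handle commas inside label values). The function names and the sample scrape text are hypothetical; the `code` label values follow the spelling used in this section and may differ in a real cluster.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// nonOKSignRequests sums apiserver_externaljwt_sign_request_total samples
// per code label, skipping code="OK".
func nonOKSignRequests(metricsText string) map[string]float64 {
	const prefix = "apiserver_externaljwt_sign_request_total{"
	out := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue // comments, other metrics, blank lines
		}
		end := strings.Index(line, "}")
		if end < 0 {
			continue
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		code := labelValue(line[len(prefix):end], "code")
		if code == "" || code == "OK" {
			continue
		}
		out[code] += value
	}
	return out
}

// labelValue extracts one label value from a `k="v",k2="v2"` label string.
func labelValue(labels, key string) string {
	for _, pair := range strings.Split(labels, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if ok && k == key {
			return strings.Trim(v, `"`)
		}
	}
	return ""
}

func main() {
	// Hypothetical scrape output, for illustration only.
	sample := `apiserver_externaljwt_sign_request_total{code="OK"} 120
apiserver_externaljwt_sign_request_total{code="DEADLINE_EXCEEDED"} 3`
	fmt.Println(nonOKSignRequests(sample))
}
```

A non-empty result suggests signing failures are contributing to the `500` responses; the codes present indicate whether to look at timeouts or at other errors in the log.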
582+
583+
Symptom: token use fails with authentication errors
584+
585+
- check the `apiserver_externaljwt_fetch_keys_request_total` metrics for codes other than `OK`
586+
to determine if verifying keys are failing to be fetched
587+
- check the `apiserver_externaljwt_fetch_keys_success_timestamp` metric to determine the
588+
last time public keys were successfully refreshed. If this exceeds the expected `refresh_hint_seconds`
589+
value for your particular external signer integration, check `kube-apiserver` logs for details on why
590+
the public key fetch is failing.
591+
- check the `apiserver_externaljwt_fetch_keys_data_timestamp` metric to determine the `data_timestamp`
592+
reported by the external signer in the last successful fetch of public keys. Compare to the expected
593+
value for your particular external signer integration to determine if `kube-apiserver` is using current
594+
public keys. If this does not match, check your external signer for details on why it is not returning
595+
the expected public keys to the `FetchKeys` method.
563596
564597
###### How does this feature react if the API server and/or etcd is unavailable?
565598
566599
The feature is only accessible via kube-apiserver. JWT signing and authentication will not work without kube-apiserver in any case.
567600
568601
###### What are other known failure modes?
569602
570-
<!-- TODO
571-
For each of them, fill in the following information by copying the below template:
572-
- [Failure mode brief description]
573-
- Detection: How can it be detected via metrics? Stated another way:
574-
How can an operator troubleshoot without logging into a control plane or worker node?
575-
- Mitigations: What can be done to stop the bleeding, especially for already
576-
running user workloads?
577-
- Diagnostics: What are the useful log messages and their required logging
578-
levels that could help debug the issue?
579-
Not required until the feature graduated to beta.
580-
- Testing: Are there any tests for failure mode? If not, describe why.
581-
-->
603+
Covered above in the troubleshooting section.
582604
583605
###### What steps should be taken if SLOs are not being met to determine the problem?
584606
@@ -590,6 +612,10 @@ Initial PRs:
590612
- kubernetes/kubernetes#73110
591613
- kubernetes/kubernetes#125177
592614
615+
1.32: Alpha release
616+
617+
1.34: Beta release
618+
593619
## Drawbacks
594620
595621
Enabling the feature puts a remote service in the critical path of kube-apiserver. Thus, it can easily cause an outage. However, we have some relief in that it is an opt-in/configurable feature.

keps/sig-auth/740-service-account-external-signing/kep.yaml

Lines changed: 20 additions & 2 deletions
@@ -17,10 +17,11 @@ approvers:
1717

1818
stage: alpha
1919

20-
latest-milestone: "v1.33"
20+
latest-milestone: "v1.34"
2121

2222
milestone:
2323
alpha: "v1.32"
24+
beta: "v1.34"
2425

2526
feature-gates:
2627
- name: ExternalServiceAccountTokenSigner
@@ -34,4 +35,21 @@ metrics:
3435
- serviceaccount_valid_tokens_total
3536
- apiserver_request_duration_seconds
3637
- serviceaccount_stale_tokens_total
37-
38+
# Unix Timestamp in seconds of the last successful FetchKeys data_timestamp value returned by the external signer
39+
# Type: Gauge
40+
- apiserver_externaljwt_fetch_keys_data_timestamp
41+
# Total attempts at syncing supported JWKs
42+
# Type: Counter
43+
# Labels:code
44+
- apiserver_externaljwt_fetch_keys_request_total
45+
# Unix Timestamp in seconds of the last successful FetchKeys request
46+
# Type: Gauge
47+
- apiserver_externaljwt_fetch_keys_success_timestamp
48+
# Request duration and time for calls to external-jwt-signer
49+
# Type: Histogram
50+
# Labels:code,method
51+
- apiserver_externaljwt_request_duration_seconds
52+
# Total attempts at signing JWT
53+
# Type: Counter
54+
# Labels:code
55+
- apiserver_externaljwt_sign_request_total
