Commit a40c35e

KEP-740: promote external token signing to beta
1 parent c1a01d2 commit a40c35e

3 files changed

+100
-58
lines changed

keps/prod-readiness/sig-auth/740.yaml

Lines changed: 3 additions & 1 deletion
@@ -1,3 +1,5 @@
 kep-number: 740
 alpha:
-  approver: "@soltysh"
+  approver: "@soltysh"
+beta:
+  approver: "@soltysh"

keps/sig-auth/740-service-account-external-signing/README.md

Lines changed: 77 additions & 55 deletions
@@ -24,7 +24,6 @@
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
 - [Integration tests](#integration-tests)
-- [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
 - [Beta](#beta)
@@ -50,20 +49,16 @@
 
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] (R) Graduation criteria is in place
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -142,9 +137,9 @@ A new versioned grpc API (ExternalJWTSigner) will be created under `k8s.io/kuber
 #### Support for Legacy Tokens
 
 Implementers will have following options for legacy token support:
-1. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
-2. Turn off the loop (don't support legacy tokens) if external signing is enabled.
-3. Create a custom external signer for legacy tokens using Controller loop from staging repo (This option will only be available if demanded by Community as part of feedback for Beta graduation).
+1. Turn off the loop (don't support legacy tokens) if external signing is enabled. (recommended to avoid non-expiring tokens)
+2. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
+3. Turn off the loop in kube-controller-manager and create a custom external signer for legacy tokens that obtains them via the external signer.
 
 ### Risks and Mitigations
 
@@ -280,13 +275,12 @@ to implement this enhancement.
 ##### Integration tests
 
 - Create a cluster with ExternalJWTSigner to configure an external signer and verify TokenRequest and TokenReview APIs work properly.
-
-##### e2e tests
-
-- Create a cluster with ExternalJWTSigner configured.
 - Request a token for a service account principal.
 - Use a token as bearer for making requests to kube-apiserver and ensure it succeeds.
 
+- [TestExternalJWTSigningAndAuth](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L46C6-L46C35): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
+- [TestDelayedStartForSigner](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L282): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
+
 ### Graduation Criteria
 
 #### Alpha
@@ -296,13 +290,15 @@ to implement this enhancement.
 
 #### Beta
 
-- E2E tests are completed.
-- We have at least one ExternalSigner implementation working with this change.
+- All tests are completed.
+- We have at least one ExternalSigner integration working with this change.
+- GKE integration is complete
 - Decide whether to externalize legacy token controller code in a staging repo. Check [Support for Legacy Tokens](#support-for-legacy-tokens) for details.
+- Decided not to externalize legacy token controller code
 
 #### GA
 
-- More than one ExternalSigner implementations are completed.
+- More than one ExternalSigner integration are completed.
 - Feature is tuned with feedback from distributions.
 
 ### Upgrade/Downgrade Strategy
@@ -425,10 +421,13 @@ No.
 
 The Feature would not be used by workload directly but will be used by kube-apiserver.
 
-The usage should be visible to the operator using Audit logs.
-<!-- TODO
-Add details on increasing audit log surface area for External signers
--->
+The usage should be visible to the operator via these metrics:
+
+- apiserver_externaljwt_fetch_keys_data_timestamp
+- apiserver_externaljwt_fetch_keys_request_total
+- apiserver_externaljwt_fetch_keys_success_timestamp
+- apiserver_externaljwt_request_duration_seconds
+- apiserver_externaljwt_sign_request_total
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -440,16 +439,25 @@ The usage should be visible to the operator using Audit logs.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-<!-- TODO
-Needs Benchmarking on SLIs.
--->
+It is expected that external `Sign` request durations will be dominated by the external signer implementation.
+Instrumenting processing time of your external signer implementation is recommended.
+
+Experimentally, the gRPC overhead adds about 1ms to a TokenRequest, comparing the in-tree kube-apiserver
+service account token signer with a stub external signer still doing local signing.
+
+The `apiserver_externaljwt_request_duration_seconds{method=Sign,code=OK}` metrics
+are expected to be within 1-10ms of the external signer processing time.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 - [x] Metrics
 - Metric name: `apiserver_request_total` and `apiserver_request_duration_seconds`
-- Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
-- Components exposing the metric: kube-apiserver
+- Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
+- Components exposing the metric: kube-apiserver
+- Metric name: `apiserver_externaljwt_sign_request_total`
+- Components exposing the metric: kube-apiserver
+- Metric name: `apiserver_externaljwt_request_duration_seconds`
+- Components exposing the metric: kube-apiserver
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
@@ -468,7 +476,6 @@ Needs Benchmarking on SLIs.
 - It is the integrator's responsibility to ensure that their ExternalJWTSigner implementation support signing tokens with 1 year validity i.e. if their clusters are relying on extended token lifetimes.
 - integrators can observe the `serviceaccount_stale_tokens_total` metric to confirm their cluster's reliance on `--service-account-extend-token-expiration`.
 
-
 ### Dependencies
 
 One new dependency will be introduced and it will only be required for clusters configured/opted-in via the `--service-account-signing-endpoint` flag.
@@ -550,35 +557,46 @@ not likely.
 
 ### Troubleshooting
 
-<!-- TODO
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
+Symptom: kube-apiserver will not start with `--service-account-signing-endpoint` set
+
+- check the kube-apiserver log for details about why startup failed
+- ensure the socket `--service-account-signing-endpoint` points to is valid,
+the kube-apiserver user has permissions to access it, and the external signer is running
+- ensure `--service-account-signing-key-file` and `--service-account-key-file` are not also set
+- ensure the external signer supports the version of the externaljwt gRPC API kube-apiserver is using
+- ensure the maximum supported token lifetime returned by the external signer does not conflict with any
+`--service-account-max-token-expiration` flag (the flag may not be longer than the max expiration supported by the external signer)
+
+Symptom: token creation fails with `500` errors
+
+- check `apiserver_externaljwt_sign_request_total` metrics for codes other than `OK` to determine if signing failures are the cause
+- if signing requests are failing with `CANCELLED` or `DEADLINE_EXCEEDED` codes,
+check `apiserver_externaljwt_request_duration_seconds` metrics for timing distribution
+of external signing requests with `method=Sign`. If external signing is causing request timeouts,
+investigate improving the performance of your external signer integration.
+- check the kube-apiserver log for details about other signing failures
+
+Symptom: token use fails with authentication errors
+
+- check the `apiserver_externaljwt_fetch_keys_request_total` metrics for codes other than `OK`
+to determine if verifying keys are failing to be fetched
+- check the `apiserver_externaljwt_fetch_keys_success_timestamp` metric to determine the
+last time public keys were successfully refreshed. If this exceeds the expected `refresh_hint_seconds`
+value for your particular external signer integration, check `kube-apiserver` logs for details on why
+the public key fetch is failing.
+- check the `apiserver_externaljwt_fetch_keys_data_timestamp` metric to determine the `data_timestamp`
+reported by the external signer in the last successful fetch of public keys. Compare to the expected
+value for your particular external signer integration to determine if `kube-apiserver` is using current
+public keys. If this does not match, check your external signer for details on why it is not returning
+the expected public keys to the `FetchKeys` method.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
 feature is only accessible via kube-apiserver. JWT signing and authentication will anyways not work without kube-apiserver.
 
 ###### What are other known failure modes?
 
-<!-- TODO
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-- Detection: How can it be detected via metrics? Stated another way:
-How can an operator troubleshoot without logging into a control plane or worker node?
-- Mitigations: What can be done to stop the bleeding, especially for already
-running user workloads?
-- Diagnostics: What are the useful log messages and their required logging
-levels that could help debug the issue?
-Not required until the feature graduated to beta.
-- Testing: Are there any tests for failure mode? If not, describe why.
--->
+Covered above in the troubleshooting section.
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
@@ -590,6 +608,10 @@ Initial PRs:
 - kubernetes/kubernetes#73110
 - kubernetes/kubernetes#125177
 
+1.32: Alpha release
+
+1.34: Beta release
+
 ## Drawbacks
 
 Enabling the feature puts a remote service in the critical path of kube-apiserver. Thus, it can easily cause an outage. However, we have some relief in that it is an opt-in/configurable feature.

keps/sig-auth/740-service-account-external-signing/kep.yaml

Lines changed: 20 additions & 2 deletions
@@ -17,10 +17,11 @@ approvers:
 
 stage: alpha
 
-latest-milestone: "v1.33"
+latest-milestone: "v1.34"
 
 milestone:
   alpha: "v1.32"
+  beta: "v1.34"
 
 feature-gates:
 - name: ExternalServiceAccountTokenSigner
@@ -34,4 +35,21 @@ metrics:
 - serviceaccount_valid_tokens_total
 - apiserver_request_duration_seconds
 - serviceaccount_stale_tokens_total
-
+# Unix Timestamp in seconds of the last successful FetchKeys data_timestamp value returned by the external signer
+# Type: Gauge
+- apiserver_externaljwt_fetch_keys_data_timestamp
+# Total attempts at syncing supported JWKs
+# Type: Counter
+# Labels:code
+- apiserver_externaljwt_fetch_keys_request_total
+# Unix Timestamp in seconds of the last successful FetchKeys request
+# Type: Gauge
+- apiserver_externaljwt_fetch_keys_success_timestamp
+# Request duration and time for calls to external-jwt-signer
+# Type: Histogram
+# Labels:code,method
+- apiserver_externaljwt_request_duration_seconds
+# Total attempts at signing JWT
+# Type: Counter
+# Labels:code
+- apiserver_externaljwt_sign_request_total
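
The `code` label on `apiserver_externaljwt_sign_request_total` lends itself to a simple failure-ratio check. The Go sketch below is only an illustration under stated assumptions: `nonOKRatio` is a hypothetical helper, the per-code counts are made-up sample values rather than data from a real cluster, and the exact spelling of non-`OK` label values is assumed from the codes named in the README diff.

```go
package main

import "fmt"

// nonOKRatio returns the fraction of requests whose code label is not "OK".
// The input is assumed to map each code label value to its counter value;
// the helper itself is hypothetical and not part of Kubernetes.
func nonOKRatio(countsByCode map[string]float64) float64 {
	var total, failed float64
	for code, n := range countsByCode {
		total += n
		if code != "OK" {
			failed += n
		}
	}
	if total == 0 {
		return 0 // no requests observed, so nothing has failed
	}
	return failed / total
}

func main() {
	// Illustrative sample counts only.
	counts := map[string]float64{
		"OK":                980,
		"DEADLINE_EXCEEDED": 15,
		"CANCELLED":         5,
	}
	fmt.Printf("%.3f\n", nonOKRatio(counts)) // 20 of 1000 failed: prints 0.020
}
```

In practice the same ratio would more likely be computed in the monitoring system from counter rates; the sketch only shows the arithmetic.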
