KEP-740: promote external token signing to beta #5323

Status: Open. Wants to merge 1 commit into base: master.
4 changes: 3 additions & 1 deletion keps/prod-readiness/sig-auth/740.yaml
@@ -1,3 +1,5 @@
kep-number: 740
alpha:
approver: "@soltysh"
approver: "@soltysh"
beta:
approver: "@soltysh"
138 changes: 82 additions & 56 deletions keps/sig-auth/740-service-account-external-signing/README.md
@@ -24,7 +24,6 @@
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
@@ -50,20 +49,20 @@

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] ~~e2e Tests for all Beta API Operations (endpoints)~~ no API endpoints
- [ ] ~~(R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)~~ no API endpoints
- [ ] ~~(R) Minimum Two Week Window for GA e2e tests to prove flake free~~ no API endpoints
- [x] (R) Graduation criteria is in place
- [ ] ~~(R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)~~ no API endpoints
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -142,9 +141,9 @@ A new versioned grpc API (ExternalJWTSigner) will be created under `k8s.io/kuber
#### Support for Legacy Tokens

Implementers will have the following options for legacy token support:
1. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
2. Turn off the loop (don't support legacy tokens) if external signing is enabled.
3. Create a custom external signer for legacy tokens using Controller loop from staging repo (This option will only be available if demanded by Community as part of feedback for Beta graduation).
1. Turn off the loop (don't support legacy tokens) if external signing is enabled. (recommended to avoid non-expiring tokens)
2. Let the Controller loop run as it is with static signing keys. Stitch the public keys in external signer's JWKs.
3. Turn off the loop in kube-controller-manager and create a custom external signer for legacy tokens that obtains them via the external signer.

### Risks and Mitigations

@@ -280,13 +279,12 @@ to implement this enhancement.
##### Integration tests

- Create a cluster with ExternalJWTSigner to configure an external signer and verify TokenRequest and TokenReview APIs work properly.

##### e2e tests

- Create a cluster with ExternalJWTSigner configured.
- Request a token for a service account principal.
- Use a token as bearer for making requests to kube-apiserver and ensure it succeeds.

- [TestExternalJWTSigningAndAuth](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L46C6-L46C35): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)
- [TestDelayedStartForSigner](https://github.com/kubernetes/kubernetes/blob/8aae5398b3885dc271d407c4d661e19653daaf88/test/integration/serviceaccount/external_jwt_signer_test.go#L282): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=serviceaccount), [triage search](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=serviceaccount)

### Graduation Criteria

#### Alpha
@@ -296,13 +294,15 @@ to implement this enhancement.

#### Beta

- E2E tests are completed.
- We have at least one ExternalSigner implementation working with this change.
- All tests are completed.
- We have at least one ExternalSigner integration working with this change.
- GKE integration is complete
- Decide whether to externalize legacy token controller code in a staging repo. Check [Support for Legacy Tokens](#support-for-legacy-tokens) for details.
- Decided not to externalize legacy token controller code

#### GA

- More than one ExternalSigner implementations are completed.
- More than one ExternalSigner integration is completed.
- Feature is tuned with feedback from distributions.

### Upgrade/Downgrade Strategy
@@ -425,10 +425,13 @@ No.

The feature is not used by workloads directly; it is used by kube-apiserver.

The usage should be visible to the operator using Audit logs.
<!-- TODO
Add details on increasing audit log surface area for External signers
-->
The usage should be visible to the operator via these metrics:

- apiserver_externaljwt_fetch_keys_data_timestamp
- apiserver_externaljwt_fetch_keys_request_total
- apiserver_externaljwt_fetch_keys_success_timestamp
- apiserver_externaljwt_request_duration_seconds
- apiserver_externaljwt_sign_request_total

###### How can someone using this feature know that it is working for their instance?

@@ -440,16 +443,25 @@ The usage should be visible to the operator using Audit logs.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

<!-- TODO
Needs Benchmarking on SLIs.
-->
It is expected that external `Sign` request durations will be dominated by the external signer implementation.
Instrumenting processing time of your external signer implementation is recommended.

Experimentally, the gRPC overhead adds about 1ms to a TokenRequest, comparing the in-tree kube-apiserver
service account token signer with a stub external signer still doing local signing.

The `apiserver_externaljwt_request_duration_seconds{method=Sign,code=OK}` metrics
are expected to be within 1-10ms of the external signer processing time.
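
As a rough illustration of the expectation above, the following Python sketch compares a duration observed by kube-apiserver for a `Sign` call with the processing time measured inside the external signer. The function names and the 0.010 s default are assumptions made for this example (the default mirrors the upper end of the 1-10ms range stated above); neither is part of kube-apiserver or the externaljwt API.

```python
def sign_overhead_seconds(apiserver_observed: float, signer_processing: float) -> float:
    """Time spent outside the signer itself (gRPC transport, serialization, queuing)."""
    return apiserver_observed - signer_processing


def within_expected_overhead(apiserver_observed: float,
                             signer_processing: float,
                             max_overhead: float = 0.010) -> bool:
    """True if the overhead stays within the assumed budget (hypothetical helper).

    Both durations are in seconds, matching the unit of
    apiserver_externaljwt_request_duration_seconds.
    """
    return sign_overhead_seconds(apiserver_observed, signer_processing) <= max_overhead


# Example: the signer reports 20ms of processing, kube-apiserver observes 25ms.
print(within_expected_overhead(0.025, 0.020))
```

An overhead well above the budget suggests looking at the transport between kube-apiserver and the signer rather than at the signer's own processing time.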

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
- Metric name: `apiserver_request_total` and `apiserver_request_duration_seconds`
- Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
- Components exposing the metric: kube-apiserver
- Aggregation method: aggregate over `job="kubernetes-apiservers",group="", version="v1",resource="serviceaccounts",subresource="token"`
- Components exposing the metric: kube-apiserver
- Metric name: `apiserver_externaljwt_sign_request_total`
- Components exposing the metric: kube-apiserver
- Metric name: `apiserver_externaljwt_request_duration_seconds`
- Components exposing the metric: kube-apiserver

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -468,7 +480,6 @@ Needs Benchmarking on SLIs.
- It is the integrator's responsibility to ensure that their ExternalJWTSigner implementation supports signing tokens with 1 year validity if their clusters rely on extended token lifetimes.
- Integrators can observe the `serviceaccount_stale_tokens_total` metric to confirm their cluster's reliance on `--service-account-extend-token-expiration`.


### Dependencies

One new dependency will be introduced and it will only be required for clusters configured/opted-in via the `--service-account-signing-endpoint` flag.
@@ -550,35 +561,46 @@ not likely.

### Troubleshooting

<!-- TODO
This section must be completed when targeting beta to a release.

For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
Symptom: kube-apiserver will not start with `--service-account-signing-endpoint` set

- check the kube-apiserver log for details about why startup failed
- ensure the socket `--service-account-signing-endpoint` points to is valid,
the kube-apiserver user has permissions to access it, and the external signer is running
- ensure `--service-account-signing-key-file` and `--service-account-key-file` are not also set
- ensure the external signer supports the version of the externaljwt gRPC API kube-apiserver is using
- ensure the maximum supported token lifetime returned by the external signer does not conflict with the
`--service-account-max-token-expiration` flag, if set (the flag value must not exceed the maximum expiration supported by the external signer)
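
The flag-related checks in the list above can be sketched as a small pre-flight function. This is an illustration only: `startup_config_problems` and its parameters are hypothetical names for this example, not kube-apiserver code, and the signer's maximum lifetime is assumed to be known to the operator in seconds.

```python
from typing import List, Optional


def startup_config_problems(signing_endpoint: Optional[str],
                            signing_key_file: Optional[str],
                            key_files: List[str],
                            max_token_expiration_s: Optional[int],
                            signer_max_lifetime_s: int) -> List[str]:
    """Return human-readable conflicts between the flags listed above."""
    problems: List[str] = []
    if not signing_endpoint:
        # External signing is not configured, so none of these checks apply.
        return problems
    if signing_key_file:
        problems.append("--service-account-signing-key-file is set together "
                        "with --service-account-signing-endpoint")
    if key_files:
        problems.append("--service-account-key-file is set together "
                        "with --service-account-signing-endpoint")
    if (max_token_expiration_s is not None
            and max_token_expiration_s > signer_max_lifetime_s):
        problems.append("--service-account-max-token-expiration exceeds the "
                        "maximum lifetime supported by the external signer")
    return problems


# Example: a 48h flag value against a signer that supports at most 24h.
for problem in startup_config_problems("unix:///hypothetical/signer.sock",
                                       None, [], 172800, 86400):
    print(problem)
```

The socket path in the example is a placeholder. The kube-apiserver log remains the authoritative source for why startup failed.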

Symptom: token creation fails with `500` errors

- check `apiserver_externaljwt_sign_request_total` metrics for codes other than `OK` to determine if signing failures are the cause
- if signing requests are failing with `CANCELLED` or `DEADLINE_EXCEEDED` codes,
check `apiserver_externaljwt_request_duration_seconds` metrics for timing distribution
of external signing requests with `method=Sign`. If external signing is causing request timeouts,
investigate improving the performance of your external signer integration.
- check the kube-apiserver log for details about other signing failures
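
A minimal sketch of the triage order described above, assuming the operator has already collected the increase of `apiserver_externaljwt_sign_request_total` per `code` label into a dictionary. The function name is hypothetical, and the label spellings `CANCELLED` and `DEADLINE_EXCEEDED` are taken from the text above; the exact values emitted by a given kube-apiserver version should be confirmed against its metrics output.

```python
from typing import Dict

# Codes that point at timeouts rather than other signing errors
# (spelling assumed from the troubleshooting text above).
TIMEOUT_CODES = {"CANCELLED", "DEADLINE_EXCEEDED"}


def triage_sign_failures(counts_by_code: Dict[str, float]) -> str:
    """Suggest the next troubleshooting step from per-code request counts."""
    failures = {code: n for code, n in counts_by_code.items()
                if code != "OK" and n > 0}
    if not failures:
        return "no signing failures observed"
    if any(code in TIMEOUT_CODES for code in failures):
        return ("inspect apiserver_externaljwt_request_duration_seconds "
                "with method=Sign")
    return "inspect the kube-apiserver log for signing errors"


print(triage_sign_failures({"OK": 1200, "DEADLINE_EXCEEDED": 7}))
```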

Symptom: token use fails with authentication errors

- check the `apiserver_externaljwt_fetch_keys_request_total` metrics for codes other than `OK`
to determine whether fetching the verification keys is failing
- check the `apiserver_externaljwt_fetch_keys_success_timestamp` metric to determine the
last time public keys were successfully refreshed. If this exceeds the expected `refresh_hint_seconds`
value for your particular external signer integration, check `kube-apiserver` logs for details on why
the public key fetch is failing.
- check the `apiserver_externaljwt_fetch_keys_data_timestamp` metric to determine the `data_timestamp`
reported by the external signer in the last successful fetch of public keys. Compare to the expected
value for your particular external signer integration to determine if `kube-apiserver` is using current
public keys. If this does not match, check your external signer for details on why it is not returning
the expected public keys to the `FetchKeys` method.
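
The freshness check on `apiserver_externaljwt_fetch_keys_success_timestamp` can be sketched as follows. The function name and the `slack` multiplier are assumptions for this example: the text above only says to compare against the expected `refresh_hint_seconds` of the signer integration, so the tolerance for one missed or delayed refresh is a choice made here, not a documented threshold.

```python
import time
from typing import Optional


def keys_refresh_is_stale(success_timestamp: float,
                          refresh_hint_seconds: float,
                          now: Optional[float] = None,
                          slack: float = 2.0) -> bool:
    """True if the last successful FetchKeys is older than expected.

    success_timestamp: Unix seconds, as reported by the gauge.
    slack: multiplier on refresh_hint_seconds (assumed tolerance).
    """
    current = time.time() if now is None else now
    return (current - success_timestamp) > refresh_hint_seconds * slack


# Example with fixed times: last success 200s ago, refresh hint of 60s.
print(keys_refresh_is_stale(1000.0, 60.0, now=1200.0))
```

When this reports a stale refresh, the kube-apiserver logs are the next place to look, as described in the list above.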

###### How does this feature react if the API server and/or etcd is unavailable?

The feature is only accessible via kube-apiserver. JWT signing and authentication do not work without kube-apiserver in any case.

###### What are other known failure modes?

<!-- TODO
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
How can an operator troubleshoot without logging into a control plane or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until the feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
Covered above in the troubleshooting section.

###### What steps should be taken if SLOs are not being met to determine the problem?

@@ -590,6 +612,10 @@ Initial PRs:
- kubernetes/kubernetes#73110
- kubernetes/kubernetes#125177

1.32: Alpha release

1.34: Beta release

## Drawbacks

Enabling the feature puts a remote service in the critical path of kube-apiserver. Thus, it can easily cause an outage. However, we have some relief in that it is an opt-in/configurable feature.
22 changes: 20 additions & 2 deletions keps/sig-auth/740-service-account-external-signing/kep.yaml
@@ -17,10 +17,11 @@ approvers:

stage: alpha

latest-milestone: "v1.33"
latest-milestone: "v1.34"

milestone:
alpha: "v1.32"
beta: "v1.34"

feature-gates:
- name: ExternalServiceAccountTokenSigner
@@ -34,4 +35,21 @@ metrics:
- serviceaccount_valid_tokens_total
- apiserver_request_duration_seconds
- serviceaccount_stale_tokens_total

# Unix Timestamp in seconds of the last successful FetchKeys data_timestamp value returned by the external signer
# Type: Gauge
- apiserver_externaljwt_fetch_keys_data_timestamp
# Total attempts at syncing supported JWKs
# Type: Counter
# Labels:code
- apiserver_externaljwt_fetch_keys_request_total
# Unix Timestamp in seconds of the last successful FetchKeys request
# Type: Gauge
- apiserver_externaljwt_fetch_keys_success_timestamp
# Request duration and time for calls to external-jwt-signer
# Type: Histogram
# Labels:code,method
- apiserver_externaljwt_request_duration_seconds
# Total attempts at signing JWT
# Type: Counter
# Labels:code
- apiserver_externaljwt_sign_request_total