Commit 67d0bc2

Add documentation for Kubernetes health checks
1 parent 4103cb5 commit 67d0bc2

4 files changed: +401 -0 lines changed

Lines changed: 283 additions & 0 deletions
@@ -0,0 +1,283 @@
---
title: Teleport Kubernetes Health Checks
sidebar_label: Health Checks
description: How to configure Teleport Kubernetes health checks and view health.
tags:
  - conceptual
  - zero-trust
  - infrastructure-identity
---

This guide gives an overview of Teleport health checks for Kubernetes clusters. Teleport Kubernetes Services periodically check the connectivity and permissions of enrolled Kubernetes clusters. Health checks are available in Teleport version `18.4` and later.

## Why check the health of a Kubernetes cluster?

- **Observability**: Discover network and permission issues before users do. Unhealthy Kubernetes clusters are visible in the Teleport UIs, the `tctl` command line tool, and Prometheus metrics.
- **High Availability**: Automatically route and distribute connections to healthy Kubernetes clusters in a high-availability configuration.

## What's checked?

Teleport checks Kubernetes permissions and a Kubernetes health endpoint to determine whether a Kubernetes cluster is both up and usable with Teleport.

Four Kubernetes RBAC permissions are routinely checked with the Kubernetes [SelfSubjectAccessReview](https://kubernetes.io/docs/reference/kubernetes-api/authorization-resources/self-subject-access-review-v1/) API. These permissions are part of the minimum requirements for Teleport to work with a Kubernetes cluster. The checked permissions are:
- Impersonate users
- Impersonate groups
- Impersonate service accounts
- Get pods

If a permission can't be checked, the Kubernetes cluster's [/readyz](https://kubernetes.io/docs/reference/using-api/health-checks/) endpoint is called to distinguish connection errors from Kubernetes component errors.

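The four permission checks map naturally onto `SelfSubjectAccessReview` request bodies. The following is a minimal Python sketch, not Teleport's implementation: it only builds the JSON payloads a client could POST to the Kubernetes `authorization.k8s.io/v1` API. The helper name `build_access_reviews` is hypothetical, and the sketch assumes cluster-wide checks because any namespace scoping is not described here.

```python
import json

# (verb, resource) pairs for the four permissions listed above.
CHECKED_PERMISSIONS = [
    ("impersonate", "users"),
    ("impersonate", "groups"),
    ("impersonate", "serviceaccounts"),
    ("get", "pods"),
]


def build_access_reviews():
    """Return one SelfSubjectAccessReview body per checked permission.

    Hypothetical helper for illustration; no namespace is set, which is
    an assumption rather than documented Teleport behavior.
    """
    return [
        {
            "apiVersion": "authorization.k8s.io/v1",
            "kind": "SelfSubjectAccessReview",
            "spec": {"resourceAttributes": {"verb": verb, "resource": resource}},
        }
        for verb, resource in CHECKED_PERMISSIONS
    ]


def print_reviews():
    """Print each request body as one JSON document per line."""
    for review in build_access_reviews():
        print(json.dumps(review))
```

The Kubernetes API server answers each review with a `status.allowed` field, which indicates whether the calling identity holds the permission.
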
## Health states

A Kubernetes cluster is in a `healthy`, `unhealthy`, or `unknown` state.
- `healthy` indicates the Kubernetes cluster's health was checked and the cluster is usable
- `unhealthy` indicates the Kubernetes cluster's health was checked and the cluster is not usable
- `unknown` indicates the Kubernetes cluster has been excluded from health checks, or its first health check has not completed yet

<Admonition type="warning">

`unknown` Kubernetes clusters may be unhealthy.

`unknown` Kubernetes clusters are not checked for health due to either:
- Running a pre-`18.4` version of Teleport Kubernetes Service
- Explicitly configuring `health_check_config` labels to exclude a Kubernetes cluster
</Admonition>

## Viewing health

Kubernetes cluster health can be viewed in the Teleport web UI, the Teleport Connect desktop UI, the `tctl` CLI tool, or Prometheus metrics.

**Teleport Web & Connect UI**

Click the refresh icon to get the latest health of Kubernetes clusters.

![Kubernetes health warning in the UI](../../../img/resource-health-check/kubernetes-health-warning.png)

**Teleport `tctl` CLI**

Run `tctl get kube_server/<your-kube-server-name>` for an overview of Kubernetes cluster health as seen by a specific Kubernetes service.
```yaml
kind: kube_server
...
status:
  target_health:
    address: 192.168.106.2:58458
    message: 1 health check passed
    protocol: http
    status: healthy
    transition_reason: threshold_reached
    transition_timestamp: "2025-10-13T19:26:58.842855Z"
version: v3
```

**Teleport Prometheus metrics**

Health check metrics offer a high-level view of Kubernetes cluster health. The total number of Kubernetes clusters in a `healthy`, `unhealthy`, or `unknown` state is reported by the gauge metrics `teleport_resources_health_status_healthy`, `teleport_resources_health_status_unhealthy`, and `teleport_resources_health_status_unknown`.

- `teleport_resources_health_status_healthy{type="kubernetes"}` is the total number of _healthy_ Kubernetes clusters
- `teleport_resources_health_status_unhealthy{type="kubernetes"}` is the total number of _unhealthy_ Kubernetes clusters
- `teleport_resources_health_status_unknown{type="kubernetes"}` is the total number of Kubernetes clusters in an _unknown_ state

A [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) expression may be used to determine the total number of Kubernetes clusters.
```promql
teleport_resources_health_status_healthy{type="kubernetes"} +
teleport_resources_health_status_unhealthy{type="kubernetes"} +
teleport_resources_health_status_unknown{type="kubernetes"}
```

A PromQL expression may be used to detect the presence of unhealthy Kubernetes clusters.
```promql
teleport_resources_health_status_unhealthy{type="kubernetes"} > 0
```

<Admonition type="note">
Prometheus metrics don't identify which cluster is unhealthy.

Use a Teleport UI, or `tctl kube ls --query 'health.status == "unhealthy"'` to determine which cluster is unhealthy.
</Admonition>

The Teleport `tctl top` command displays Kubernetes health check metrics.

Steps to view health check metrics with `tctl`:
- Run `tctl top`
- Navigate to the "Raw Metrics" tab by pressing the right arrow `→` key several times
- Enter search mode by pressing the forward slash `/` key
- Type or paste `teleport_resources_health_status`

![Health metrics with the tctl top command](../../../img/resource-health-check/kubernetes-health-metrics-tctl.png)

Health check metrics may also be viewed at the Teleport diagnostic endpoint `http://<diagnostic-address>/metrics`.
```text
# HELP teleport_resources_health_status_healthy Number of healthy resources
# TYPE teleport_resources_health_status_healthy gauge
teleport_resources_health_status_healthy{type="kubernetes"} 99972
# HELP teleport_resources_health_status_unhealthy Number of unhealthy resources
# TYPE teleport_resources_health_status_unhealthy gauge
teleport_resources_health_status_unhealthy{type="kubernetes"} 3
# HELP teleport_resources_health_status_unknown Number of resources in an unknown health state
# TYPE teleport_resources_health_status_unknown gauge
teleport_resources_health_status_unknown{type="kubernetes"} 25
```

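Output in this text format can also be summarized with a short script. The following Python sketch is one way to do it, assuming the metrics text has already been fetched from the diagnostic endpoint and that sample lines carry no trailing timestamp. The function name `kubernetes_health_counts` is hypothetical.

```python
import re

# Matches sample lines such as:
#   teleport_resources_health_status_healthy{type="kubernetes"} 99972
SAMPLE_RE = re.compile(
    r'^teleport_resources_health_status_(healthy|unhealthy|unknown)'
    r'\{[^}]*type="kubernetes"[^}]*\}\s+(\S+)$'
)


def kubernetes_health_counts(metrics_text):
    """Return {state: count} for Kubernetes clusters from /metrics text.

    Comment lines (# HELP, # TYPE) and other metrics are ignored.
    """
    counts = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            state, value = match.groups()
            # Values are floats in the exposition format; counts are whole.
            counts[state] = counts.get(state, 0) + int(float(value))
    return counts
```

Summing the returned values gives the same total as the first PromQL expression above, and a non-zero `unhealthy` entry corresponds to the second.
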
## Configuring health checks

Use the Teleport `tctl` CLI tool to read, create, edit, and delete `health_check_config` resources.

`health_check_config` resources offer a way to configure and selectively apply health checks to Kubernetes clusters.

An example `health_check_config`:
```yaml
kind: health_check_config
version: v1
metadata:
  name: example
  description: Example healthcheck configuration
spec:
  # interval is the time between each health check. Default 30s.
  interval: 30s
  # timeout is the health check connection establishment timeout. Default 5s.
  timeout: 5s
  # healthy_threshold is the number of consecutive passing health checks
  # after which a target's health status becomes "healthy". Default 2.
  healthy_threshold: 2
  # unhealthy_threshold is the number of consecutive failing health checks
  # after which a target's health status becomes "unhealthy". Default 1.
  unhealthy_threshold: 1
  # match is used to select Kubernetes clusters that apply these settings.
  # Kubernetes clusters are matched by label selectors and at least one label selector
  # must be set.
  # If multiple `health_check_config` resources match the same Kubernetes cluster,
  # the matching health check configs are sorted by name and only the first
  # config applies.
  match:
    # kubernetes_labels matches Kubernetes cluster labels. An empty value is ignored.
    # If kubernetes_labels_expression is also set, then the match result is the logical
    # AND of both.
    kubernetes_labels:
      - name: env
        values:
          - dev
          - staging
    # kubernetes_labels_expression is a label predicate expression to match Kubernetes clusters.
    # An empty value is ignored.
    # If kubernetes_labels is also set, then the match result is the logical AND of both.
    kubernetes_labels_expression: 'labels["owner"] == "platform-team"'
```

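The interaction between `healthy_threshold` and `unhealthy_threshold` can be pictured as a counter of consecutive results. The Python sketch below illustrates that idea only and is not Teleport's implementation. The class name `HealthTracker` is hypothetical, and applying the thresholds starting from the `unknown` state is an assumption, since how the very first check is treated is not described here.

```python
class HealthTracker:
    """Tracks one target's status from consecutive check results."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=1):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.status = "unknown"  # assumption: no check has completed yet
        self._last_result = None
        self._streak = 0

    def record(self, passed):
        """Record one check result and return the resulting status."""
        if passed == self._last_result:
            self._streak += 1
        else:
            # The result flipped, so the consecutive count starts over.
            self._last_result = passed
            self._streak = 1
        if passed and self._streak >= self.healthy_threshold:
            self.status = "healthy"
        elif not passed and self._streak >= self.unhealthy_threshold:
            self.status = "unhealthy"
        return self.status
```

With the defaults shown in the example config, a single failing check marks a target `unhealthy`, while two passing checks in a row are needed before it is marked `healthy` again.
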
The default `health_check_config` enables all Kubernetes clusters to participate in health checks from version `18.4` onward.
```yaml
kind: health_check_config
metadata:
  description: Enables all health checks by default
  labels:
    teleport.internal/resource-type: preset
  name: default
  namespace: default
spec:
  match:
    kubernetes_labels:
      - name: '*'
        values:
          - '*'
version: v1
```

Multiple `health_check_config` resources may be created for different groups of Kubernetes clusters. When multiple `health_check_config` resources match the same Kubernetes cluster, the configs are sorted in ascending order by name, and only the first config applies (e.g., the name "00-my-config" takes precedence over "10-my-config").

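The precedence rule can be expressed in a few lines. This Python sketch assumes the set of matching config names is already known and that Teleport's ordering agrees with Python's default string ordering for plain ASCII names. The function name `applicable_config` is hypothetical.

```python
def applicable_config(matching_config_names):
    """Return the name of the config that applies, or None if none match.

    Names are sorted in ascending order and the first one wins.
    """
    if not matching_config_names:
        return None
    return sorted(matching_config_names)[0]
```

For example, `applicable_config(["10-my-config", "00-my-config"])` returns `"00-my-config"`, so a numeric prefix is a simple way to make precedence explicit.
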
**`tctl` health check commands**

Read the default health check config with `tctl get`:
```bash
$ tctl get health_check_config/default
```

Create a new health check config with `tctl create`:
```bash
$ tctl create health_check_config.yaml
```

Update an existing config interactively with `tctl edit`:
```bash
$ tctl edit health_check_config/default
```

Delete a health check config with `tctl rm`:
```bash
$ tctl rm health_check_config/example
```

Teleport Kubernetes Services are notified of changes to `health_check_config` resources. Each service then reevaluates whether a Kubernetes cluster participates in health checks and applies any changed settings.

## FAQ

### How do I see unhealthy Kubernetes clusters?

The Teleport web and Connect UI highlight unhealthy Kubernetes clusters.

Clicking a highlighted Kubernetes cluster shows details about why it is unhealthy.

![Kubernetes health warning in the UI](../../../img/resource-health-check/kubernetes-health-warning.png)

It may take approximately `5m` for a health change to be reported.

Click the circular arrow refresh icon to get the latest health status.

The Teleport `tctl` CLI tool can also list unhealthy Kubernetes clusters.
```code
tctl kube ls --query 'health.status == "unhealthy"'
```

### What guidance is there for troubleshooting unhealthy Kubernetes clusters?

See the [Kubernetes Service troubleshooting guide](./troubleshooting.mdx) for specific errors returned by health checks.

### How do I disable Kubernetes health checks?

Add a filtering label to exclude Kubernetes clusters with the CLI command `tctl edit health_check_config/default`. If you have multiple `health_check_config` resources, such as `health_check_config/example-a` and `health_check_config/example-b`, adjust each one.

Adding a label to a `health_check_config` tells Teleport to filter out Kubernetes clusters that don't match the label.

For example, setting the name `healthcheck` and the value `enabled` in `kubernetes_labels`, as in the following config, excludes Kubernetes clusters without that label.
```yaml
kind: health_check_config
metadata:
  description: Enables all health checks by default
  labels:
    teleport.internal/resource-type: preset
  name: default
  namespace: default
  revision: 040e3839-c248-4b9d-898f-4ae88b1d4752
spec:
  match:
    kubernetes_labels:
      - name: 'healthcheck'
        values:
          - 'enabled'
version: v1
```

The default is to match all enrolled Kubernetes clusters.

```yaml
spec:
  match:
    kubernetes_labels:
      - name: '*'
        values:
          - '*'
```

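The difference between the two selectors can be illustrated with a simplified matcher. This Python sketch is an approximation under stated assumptions, not Teleport's matching code: it handles only exact names and values plus the `'*'` wildcard, it assumes every selector entry must match, and the function name `selector_matches` is hypothetical.

```python
# Selectors taken from the two YAML examples above.
MATCH_ALL = [{"name": "*", "values": ["*"]}]
OPT_IN = [{"name": "healthcheck", "values": ["enabled"]}]


def selector_matches(kubernetes_labels, cluster_labels):
    """Return True if every selector entry matches the cluster's labels.

    kubernetes_labels: list of {"name": str, "values": [str, ...]}
    cluster_labels: dict mapping label name to label value
    """
    for entry in kubernetes_labels:
        name, values = entry["name"], entry["values"]
        if name == "*" and "*" in values:
            continue  # wildcard entry accepts every cluster
        if name not in cluster_labels:
            return False  # cluster lacks the required label
        if "*" not in values and cluster_labels[name] not in values:
            return False  # label is present but its value is not listed
    return True
```

Under these assumptions, `selector_matches(OPT_IN, {"env": "dev"})` is `False`, so that cluster would be left out of health checks, while `selector_matches(MATCH_ALL, {"env": "dev"})` is `True`.
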
### Do health check metrics show which Kubernetes cluster is unhealthy?

No. A specific Kubernetes cluster's health cannot be determined from the `teleport_resources_health_status_*` health metrics.

Metrics report only the number of unhealthy Kubernetes clusters.

### How do I configure health checks for high-availability?

No additional health check configuration is needed.

Health-based connection routing is automatic when multiple Teleport Kubernetes Services are proxying to the same Kubernetes cluster.

You do need a high-availability deployment in which multiple Teleport Kubernetes Services proxy the same Kubernetes cluster.
