Conversation

Contributor

@rana rana commented Oct 11, 2025

Kubernetes health checks are enabled by default.

A per-resource health check config approach is implemented to ease adoption of new health checks over time while avoiding migration of the backend database.
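
Concretely, each per-resource preset is just a health_check_config resource whose matcher selects every resource of that kind. The following is only a rough sketch of what the Kubernetes preset amounts to; the proto type and field names (HealthCheckConfigSpec, Matcher, headerv1.Metadata, labelv1.Label) are assumptions inferred from the JSON dumps in the testing notes below, not the actual definitions from this PR:

// Sketch only: type and field names are assumptions inferred from the stored
// JSON shown below; the real constructor in this PR is
// services.NewPresetHealthCheckConfigKube.
func newPresetHealthCheckConfigKube() *healthcheckconfigv1.HealthCheckConfig {
	return &healthcheckconfigv1.HealthCheckConfig{
		Kind:    "health_check_config",
		Version: "v1",
		Metadata: &headerv1.Metadata{
			Name:        "default_kube",
			Description: "Enables health checks for all Kubernetes clusters by default",
			Labels:      map[string]string{"teleport.internal/resource-type": "preset"},
		},
		Spec: &healthcheckconfigv1.HealthCheckConfigSpec{
			Match: &healthcheckconfigv1.Matcher{
				// Match every Kubernetes cluster: any label name, any value.
				KubernetesLabels: []*labelv1.Label{{Name: "*", Values: []string{"*"}}},
			},
		},
	}
}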

In this PR:

  • Added default_kube health check config which enables health checks on all Kubernetes clusters
  • Revised initialization and insert logic for health check configs

Manual testing:

Manual testing was done twice: once with a new installation, and once along an upgrade path where the previous default health check config already existed.

Testing with a new installation

  • Deleted /var/lib/teleport/backend/sqlite.db.
  • Ran teleport with PR changes

Teleport terminal output shows Kubernetes health checks functioning.

2025-10-15T14:57:07.188-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-15T14:57:07.188-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default_kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-15T14:57:07.195-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Viewing the health check config records in the kv DB table shows separate health check configs for databases and Kubernetes clusters.
// DB Key: /health_check_config/default
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default",
        "namespace": "default",
        "description": "Enables health checks for all databases by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "dbLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

// DB Key: /health_check_config/default_kube
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default_kube",
        "namespace": "default",
        "description": "Enables health checks for all Kubernetes clusters by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "kubernetesLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

Testing with a previous installation

  • Deleted /var/lib/teleport/backend/sqlite.db.
  • Ran teleport without PR changes

Teleport terminal output shows that Kubernetes health checks are disabled.

2025-10-15T15:01:58.465-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:disabled message:No health check config matches this resource healthcheck/worker.go:423
  • Viewing the health check config records in the kv DB table shows a single health check config for databases.
// DB Key: /health_check_config/default
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default",
        "namespace": "default",
        "description": "Enables all health checks by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "dbLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}
  • Ran teleport with PR changes
    Teleport terminal output shows Kubernetes health checks functioning.
2025-10-15T15:06:52.318-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-15T15:06:52.318-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default_kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-15T15:06:52.326-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Viewing the health check config records in the kv DB table shows separate health check configs for databases and Kubernetes clusters.
// DB Key: /health_check_config/default
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default",
        "namespace": "default",
        "description": "Enables all health checks by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "dbLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

// DB Key: /health_check_config/default_kube
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default_kube",
        "namespace": "default",
        "description": "Enables health checks for all Kubernetes clusters by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "kubernetesLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

Relates to:

@rana rana added the kubernetes-access, no-changelog, and health-check labels Oct 11, 2025
@rana rana marked this pull request as ready for review October 11, 2025 01:27
@rana rana requested a review from rosstimothy October 11, 2025 01:29
Contributor

@rosstimothy rosstimothy left a comment

The PR description walks through what happens when creating a new cluster, but what about a cluster that is upgraded with an existing default health_check_config, will it be amended with kubernetes_labels?

Contributor Author

rana commented Oct 14, 2025

The PR description walks through what happens when creating a new cluster, but what about a cluster that is upgraded with an existing default health_check_config, will it be amended with kubernetes_labels?

With the current design, the default health_check_config would not be amended with kubernetes_labels.

@rana rana force-pushed the rana/kube-healthchecks-10 branch 3 times, most recently from c6b8fdb to daa8aee on October 15, 2025 19:04
lib/auth/init.go Outdated
Comment on lines 1491 to 1542
func createPresetHealthCheckConfig(ctx context.Context, svc services.HealthCheckConfig) error {
	// To support developing health checks for multiple resources over time,
	// while avoiding migration of the backend database,
	// and enabling ease of health check adoption:
	// - Create a health check preset for each resource (db, kube, etc)

	page, _, err := svc.ListHealthCheckConfigs(ctx, 0, "")
	if err != nil {
		return trace.Wrap(err, "failed listing available health check configs")
	}
	if len(page) == 0 {
		// No health check configs exist.
		// Create all preset configs.
		presetDB := services.NewPresetHealthCheckConfigDB()
		_, err = svc.CreateHealthCheckConfig(ctx, presetDB)
		if err != nil && !trace.IsAlreadyExists(err) {
			return trace.Wrap(err,
				"failed creating preset health_check_config %s",
				presetDB.GetMetadata().GetName(),
			)
		}
		presetKube := services.NewPresetHealthCheckConfigKube()
		_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
		if err != nil && !trace.IsAlreadyExists(err) {
			return trace.Wrap(err,
				"failed creating preset health_check_config %s",
				presetKube.GetMetadata().GetName(),
			)
		}
		return nil
	} else {
		// Health check configs exist.
		// Create per-resource presets.
		// Skip creating a DB preset; historically, it's the first, and already exists.

		// Look for an existing kube preset.
		for _, cfg := range page {
			if cfg.GetMetadata().GetName() == teleport.PresetDefaultHealthCheckConfigKubeName {
				return nil
			}
		}
		// Create a kube preset.
		presetKube := services.NewPresetHealthCheckConfigKube()
		_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
		if err != nil && !trace.IsAlreadyExists(err) {
			return trace.Wrap(err,
				"failed creating preset health_check_config %s",
				presetKube.GetMetadata().GetName(),
			)
		}
	}

Contributor

@fspmarshall would this be a good use case for AtomicWrite that asserts the existing key does not exist?
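
For readers following along: the suggestion is to replace the list-then-create sequence with a single conditional write that only succeeds when the key is absent. A minimal, self-contained sketch of that create-if-absent shape using hypothetical types (this is not Teleport's actual backend AtomicWrite API):

package presets

import (
	"context"
	"errors"
)

// errAlreadyExists is what the hypothetical backend below returns when the
// conditional write loses the race.
var errAlreadyExists = errors.New("key already exists")

// conditionalWriter is a hypothetical backend that writes a key only if it
// does not already exist, returning errAlreadyExists otherwise.
type conditionalWriter interface {
	PutIfNotExists(ctx context.Context, key string, value []byte) error
}

// createPresetIfAbsent creates the preset exactly once even if several auth
// servers start concurrently: losers of the race see errAlreadyExists and
// treat the preset as already created.
func createPresetIfAbsent(ctx context.Context, bk conditionalWriter, key string, value []byte) error {
	err := bk.PutIfNotExists(ctx, key, value)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}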

constants.go Outdated
Comment on lines 778 to 781
// PresetDefaultHealthCheckConfigDBName is the name of a preset
// health_check_config that enables health checks for all
// database resources.
PresetDefaultHealthCheckConfigDBName = "default_db"
Contributor

What are the implications of changing this name? If I already have a "default" health check config in my cluster and upgrade to a new version will I have a "default", "default_db" and "default_kube" health check? What if I've already modified the "default" config to meet my liking and now the "default_db" check adds all dbs again?

Contributor

For simplicity and overall fewer headaches, it might be worthwhile to revert this and leave "default" as the default db config.

Contributor Author

@rana rana Oct 15, 2025

It's a bit of a headache either way.

The idea is:

  • Pre-v18.4
    • default enables health check on all databases
    • default_kube enables health check on all Kubernetes clusters
  • v18.4 and later
    • default_db enables health check on all databases
    • default_kube enables health check on all Kubernetes clusters

Then update the docs to explain default and default_db in different versions.

If I already have a "default" health check config in my cluster and upgrade to a new version will I have a "default", "default_db" and "default_kube" health check? What if I've already modified the "default" config to meet my liking and now the "default_db" check adds all dbs again?

Upgrading would yield default and default_kube only. Any changes to default would remain as is.
New installations would yield default_db and default_kube.

Changing default to default_db is a headache. Also, explaining that default is just for databases is a bit of a headache. From a customer point of view, default may seem like it's the default for all resources. In the docs we would need to explain that it's for databases, and default_kube is for Kubernetes.

However you prefer. With that explanation, what are your thoughts on default and default_db?

Contributor

I still think it will be less of a headache and less work to use default and default_kube. If it proves to be problematic and a point of contention in the future we can reconsider, but for now I would start simple and iterate.

Contributor Author

Updated to default.

Contributor

@hugoShaka hugoShaka Oct 17, 2025

There might be a third way. I'm not sure I really like it but it works surprisingly well for installers:

  • There is no default resource in the backend.
  • When people get default and it returns trace.NotFoundErr, the auth returns the default value; this value depends on the Teleport version.
  • When people create the default installer, it lives in the backend and is returned by the auth, overriding the version's default.

This allows both: users can say "I opt out of changes, here's my own config" by creating the resource, and if they don't care, we can make things evolve without user interaction.
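
A rough sketch of that "virtual default" pattern, assuming the usual trace import and two hypothetical helpers (getStoredHealthCheckConfig and builtinDefaultHealthCheckConfig); it illustrates the alternative being described here, not what this PR implements:

// Hypothetical: serve a version-dependent built-in default unless a user has
// stored their own "default" resource in the backend.
func getDefaultHealthCheckConfig(ctx context.Context) (*healthcheckconfigv1.HealthCheckConfig, error) {
	cfg, err := getStoredHealthCheckConfig(ctx, "default")
	switch {
	case err == nil:
		// A user (or IaC) created their own default: it wins and opts them
		// out of future changes.
		return cfg, nil
	case trace.IsNotFound(err):
		// Nothing stored: return the built-in default for this Teleport
		// version, which could start matching Kubernetes clusters in a later
		// release without touching the backend.
		return builtinDefaultHealthCheckConfig(), nil
	default:
		return nil, trace.Wrap(err)
	}
}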

Contributor

@hugoShaka I like that approach, but it would in essence only work for new clusters, right? Existing clusters would already have a default with only database labels and we're back at the same place - there is no way for existing clusters to automatically have Kubernetes health checks enabled when they upgrade from 18.N to 18.3.0.

Contributor

Yes, only new clusters would get the new kube healthcheck on by default. I suppose that we could make a migration-like behaviour and delete the resource on auth startup if it is strictly equal to the preset we used to create (no modifications were made).

Contributor

That would be quite confusing if new auth and old auth instances coexisted and kept deleting and recreating the resource when restarted 😅.

In the hopes of reaching consensus and a path forward: the goal is to enable Kubernetes health checks by default, for new and existing clusters, without interfering with any human-made changes to existing default health checks, especially those that may already be managed by IaC.

  • Adding a new default_kube config achieves that, at the expense of requiring a default per health-checked resource (e.g. default_app, default_desktop, etc.).
  • We could follow the preset role mechanism and stipulate that the default is owned by Teleport and it shouldn't be modified. This will potentially affect any existing clusters that have altered the default labels.

It seems like the only objection to default_kube centers around creating multiple configs. However, it is also the only option presented so far that meets all of the above criteria. Unless there is another option that makes sense I'm inclined to say we should proceed with that route.

@r0mant since you have some context on health checks, and specifically the default health check, what are your thoughts on this?

Collaborator

Having separate default_<protocol> health check resources seems fine to me. It's a bit unfortunate that the database one is just called default but IMO it's not the end of the world and we can live with this inconsistency or figure out a migration if we want.

I don't think we can really prevent users from modifying them, they need to be able to disable default health checks or tweak their parameters after all. So they're not even presets strictly speaking.

The only thing is we'd need to document that if you delete a default_x health check, it will be recreated by the cluster, so if you want to change its behavior, you should edit the resource, not delete it.

@rana rana force-pushed the rana/kube-healthchecks-10 branch from daa8aee to a1610eb on October 15, 2025 21:47
@rana rana requested a review from GavinFrazar October 15, 2025 22:18
Contributor

@smallinsky smallinsky left a comment

Left a few suggestions.

lib/auth/init.go Outdated

// createPresetHealthCheckConfig creates a default preset health check config
// resource that enables health checks on all resources.
func createPresetHealthCheckConfig(ctx context.Context, svc services.HealthCheckConfig) error {
Contributor

nit: doc comment is outdated

Contributor Author

@rana rana Oct 18, 2025

Nice catch, comments are updated.

teleport/lib/auth/init.go

Lines 1500 to 1502 in f849c30

// createPresetHealthCheckConfigs creates a preset health check config
// for each resource using the healthcheck package.
func createPresetHealthCheckConfigs(ctx context.Context, svc services.HealthCheckConfig) error {

lib/auth/init.go Outdated
Comment on lines 1501 to 1540
if len(page) == 0 {
	// No health check configs exist.
	// Create all preset configs.
	presetDB := services.NewPresetHealthCheckConfigDB()
	_, err = svc.CreateHealthCheckConfig(ctx, presetDB)
	if err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err,
			"failed creating preset health_check_config %s",
			presetDB.GetMetadata().GetName(),
		)
	}
	presetKube := services.NewPresetHealthCheckConfigKube()
	_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
	if err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err,
			"failed creating preset health_check_config %s",
			presetKube.GetMetadata().GetName(),
		)
	}
	return nil
} else {
	// Health check configs exist.
	// Create per-resource presets.
	// Skip creating a DB preset; historically, it's the first, and already exists.

	// Look for an existing kube preset.
	for _, cfg := range page {
		if cfg.GetMetadata().GetName() == teleport.PresetDefaultHealthCheckConfigKubeName {
			return nil
		}
	}
	// Create a kube preset.
	presetKube := services.NewPresetHealthCheckConfigKube()
	_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
	if err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err,
			"failed creating preset health_check_config %s",
			presetKube.GetMetadata().GetName(),
		)
	}
Contributor

@smallinsky smallinsky Oct 17, 2025

If we want to build per-layer presets, can we consider building it in a generic way, like:

var presetsCreators = []func() *healthcheckconfigv1.HealthCheckConfig{
    services.NewPresetHealthCheckConfigDB,
    services.NewPresetHealthCheckConfigKube,
    // Easy for future extension 
}

func createPresetHealthCheckConfig(ctx context.Context, svc services.HealthCheckConfig) error {
    existing := make(map[string]bool)
    allHealthCheckConfigs, err := stream.Collect(clientutils.Resources(ctx, svc.ListHealthCheckConfigs))
    if err != nil {
        return trace.Wrap(err, "failed listing available health check configs")
    }
    for _, cfg := range allHealthCheckConfigs {
        existing[cfg.GetMetadata().GetName()] = true
    }
    
    for _, createFn := range presetsCreators {
        preset := createFn()
        if !existing[preset.GetMetadata().GetName()] {
            if err := createPresetIfNotExists(ctx, svc, preset); err != nil {
                return err
            }
        }
    }
    return nil
}

Contributor Author

Very nice generalized form for now and in the future.

Implemented with slight adjustments.

teleport/lib/auth/init.go

Lines 1502 to 1528 in f849c30

func createPresetHealthCheckConfigs(ctx context.Context, svc services.HealthCheckConfig) error {
	// The choice to create a preset per-resource is motivated by:
	// - Supporting existing Teleport clusters already using health checks with some resources
	// - Avoiding migration of the backend database, which avoids downtime and headaches
	// - Easing the adoption of health checks for new resources as they are developed over time
	exists := make(map[string]bool)
	cfgs, err := stream.Collect(clientutils.Resources(ctx, svc.ListHealthCheckConfigs))
	if err != nil {
		return trace.Wrap(err, "unable to list health check configs")
	}
	for _, cfg := range cfgs {
		exists[cfg.GetMetadata().GetName()] = true
	}
	var errs []error
	for name, newPreset := range newHealthPresets {
		if !exists[name] {
			if _, err = svc.CreateHealthCheckConfig(ctx, newPreset()); err != nil && !trace.IsAlreadyExists(err) {
				errs = append(errs, err)
			}
		}
	}
	if len(errs) > 0 {
		return trace.NewAggregate(errs...)
	}
	return nil
}
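
The newHealthPresets map referenced in the loop above is not part of this excerpt; a plausible shape, using the preset constructors from this PR and a hypothetical name constant for the database preset, would be:

// Plausible shape of newHealthPresets (not shown in the excerpt above). The
// DB preset name constant here is hypothetical; the kube one appears earlier
// in this review.
var newHealthPresets = map[string]func() *healthcheckconfigv1.HealthCheckConfig{
	teleport.PresetDefaultHealthCheckConfigName:     services.NewPresetHealthCheckConfigDB,
	teleport.PresetDefaultHealthCheckConfigKubeName: services.NewPresetHealthCheckConfigKube,
}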

A per-resource health check config approach is implemented for enabling ease of adoption of new health checks, while avoiding migration of the backend database.

Changes:
- Added `default_kube` health check config which enables health checks on all Kubernetes clusters
- Revised initialization and insert logic for health check configs

Part of #58413
@rana rana force-pushed the rana/kube-healthchecks-10 branch from f849c30 to a0c446f on October 18, 2025 00:30