Conversation

Contributor

@rana rana commented Oct 11, 2025

Kubernetes health checks are enabled by default.

A per-resource health check config approach is implemented to ease adoption of new health checks over time while avoiding migration of the backend database.
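
Concretely, each per-resource preset is just a health_check_config resource whose matcher selects every resource of that kind. The following is only a rough sketch of what the Kubernetes preset amounts to; the proto type and field names (HealthCheckConfigSpec, Matcher, headerv1.Metadata, labelv1.Label) are assumptions inferred from the JSON dumps in the testing notes below, not the actual definitions from this PR:

// Sketch only: type and field names are assumptions inferred from the stored
// JSON shown below; the real constructor in this PR is
// services.NewPresetHealthCheckConfigKube.
func newPresetHealthCheckConfigKube() *healthcheckconfigv1.HealthCheckConfig {
	return &healthcheckconfigv1.HealthCheckConfig{
		Kind:    "health_check_config",
		Version: "v1",
		Metadata: &headerv1.Metadata{
			Name:        "default_kube",
			Description: "Enables health checks for all Kubernetes clusters by default",
			Labels:      map[string]string{"teleport.internal/resource-type": "preset"},
		},
		Spec: &healthcheckconfigv1.HealthCheckConfigSpec{
			Match: &healthcheckconfigv1.Matcher{
				// Match every Kubernetes cluster: any label name, any value.
				KubernetesLabels: []*labelv1.Label{{Name: "*", Values: []string{"*"}}},
			},
		},
	}
}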

In this PR:

  • Added default_kube health check config which enables health checks on all Kubernetes clusters
  • Revised initialization and insert logic for health check configs

Manual testing:

Manual testing was done twice: once with a new installation, and once along an upgrade path where the previous default health check config already existed.

Testing with a new installation

  • Deleted /var/lib/teleport/backend/sqlite.db.
  • Ran teleport with PR changes

Teleport terminal output shows Kubernetes health checks functioning.

2025-10-15T14:57:07.188-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-15T14:57:07.188-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default_kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-15T14:57:07.195-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Viewing the health check config records in the kv DB table shows separate health check configs for databases and Kubernetes clusters.
// DB Key: /health_check_config/default
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default",
        "namespace": "default",
        "description": "Enables health checks for all databases by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "dbLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

// DB Key: /health_check_config/default_kube
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default_kube",
        "namespace": "default",
        "description": "Enables health checks for all Kubernetes clusters by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "kubernetesLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

Testing with a previous installation

  • Deleted /var/lib/teleport/backend/sqlite.db.
  • Ran teleport without PR changes

Teleport terminal output shows that Kubernetes health checks are disabled.

2025-10-15T15:01:58.465-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:disabled message:No health check config matches this resource healthcheck/worker.go:423
  • Viewing the health check config records in the kv DB table shows a single health check config for databases.
// DB Key: /health_check_config/default
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default",
        "namespace": "default",
        "description": "Enables all health checks by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "dbLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}
  • Ran teleport with PR changes
    Teleport terminal output shows Kubernetes health checks functioning.
2025-10-15T15:06:52.318-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-15T15:06:52.318-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default_kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-15T15:06:52.326-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Viewing the health check config records in the kv DB table shows separate health check configs for databases and Kubernetes clusters.
// DB Key: /health_check_config/default
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default",
        "namespace": "default",
        "description": "Enables all health checks by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "dbLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

// DB Key: /health_check_config/default_kube
{
    "kind": "health_check_config",
    "version": "v1",
    "metadata": {
        "name": "default_kube",
        "namespace": "default",
        "description": "Enables health checks for all Kubernetes clusters by default",
        "labels": {
            "teleport.internal/resource-type": "preset"
        }
    },
    "spec": {
        "match": {
            "kubernetesLabels": [
                {
                    "name": "*",
                    "values": [
                        "*"
                    ]
                }
            ]
        }
    }
}

Relates to:

@rana rana added the kubernetes-access, no-changelog, and health-check labels Oct 11, 2025
@rana rana marked this pull request as ready for review October 11, 2025 01:27
@rana rana requested a review from rosstimothy October 11, 2025 01:29
Contributor

@rosstimothy rosstimothy left a comment

The PR description walks through what happens when creating a new cluster, but what about a cluster that is upgraded with an existing default health_check_config, will it be amended with kubernetes_labels?

Contributor Author

rana commented Oct 14, 2025

The PR description walks through what happens when creating a new cluster, but what about a cluster that is upgraded with an existing default health_check_config, will it be amended with kubernetes_labels?

With the current design, the default health_check_config would not be amended with kubernetes_labels.

@rana rana force-pushed the rana/kube-healthchecks-10 branch 3 times, most recently from c6b8fdb to daa8aee on October 15, 2025 19:04
lib/auth/init.go Outdated
Comment on lines 1491 to 1542
func createPresetHealthCheckConfig(ctx context.Context, svc services.HealthCheckConfig) error {
	// To support developing health checks for multiple resources over time,
	// while avoiding migration of the backend database,
	// and enabling ease of health check adoption:
	// - Create a health check preset for each resource (db, kube, etc)

	page, _, err := svc.ListHealthCheckConfigs(ctx, 0, "")
	if err != nil {
		return trace.Wrap(err, "failed listing available health check configs")
	}
	if len(page) == 0 {
		// No health check configs exist.
		// Create all preset configs.
		presetDB := services.NewPresetHealthCheckConfigDB()
		_, err = svc.CreateHealthCheckConfig(ctx, presetDB)
		if err != nil && !trace.IsAlreadyExists(err) {
			return trace.Wrap(err,
				"failed creating preset health_check_config %s",
				presetDB.GetMetadata().GetName(),
			)
		}
		presetKube := services.NewPresetHealthCheckConfigKube()
		_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
		if err != nil && !trace.IsAlreadyExists(err) {
			return trace.Wrap(err,
				"failed creating preset health_check_config %s",
				presetKube.GetMetadata().GetName(),
			)
		}
		return nil
	} else {
		// Health check configs exist.
		// Create per-resource presets.
		// Skip creating a DB preset; historically, it's the first, and already exists.

		// Look for an existing kube preset.
		for _, cfg := range page {
			if cfg.GetMetadata().GetName() == teleport.PresetDefaultHealthCheckConfigKubeName {
				return nil
			}
		}
		// Create a kube preset.
		presetKube := services.NewPresetHealthCheckConfigKube()
		_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
		if err != nil && !trace.IsAlreadyExists(err) {
			return trace.Wrap(err,
				"failed creating preset health_check_config %s",
				presetKube.GetMetadata().GetName(),
			)
		}
	}

Contributor

@fspmarshall would this be a good use case for AtomicWrite that asserts the existing key does not exist?
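
For readers following along: the suggestion is to replace the list-then-create sequence with a single conditional write that only succeeds when the key is absent. A minimal, self-contained sketch of that create-if-absent shape using hypothetical types (this is not Teleport's actual backend AtomicWrite API):

package presets

import (
	"context"
	"errors"
)

// errAlreadyExists is what the hypothetical backend below returns when the
// conditional write loses the race.
var errAlreadyExists = errors.New("key already exists")

// conditionalWriter is a hypothetical backend that writes a key only if it
// does not already exist, returning errAlreadyExists otherwise.
type conditionalWriter interface {
	PutIfNotExists(ctx context.Context, key string, value []byte) error
}

// createPresetIfAbsent creates the preset exactly once even if several auth
// servers start concurrently: losers of the race see errAlreadyExists and
// treat the preset as already created.
func createPresetIfAbsent(ctx context.Context, bk conditionalWriter, key string, value []byte) error {
	err := bk.PutIfNotExists(ctx, key, value)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}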

constants.go Outdated
Comment on lines 778 to 781
// PresetDefaultHealthCheckConfigDBName is the name of a preset
// health_check_config that enables health checks for all
// database resources.
PresetDefaultHealthCheckConfigDBName = "default_db"
Contributor

What are the implications of changing this name? If I already have a "default" health check config in my cluster and upgrade to a new version will I have a "default", "default_db" and "default_kube" health check? What if I've already modified the "default" config to meet my liking and now the "default_db" check adds all dbs again?

Contributor

For simplicity and overall fewer headaches, it might be worthwhile to revert this and leave "default" as the default db config.

Contributor Author

@rana rana Oct 15, 2025

It's a bit of a headache either way.

The idea is:

  • Pre-v18.4
    • default enables health check on all databases
    • default_kube enables health check on all Kubernetes clusters
  • v18.4 and later
    • default_db enables health check on all databases
    • default_kube enables health check on all Kubernetes clusters

Then update the docs to explain default and default_db in different versions.

If I already have a "default" health check config in my cluster and upgrade to a new version will I have a "default", "default_db" and "default_kube" health check? What if I've already modified the "default" config to meet my liking and now the "default_db" check adds all dbs again?

Upgrading would yield default and default_kube only. Any changes to default would remain as is.
New installations would yield default_db and default_kube.

Changing default to default_db is a headache. Also, explaining that default is just for databases is a bit of a headache. From a customer point of view, default may seem like it's the default for all resources. In the docs we would need to explain that it's for databases, and default_kube is for Kubernetes.

However you prefer. With that explanation, what are your thoughts on default and default_db?

Contributor

I still think it will be less of a headache and less work to use default and default_kube. If it proves to be problematic and a point of contention in the future we can reconsider, but for now I would start simple and iterate.

Contributor Author

Updated to default.

Contributor

@hugoShaka hugoShaka Oct 17, 2025

There might be a third way. I'm not sure I really like it but it works surprisingly well for installers:

  • There is no default resource in the backend.
  • When people get default and it returns trace.NotFoundErr, the auth returns the default value; this value depends on the Teleport version.
  • When people create the default installer, it lives in the backend and is returned by the auth, overriding the version's default.

This allows both: users can say "I opt out of changes, here's my own config" by creating the resource, and if they don't care, we can make things evolve without user interaction.
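
A rough sketch of that "virtual default" pattern, assuming the usual trace import and two hypothetical helpers (getStoredHealthCheckConfig and builtinDefaultHealthCheckConfig); it illustrates the alternative being described here, not what this PR implements:

// Hypothetical: serve a version-dependent built-in default unless a user has
// stored their own "default" resource in the backend.
func getDefaultHealthCheckConfig(ctx context.Context) (*healthcheckconfigv1.HealthCheckConfig, error) {
	cfg, err := getStoredHealthCheckConfig(ctx, "default")
	switch {
	case err == nil:
		// A user (or IaC) created their own default: it wins and opts them
		// out of future changes.
		return cfg, nil
	case trace.IsNotFound(err):
		// Nothing stored: return the built-in default for this Teleport
		// version, which could start matching Kubernetes clusters in a later
		// release without touching the backend.
		return builtinDefaultHealthCheckConfig(), nil
	default:
		return nil, trace.Wrap(err)
	}
}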

Contributor

@hugoShaka I like that approach, but it would in essence only work for new clusters, right? Existing clusters would already have a default with only database labels and we're back at the same place - there is no way for existing clusters to automatically have Kubernetes health checks enabled when they upgrade from 18.N to 18.3.0.

Contributor

Yes, only new clusters would get the new kube healthcheck on by default. I suppose that we could make a migration-like behaviour and delete the resource on auth startup if it is strictly equal to the preset we used to create (no modifications were made).

Contributor

That would be quite confusing if new auth and old auth instances coexisted and kept deleting and recreating the resource when restarted 😅.

In the hopes of reaching consensus and a path forward: the goal is to enable Kubernetes health checks by default, for new and existing clusters, without interfering with any human-made changes to existing default health checks, especially those that may already be managed by IaC.

  • Adding a new default_kube config achieves that, at the expense of requiring a default per health-checked resource (e.g. default_app, default_desktop, etc.).
  • We could follow the preset role mechanism and stipulate that the default is owned by Teleport and it shouldn't be modified. This will potentially affect any existing clusters that have altered the default labels.

It seems like the only objection to default_kube centers around creating multiple configs. However, it is also the only option presented so far that meets all of the above criteria. Unless there is another option that makes sense I'm inclined to say we should proceed with that route.

@r0mant since you have some context on health checks, and specifically the default health check, what are your thoughts on this?

Collaborator

Having separate default_<protocol> health check resources seems fine to me. It's a bit unfortunate that the database one is just called default but IMO it's not the end of the world and we can live with this inconsistency or figure out a migration if we want.

I don't think we can really prevent users from modifying them, they need to be able to disable default health checks or tweak their parameters after all. So they're not even presets strictly speaking.

The only thing is we'd need to document that if you delete a default_x health check, it will be recreated by the cluster, so if you want to change its behavior, you should edit the resource, not delete it.

@rana rana force-pushed the rana/kube-healthchecks-10 branch from daa8aee to a1610eb on October 15, 2025 21:47
@rana rana requested a review from GavinFrazar October 15, 2025 22:18
Contributor

@smallinsky smallinsky left a comment

Left a few suggestions.

lib/auth/init.go Outdated

// createPresetHealthCheckConfig creates a default preset health check config
// resource that enables health checks on all resources.
func createPresetHealthCheckConfig(ctx context.Context, svc services.HealthCheckConfig) error {
Contributor

nit: doc comment is outdated

Contributor Author

@rana rana Oct 18, 2025

Nice catch, comments are updated.

teleport/lib/auth/init.go

Lines 1500 to 1502 in f849c30

// createPresetHealthCheckConfigs creates a preset health check config
// for each resource using the healthcheck package.
func createPresetHealthCheckConfigs(ctx context.Context, svc services.HealthCheckConfig) error {

lib/auth/init.go Outdated
Comment on lines 1501 to 1540
if len(page) == 0 {
	// No health check configs exist.
	// Create all preset configs.
	presetDB := services.NewPresetHealthCheckConfigDB()
	_, err = svc.CreateHealthCheckConfig(ctx, presetDB)
	if err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err,
			"failed creating preset health_check_config %s",
			presetDB.GetMetadata().GetName(),
		)
	}
	presetKube := services.NewPresetHealthCheckConfigKube()
	_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
	if err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err,
			"failed creating preset health_check_config %s",
			presetKube.GetMetadata().GetName(),
		)
	}
	return nil
} else {
	// Health check configs exist.
	// Create per-resource presets.
	// Skip creating a DB preset; historically, it's the first, and already exists.

	// Look for an existing kube preset.
	for _, cfg := range page {
		if cfg.GetMetadata().GetName() == teleport.PresetDefaultHealthCheckConfigKubeName {
			return nil
		}
	}
	// Create a kube preset.
	presetKube := services.NewPresetHealthCheckConfigKube()
	_, err = svc.CreateHealthCheckConfig(ctx, presetKube)
	if err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err,
			"failed creating preset health_check_config %s",
			presetKube.GetMetadata().GetName(),
		)
	}
Contributor

@smallinsky smallinsky Oct 17, 2025

If we want to build per-layer presets, can we consider building it in a generic way, like:

var presetsCreators = []func() *healthcheckconfigv1.HealthCheckConfig{
    services.NewPresetHealthCheckConfigDB,
    services.NewPresetHealthCheckConfigKube,
    // Easy for future extension 
}

func createPresetHealthCheckConfig(ctx context.Context, svc services.HealthCheckConfig) error {
    existing := make(map[string]bool)
    allHealthCheckConfigs, err := stream.Collect(clientutils.Resources(ctx, svc.ListHealthCheckConfigs))
    if err != nil {
        return trace.Wrap(err, "failed listing available health check configs")
    }
    for _, cfg := range allHealthCheckConfigs {
        existing[cfg.GetMetadata().GetName()] = true
    }
    
    for _, createFn := range presetsCreators {
        preset := createFn()
        if !existing[preset.GetMetadata().GetName()] {
            if err := createPresetIfNotExists(ctx, svc, preset); err != nil {
                return err
            }
        }
    }
    return nil
}

Contributor Author

Very nice generalized form for now and in the future.

Implemented with slight adjustments.

teleport/lib/auth/init.go

Lines 1502 to 1528 in f849c30

func createPresetHealthCheckConfigs(ctx context.Context, svc services.HealthCheckConfig) error {
	// The choice to create a preset per-resource is motivated by:
	// - Supporting existing Teleport clusters already using health checks with some resources
	// - Avoiding migration of the backend database, which avoids downtime and headaches
	// - Easing the adoption of health checks for new resources as they are developed over time
	exists := make(map[string]bool)
	cfgs, err := stream.Collect(clientutils.Resources(ctx, svc.ListHealthCheckConfigs))
	if err != nil {
		return trace.Wrap(err, "unable to list health check configs")
	}
	for _, cfg := range cfgs {
		exists[cfg.GetMetadata().GetName()] = true
	}
	var errs []error
	for name, newPreset := range newHealthPresets {
		if !exists[name] {
			if _, err = svc.CreateHealthCheckConfig(ctx, newPreset()); err != nil && !trace.IsAlreadyExists(err) {
				errs = append(errs, err)
			}
		}
	}
	if len(errs) > 0 {
		return trace.NewAggregate(errs...)
	}
	return nil
}
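
The newHealthPresets map referenced in the loop above is not part of this excerpt; a plausible shape, using the preset constructors from this PR and a hypothetical name constant for the database preset, would be:

// Plausible shape of newHealthPresets (not shown in the excerpt above). The
// DB preset name constant here is hypothetical; the kube one appears earlier
// in this review.
var newHealthPresets = map[string]func() *healthcheckconfigv1.HealthCheckConfig{
	teleport.PresetDefaultHealthCheckConfigName:     services.NewPresetHealthCheckConfigDB,
	teleport.PresetDefaultHealthCheckConfigKubeName: services.NewPresetHealthCheckConfigKube,
}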

A per-resource health check config approach is implemented for enabling ease of adoption of new health checks, while avoiding migration of the backend database.

Changes:
- Added `default_kube` health check config which enables health checks on all Kubernetes clusters
- Revised initialization and insert logic for health check configs

Part of #58413
@rana rana force-pushed the rana/kube-healthchecks-10 branch from f849c30 to a0c446f on October 18, 2025 00:30