Description
Expected Behavior
The `tekton_pipelines_controller_taskruns_pod_latency_milliseconds` metric should have bounded cardinality, similar to other controller metrics like `taskrun_duration_seconds` which respect the `metrics.taskrun.level` configuration. The `pod` label should either be removable via configuration or not included by default, since pod names are unique per TaskRun and create unbounded time series.
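For context, `metrics.taskrun.level` is set through the `config-observability` ConfigMap (per `docs/metrics.md`). A sketch of that existing knob, which this metric currently ignores; the namespace shown is the default upstream install location and may differ on your cluster:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # Collapses the task/taskrun labels on duration metrics,
  # but has no effect on the pod label of pod_latency.
  metrics.taskrun.level: "namespace"
```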
Actual Behavior
The metric includes a hardcoded `pod` tag key in the view definition that records the unique pod name for every TaskRun via `tag.Insert(podTag, pod.Name)`. Combined with `view.LastValue()` aggregation (which retains all label sets indefinitely), this creates O(N) time series where N = total TaskRuns ever created.
On busy clusters (e.g., Konflux production), this results in:
- 1.2 million+ active time series from this single metric
- Prometheus ServiceMonitor scraping timeouts
- Complete loss of ALL Tekton metrics (not just this one) because the scrape endpoint becomes too large to read within the timeout
The metric is also undocumented: it does not appear in docs/metrics.md or on https://tekton.dev/docs/pipelines/metrics/.
Evidence:
```shell
curl http://localhost:9095/metrics | grep tekton_pipelines_controller_taskruns_pod_latency_milliseconds | wc -l
# Returns: 1213602 active series
```
Steps to Reproduce the Problem
- Deploy Tekton Pipelines on a cluster with moderate-to-heavy TaskRun volume
- Run many TaskRuns over time (each creates a unique pod)
- Query the metrics endpoint:

  ```shell
  curl http://localhost:9095/metrics | grep tekton_pipelines_controller_taskruns_pod_latency_milliseconds | wc -l
  ```

- Observe that the series count grows linearly with total TaskRuns and never decreases
- Eventually, Prometheus scrape timeouts occur and ALL Tekton metrics are lost
Root cause in code:
- Metric defined: `pkg/taskrunmetrics/metrics.go:97-99` — `podLatency` as `stats.Float64`
- Pod tag key: `pkg/taskrunmetrics/metrics.go:56` — `podTag = tag.MustNewKey("pod")`
- View with hardcoded pod label: `pkg/taskrunmetrics/metrics.go:251-256` — `podLatencyView` with `podTag` in `TagKeys`
- Recording with pod.Name: `pkg/taskrunmetrics/metrics.go:556` — `tag.Insert(podTag, pod.Name)`
- Called from: `pkg/reconciler/taskrun/taskrun.go:721` — `c.metrics.RecordPodLatency(ctx, pod, tr)`
- No config disable: `pkg/apis/config/metrics.go:101-108` — the `Metrics` struct has no field for disabling pod latency; `metrics.taskrun.level` (line 26) only affects the `task`/`taskrun` labels, NOT the `pod` label
Additional Info
- Kubernetes version: v1.30.14 (OCP 4.17.45)
- Tekton Pipeline version: v5.0.5-769 (openshift-pipelines-operator-rh), upstream equivalent varies
- The metric was originally `taskruns_pod_latency`, renamed to `taskruns_pod_latency_milliseconds` in PR #6891 (v0.50.0)
- Related cardinality issues: #7811, #7373, #2872 (none specifically about `pod_latency`)
- Workaround: drop the metric at the Prometheus scrape level:

  ```yaml
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'tekton_pipelines_controller_taskruns_pod_latency_milliseconds'
      action: drop
  ```
Possible fixes:
- Add a `metrics.taskrun.pod-latency.enable` config knob to disable this metric (backwards compatible)
- Make the metric respect `metrics.taskrun.level`: at `namespace` level, aggregate without pod/task/taskrun labels
- Remove the `pod` label entirely (breaking change for anyone using per-pod latency data)
- Document the metric and its cardinality implications in `docs/metrics.md`
/kind bug