taskruns_pod_latency_milliseconds metric has unbounded cardinality due to hardcoded pod label #9393

@waveywaves

Description

Expected Behavior

The tekton_pipelines_controller_taskruns_pod_latency_milliseconds metric should have bounded cardinality, like other controller metrics such as taskrun_duration_seconds, which respect the metrics.taskrun.level configuration. The pod label should either be removable via configuration or omitted by default, since pod names are unique per TaskRun and therefore produce an unbounded number of time series.

Actual Behavior

The metric includes a hardcoded pod tag key in its view definition and records the unique pod name for every TaskRun via tag.Insert(podTag, pod.Name). Combined with view.LastValue() aggregation (which retains every label set indefinitely), this creates O(N) time series, where N is the total number of TaskRuns ever created.

On busy clusters (e.g., Konflux production), this results in:

  • 1.2 million+ active time series from this single metric
  • Prometheus ServiceMonitor scraping timeouts
  • Complete loss of ALL Tekton metrics (not just this one) because the scrape endpoint becomes too large to read within the timeout

The metric is also undocumented: it appears neither in docs/metrics.md nor at https://tekton.dev/docs/pipelines/metrics/

Evidence:

curl http://localhost:9095/metrics | grep tekton_pipelines_controller_taskruns_pod_latency_milliseconds | wc -l
# Returns: 1213602 active series

Steps to Reproduce the Problem

  1. Deploy Tekton Pipelines on a cluster with moderate-to-heavy TaskRun volume
  2. Run many TaskRuns over time (each creates a unique pod)
  3. Query the metrics endpoint: curl http://localhost:9095/metrics | grep tekton_pipelines_controller_taskruns_pod_latency_milliseconds | wc -l
  4. Observe the series count grows linearly with total TaskRuns and never decreases
  5. Eventually, Prometheus scrape timeouts occur and ALL tekton metrics are lost

Root cause in code:

Additional Info

  • Kubernetes version: v1.30.14 (OCP 4.17.45)
  • Tekton Pipeline version: v5.0.5-769 (openshift-pipelines-operator-rh), upstream equivalent varies
  • The metric was originally taskruns_pod_latency, renamed to taskruns_pod_latency_milliseconds in PR #6891 (v0.50.0)
  • Related cardinality issues: #7811, #7373, #2872 (none specifically about pod_latency)
  • Workaround: Drop the metric at Prometheus scrape level:
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'tekton_pipelines_controller_taskruns_pod_latency_milliseconds'
        action: drop

Possible fixes:

  1. Add a metrics.taskrun.pod-latency.enable config knob to disable this metric (backwards compatible)
  2. Make the metric respect metrics.taskrun.level — at namespace level, aggregate without pod/task/taskrun labels
  3. Remove the pod label entirely (breaking change for anyone using per-pod latency data)
  4. Document the metric and its cardinality implications in docs/metrics.md
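For comparison, fix 2 would make the metric honor the existing config-observability ConfigMap setting that taskrun_duration_seconds already respects. Roughly (namespace and exact behavior per level are as documented for the existing duration metrics; treating pod_latency the same way is the proposal, not current behavior):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # At "namespace" level, duration metrics drop the per-taskrun/task
  # labels; under fix 2, pod_latency would also drop the pod label.
  metrics.taskrun.level: "namespace"
```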

/kind bug
