taskruns_pod_latency_milliseconds metric has unbounded cardinality due to hardcoded pod label #9393

@waveywaves

Description

Expected Behavior

The tekton_pipelines_controller_taskruns_pod_latency_milliseconds metric should have bounded cardinality, like other controller metrics such as taskrun_duration_seconds, which respect the metrics.taskrun.level configuration. The pod label should either be removable via configuration or omitted by default, since pod names are unique per TaskRun and therefore produce an unbounded number of time series.

Actual Behavior

The metric includes a hardcoded pod tag key in its view definition and records the unique pod name for every TaskRun via tag.Insert(podTag, pod.Name). Combined with view.LastValue() aggregation (which retains every label set indefinitely), this creates O(N) time series, where N is the total number of TaskRuns ever created.

On busy clusters (e.g., Konflux production), this results in:

  • 1.2 million+ active time series from this single metric
  • Prometheus ServiceMonitor scraping timeouts
  • Complete loss of ALL Tekton metrics (not just this one) because the scrape endpoint becomes too large to read within the timeout

The metric is also undocumented: it appears neither in docs/metrics.md nor at https://tekton.dev/docs/pipelines/metrics/

Evidence:

curl http://localhost:9095/metrics | grep tekton_pipelines_controller_taskruns_pod_latency_milliseconds | wc -l
# Returns: 1213602 active series

Steps to Reproduce the Problem

  1. Deploy Tekton Pipelines on a cluster with moderate-to-heavy TaskRun volume
  2. Run many TaskRuns over time (each creates a unique pod)
  3. Query the metrics endpoint: curl http://localhost:9095/metrics | grep tekton_pipelines_controller_taskruns_pod_latency_milliseconds | wc -l
  4. Observe the series count grows linearly with total TaskRuns and never decreases
  5. Eventually, Prometheus scrape timeouts occur and ALL tekton metrics are lost

Root cause in code:

Additional Info

  • Kubernetes version: v1.30.14 (OCP 4.17.45)
  • Tekton Pipeline version: v5.0.5-769 (openshift-pipelines-operator-rh), upstream equivalent varies
  • The metric was originally taskruns_pod_latency, renamed to taskruns_pod_latency_milliseconds in PR #6891 (v0.50.0)
  • Related cardinality issues: #7811, #7373, #2872 (none specifically about pod_latency)
  • Workaround: Drop the metric at Prometheus scrape level:
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'tekton_pipelines_controller_taskruns_pod_latency_milliseconds'
        action: drop

Possible fixes:

  1. Add a metrics.taskrun.pod-latency.enable config knob to disable this metric (backwards compatible)
  2. Make the metric respect metrics.taskrun.level — at namespace level, aggregate without pod/task/taskrun labels
  3. Remove the pod label entirely (breaking change for anyone using per-pod latency data)
  4. Document the metric and its cardinality implications in docs/metrics.md
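For comparison, fix 2 would make the metric honor the existing config-observability ConfigMap setting that taskrun_duration_seconds already respects. Roughly (namespace and exact behavior per level are as documented for the existing duration metrics; treating pod_latency the same way is the proposal, not current behavior):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # At "namespace" level, duration metrics drop the per-taskrun/task
  # labels; under fix 2, pod_latency would also drop the pod label.
  metrics.taskrun.level: "namespace"
```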

/kind bug
