
Optimize etcd/apiserver histogram storage after 15 days #10887

@egulatee

Description

Overview

Optimize Prometheus storage usage by reducing histogram bucket cardinality for etcd and apiserver metrics while maintaining diagnostic value through pre-computed recording rules. This builds on Issue #78's storage breakdown visualizations.

Research Findings

Industry Best Practices

  • Cloud Providers: GKE/EKS/AKS retain control plane metrics for 10d-365d+ with configurable retention
  • Compliance: SOC2 requires 1 year, ISO27001 recommends 3 years
  • Thanos Standard: 7d raw, 30d 5-min, 2y 1-hour (matches our current config ✅)
  • Production Companies: Datadog (15mo), Grafana Cloud (13mo), Red Hat (24h + Thanos)

High-Cardinality Metrics Analysis

From the current cardinality dashboard:

| Metric | Series Count | Current Action | Optimization Opportunity |
|---|---|---|---|
| etcd_request_duration_seconds_bucket | 24.9K | ✅ Dropped | Already optimized |
| apiserver_request_sli_duration_seconds_bucket | 14.2K | ❌ Not addressed | Drop entirely (has summary metrics) |
| apiserver_request_duration_seconds_bucket | 12.0K | ⚠️ Reduced 90% | Keep as-is (essential) |
| apiserver_request_body_size_bytes_bucket | 6.05K | ❌ Not addressed | Reduce to 8 buckets |
| apiserver_response_sizes_bucket | 4.59K | ❌ Not addressed | Reduce to 8 buckets |

Total potential savings: ~21K series (35% reduction)

Value After 15 Days

| Time Period | Resolution | Histogram Value | Recommendation |
|---|---|---|---|
| 0-7 days | Raw (1-min) | ⭐⭐⭐⭐⭐ Critical | Keep all (already configured) |
| 7-30 days | 5-min | ⭐⭐⭐⭐ High | Keep downsampled (already configured) |
| 30-90 days | 5-min | ⭐⭐⭐ Medium | Use recording rules |
| 90+ days | 1-hour | ⭐⭐ Low | Pre-computed percentiles only |

Key Finding: After 15 days, granular histogram buckets have diminishing value. Recording rules provide the same diagnostic capability with 99% less storage.

Proposed Changes

1. Add Recording Rules (Safe Addition)

Create a new PrometheusRule resource (sketched after this list) with:

  • apiserver percentiles: P50/P90/P95/P99 by verb/resource (evaluated every 1min)
  • etcd capacity aggregates: DB size trends, disk I/O max (evaluated every 5min)
  • Result: ~200 new low-cardinality series vs 21K high-cardinality histogram buckets
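
A minimal sketch of the apiserver rule group, written as a TypeScript object in the same style as prometheus/index.ts. The group name and the 5m rate window are assumptions; the record names follow the apiserver:request_duration_seconds:p95 convention queried in the validation steps below, and how the object is wired into a PrometheusRule resource depends on this stack's conventions.

// Sketch only: shape of the recording-rule group.
const quantiles: Array<[string, number]> = [
    ["p50", 0.5],
    ["p90", 0.9],
    ["p95", 0.95],
    ["p99", 0.99],
];

const apiserverPercentileGroup = {
    name: "apiserver-request-duration-percentiles", // placeholder group name
    interval: "1m", // evaluated every 1 min, as proposed above
    rules: quantiles.map(([suffix, q]) => ({
        record: `apiserver:request_duration_seconds:${suffix}`,
        expr:
            `histogram_quantile(${q}, sum by (verb, resource, le) ` +
            `(rate(apiserver_request_duration_seconds_bucket[5m])))`,
    })),
};

// The etcd capacity group follows the same shape with interval: "5m",
// aggregating DB size trends and disk I/O maxima instead of latency buckets.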

2. Extend Metric Relabeling

Update metricRelabelConfigs in prometheus/index.ts:

// Drop apiserver SLI histogram (has summary metrics already)
{
    sourceLabels: ["__name__"],
    regex: "apiserver_request_sli_duration_seconds_bucket",
    action: "drop"
},

// Thin the request-body-size and response-size histograms down to 8 essential
// buckets each. A bare "keep" action would discard every other metric in the
// scrape, so the wanted buckets are first marked with a temporary label and
// the unmarked buckets of these two metrics are then dropped.
{
    sourceLabels: ["__name__", "le"],
    regex: "(apiserver_request_body_size_bytes_bucket|apiserver_response_sizes_bucket);(1024|4096|16384|65536|262144|1\\.048576e\\+06|4\\.194304e\\+06|\\+Inf)",
    targetLabel: "__tmp_keep_bucket",
    replacement: "true",
    action: "replace"
},
{
    sourceLabels: ["__name__", "__tmp_keep_bucket"],
    regex: "(apiserver_request_body_size_bytes_bucket|apiserver_response_sizes_bucket);",
    action: "drop"
},
{
    // Remove the temporary marker label before ingestion
    regex: "__tmp_keep_bucket",
    action: "labeldrop"
}

3. Verify Thanos Compactor

Ensure explicit retention flags:

"--retention.resolution-raw=7d",
"--retention.resolution-5m=30d",
"--retention.resolution-1h=730d"

4. Documentation

Create RETENTION-POLICY.md documenting:

  • Retention strategy rationale
  • Industry research findings
  • Compliance requirements (SOC2/ISO27001)
  • Quarterly review process

Expected Outcomes

Storage Impact

  • Series reduction: 60K → 39K (35% reduction)
  • Local storage: 2.6GB → 1.7GB (15-day Prometheus)
  • Thanos storage: 2.5GB → 2.0GB (2-year retention)
  • Recording rules overhead: +200 series (+420MB over 2 years; back-of-envelope check below)
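
As a back-of-the-envelope check on that overhead, assuming roughly 2 bytes per compressed sample: 200 series × 1 sample/min × (60 × 24 × 730) min ≈ 2.1 × 10⁸ samples, or about 420MB over two years.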

Performance Impact

  • Query speed: 10-100× faster for percentile queries (pre-computed vs histogram_quantile; example below)
  • Dashboard load time: Reduced (fewer series to query)
  • Alert evaluation: Faster (use recording rules in alert expressions)
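
To make the percentile speed-up concrete, the same P95 question asked both ways (PromQL given as strings; the grouping mirrors the recording-rule sketch above):

// Query-time quantile computed over thousands of raw bucket series.
const p95FromBuckets =
    "histogram_quantile(0.95, sum by (verb, resource, le) " +
    "(rate(apiserver_request_duration_seconds_bucket[5m])))";

// Same answer from the pre-computed recording rule: a single series lookup.
const p95FromRecordingRule = "apiserver:request_duration_seconds:p95";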

No Regressions

  • ✅ All diagnostic capability preserved via recording rules
  • ✅ 2-year retention maintained (compliance requirements)
  • ✅ SLO tracking unaffected (use pre-computed percentiles)
  • ✅ Capacity planning improved (dedicated aggregates)

Implementation Plan

  1. Phase 1: Add recording rules (safe - only adds new metrics)
  2. Phase 2: Validate recording rules in Grafana dashboards (1 week test)
  3. Phase 3: Add metric relabeling rules (reduces cardinality at scrape time)
  4. Phase 4: Verify Thanos Compactor configuration
  5. Phase 5: Create documentation
  6. Phase 6: Monitor for 24-48 hours, then merge

Validation Steps

# Check series count reduction
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | to_entries | sort_by(.value) | reverse | .[0:15]'

# Verify recording rules are evaluating
curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[] | select(.name | contains("apiserver"))'

# Check Thanos downsampling success
kubectl logs -n prometheus thanos-compactor-0 | grep "downsampling"

# Test percentile query performance
time curl -s 'http://prometheus:9090/api/v1/query?query=apiserver:request_duration_seconds:p95'

Risk Mitigation

| Risk | Mitigation |
|---|---|
| Recording rules fail | Deploy with validation, monitor evaluation duration |
| Dashboards break | Test with recording rules before dropping buckets |
| Data loss | Recording rules capture percentiles before bucket cleanup |
| Thanos compactor issues | Verify configuration, monitor compaction metrics |

References

Related Work

This optimization follows the successful pattern from Issue #78 (storage breakdown visualizations) and the existing drop of etcd_request_duration_seconds_bucket noted in the cardinality table above.
