Overview
Optimize Prometheus storage usage by reducing histogram bucket cardinality for etcd and apiserver metrics while maintaining diagnostic value through pre-computed recording rules. This builds on Issue #78's storage breakdown visualizations.
Research Findings
Industry Best Practices
- Cloud Providers: GKE/EKS/AKS retain control plane metrics for 10d-365d+ with configurable retention
- Compliance: SOC2 requires 1 year, ISO27001 recommends 3 years
- Thanos Standard: 7d raw, 30d 5-min, 2y 1-hour (matches our current config ✅)
- Production Companies: Datadog (15mo), Grafana Cloud (13mo), Red Hat (24h + Thanos)
High-Cardinality Metrics Analysis
From current cardinality dashboard:
| Metric | Series Count | Current Action | Optimization Opportunity |
|---|---|---|---|
| etcd_request_duration_seconds_bucket | 24.9K | ✅ Dropped | Already optimized |
| apiserver_request_sli_duration_seconds_bucket | 14.2K | ❌ Not addressed | Drop entirely (has summary metrics) |
| apiserver_request_duration_seconds_bucket | 12.0K | Keep as-is (essential) | |
| apiserver_request_body_size_bytes_bucket | 6.05K | ❌ Not addressed | Reduce to 8 buckets |
| apiserver_response_sizes_bucket | 4.59K | ❌ Not addressed | Reduce to 8 buckets |
Total potential savings: ~21K series (35% reduction)
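For reference, these per-metric counts can be reproduced directly against Prometheus with a query along the lines of `topk(15, count by (__name__) ({__name__=~"apiserver_.*|etcd_.*"}))` (illustrative query; the cardinality dashboard above remains the authoritative source).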
Value After 15 Days
| Time Period | Resolution | Histogram Value | Recommendation |
|---|---|---|---|
| 0-7 days | Raw (1-min) | ⭐⭐⭐⭐⭐ Critical | Keep all (already configured) |
| 7-30 days | 5-min | ⭐⭐⭐⭐ High | Keep downsampled (already configured) |
| 30-90 days | 5-min | ⭐⭐⭐ Medium | Use recording rules |
| 90+ days | 1-hour | ⭐⭐ Low | Pre-computed percentiles only |
Key Finding: After 15 days, granular histogram buckets have diminishing value. Recording rules provide the same diagnostic capability with 99% less storage.
Proposed Changes
1. Add Recording Rules (Safe Addition)
Create new PrometheusRule resource with:
- apiserver percentiles: P50/P90/P95/P99 by verb/resource (evaluated every 1min)
- etcd capacity aggregates: DB size trends, disk I/O max (evaluated every 5min)
- Result: ~200 new low-cardinality series vs 21K high-cardinality histogram buckets
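A minimal sketch of what that PrometheusRule could look like, assuming the stack provisions Kubernetes objects from TypeScript via Pulumi's `@pulumi/kubernetes` provider (resource name, namespace, and the exact rule set are illustrative; the recording-rule names match the `apiserver:request_duration_seconds:p95` series queried in the validation steps below):

```ts
import * as k8s from "@pulumi/kubernetes";

// Quantiles to pre-compute for apiserver request latency.
const quantiles: [string, string][] = [
    ["p50", "0.50"],
    ["p90", "0.90"],
    ["p95", "0.95"],
    ["p99", "0.99"],
];

// PrometheusRule with pre-computed percentiles and etcd capacity aggregates,
// replacing ad-hoc histogram_quantile() queries over high-cardinality buckets.
export const controlPlaneRecordingRules = new k8s.apiextensions.CustomResource("control-plane-recording-rules", {
    apiVersion: "monitoring.coreos.com/v1",
    kind: "PrometheusRule",
    metadata: { name: "control-plane-recording-rules", namespace: "prometheus" },
    spec: {
        groups: [
            {
                name: "apiserver.percentiles",
                interval: "1m", // evaluated every 1min
                rules: quantiles.map(([label, q]) => ({
                    record: `apiserver:request_duration_seconds:${label}`,
                    expr:
                        `histogram_quantile(${q}, sum by (verb, resource, le) ` +
                        `(rate(apiserver_request_duration_seconds_bucket[5m])))`,
                })),
            },
            {
                name: "etcd.capacity",
                interval: "5m", // evaluated every 5min
                rules: [
                    {
                        // DB size trend per instance for capacity planning;
                        // disk I/O maxima would follow the same pattern.
                        record: "etcd:mvcc_db_total_size_in_bytes:max",
                        expr: "max by (instance) (etcd_mvcc_db_total_size_in_bytes)",
                    },
                ],
            },
        ],
    },
});
```

Because this is purely additive (Phase 1), it can ship ahead of any relabeling change and be compared against raw `histogram_quantile` results during the Phase 2 dashboard validation.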
2. Extend Metric Relabeling
Update metricRelabelConfigs in prometheus/index.ts:
// Drop apiserver SLI histogram (summary metrics already cover it)
{
  sourceLabels: ["__name__"],
  regex: "apiserver_request_sli_duration_seconds_bucket",
  action: "drop"
},
// Mark the essential buckets of the two size histograms with a temporary label.
// Note: a plain "keep" action cannot be used here, because it would drop every
// other series from this scrape (including apiserver_request_duration_seconds_bucket),
// not just the non-essential buckets.
{
  sourceLabels: ["__name__", "le"],
  regex: "(apiserver_request_body_size_bytes_bucket|apiserver_response_sizes_bucket);(1024|4096|16384|65536|262144|1.048576e\\+06|4.194304e\\+06|\\+Inf)",
  targetLabel: "__tmp_essential_bucket",
  replacement: "true",
  action: "replace"
},
// Drop the remaining (unmarked) buckets of those two metrics
{
  sourceLabels: ["__name__", "__tmp_essential_bucket"],
  regex: "(apiserver_request_body_size_bytes_bucket|apiserver_response_sizes_bucket);",
  action: "drop"
},
// Remove the temporary marker label before samples are ingested
{
  regex: "__tmp_essential_bucket",
  action: "labeldrop"
}
3. Verify Thanos Compactor
Ensure explicit retention flags:
"--retention.resolution-raw=7d",
"--retention.resolution-5m=30d",
"--retention.resolution-1h=730d"4. Documentation
Create RETENTION-POLICY.md documenting:
- Retention strategy rationale
- Industry research findings
- Compliance requirements (SOC2/ISO27001)
- Quarterly review process
Expected Outcomes
Storage Impact
- Series reduction: 60K → 39K (35% reduction)
- Local storage: 2.6GB → 1.7GB (15-day Prometheus)
- Thanos storage: 2.5GB → 2.0GB (2-year retention)
- Recording rules overhead: +200 series (+420MB over 2 years)
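For context on the overhead figure: 200 series evaluated once per minute produce roughly 200 × 1440 × 730 ≈ 210M samples over two years, which at a typical ~2 bytes per compressed sample works out to on the order of 420MB (before Thanos downsampling shrinks the older portion further).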
Performance Impact
- Query speed: 10-100× faster for percentile queries (pre-computed vs histogram_quantile)
- Dashboard load time: Reduced (fewer series to query)
- Alert evaluation: Faster (use recording rules in alert expressions)
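As a concrete example of the swap: a panel or alert that currently evaluates `histogram_quantile(0.95, sum by (verb, resource, le) (rate(apiserver_request_duration_seconds_bucket[5m])))` at query time would instead read the pre-computed `apiserver:request_duration_seconds:p95` series directly.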
No Regressions
- ✅ All diagnostic capability preserved via recording rules
- ✅ 2-year retention maintained (compliance requirements)
- ✅ SLO tracking unaffected (use pre-computed percentiles)
- ✅ Capacity planning improved (dedicated aggregates)
Implementation Plan
- Phase 1: Add recording rules (safe - only adds new metrics)
- Phase 2: Validate recording rules in Grafana dashboards (1 week test)
- Phase 3: Add metric relabeling rules (reduces cardinality at scrape time)
- Phase 4: Verify Thanos Compactor configuration
- Phase 5: Create documentation
- Phase 6: Monitor for 24-48 hours, then merge
Validation Steps
# Check series count reduction
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | to_entries | sort_by(.value) | reverse | .[0:15]'
# Verify recording rules are evaluating
curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[] | select(.name | contains("apiserver"))'
# Check Thanos downsampling success
kubectl logs -n prometheus thanos-compactor-0 | grep "downsampling"
# Test percentile query performance
time curl -s 'http://prometheus:9090/api/v1/query?query=apiserver:request_duration_seconds:p95'
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Recording rules fail | Deploy with validation, monitor evaluation duration |
| Dashboards break | Test with recording rules before dropping buckets |
| Data loss | Recording rules capture percentiles before bucket cleanup |
| Thanos compactor issues | Verify configuration, monitor compaction metrics |
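Rule health can be watched with Prometheus's own metrics, e.g. `prometheus_rule_evaluation_failures_total` and `prometheus_rule_group_last_duration_seconds`, alongside the validation queries above.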
References
- Issue #78: Storage breakdown dashboard
- Issue #75: Network metrics optimization (50-75% reduction)
- Issue #66: Scrape interval optimization (60s)
- Thanos retention config: prometheus/src/thanos/compactor.ts:373-377
- Current relabeling: prometheus/index.ts:82-100
Related Work
This optimization follows the successful pattern from:
- Issue #40: Initial etcd histogram bucket reduction
- Issue #75: Network metrics cardinality optimization
- Issue #78: Storage breakdown visualizations enabling this analysis