
Conversation


@ndr-ds ndr-ds commented Dec 1, 2025

Motivation

Exporting traces directly to Tempo from every pod, with no sampling, produces a very high volume of data. That volume can backpressure the proxy and shards, flood them with errors, and hurt their performance.

A two-tier collector architecture (routers receiving traces from the pods, samplers performing tail-based sampling) is more efficient.
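As a rough sketch of what this changes on the pod side (assuming an opentelemetry_otlp builder-style API and a hypothetical in-cluster Service name for the router tier; neither is taken from this PR), each pod would point its OTLP exporter at the router collectors instead of at Tempo directly:

use opentelemetry_otlp::{SpanExporter, WithExportConfig};

// Sketch only: the Service name below is an assumption, not the PR's actual value.
fn build_pod_exporter() -> Result<SpanExporter, Box<dyn std::error::Error>> {
    // Pods export to the router tier; the routers fan traces out to the
    // sampler StatefulSet, which applies tail-based sampling before Tempo.
    let exporter = SpanExporter::builder()
        .with_tonic()
        .with_endpoint("http://otel-collector-router.observability.svc.cluster.local:4317")
        .build()?;
    Ok(exporter)
}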

Proposal

Add Kubernetes templates for the OTel collector infrastructure:

  • Router deployment template (receives traces from pods)
  • Sampler StatefulSet template (performs tail-based sampling)
  • Associated ConfigMaps and Services

Test Plan

  • Deploy the OTel collector infrastructure
  • Verify traces flow through the pipeline and that the previous export bottleneck is gone
  • Check that sampling rates are correct

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

@ndr-ds ndr-ds force-pushed the 11-25-add_otel_collector_to_our_tracing_exports_to_tempo branch from 41fd9c0 to 184f4e0 on December 2, 2025 12:55
@ndr-ds ndr-ds force-pushed the 11-25-improve_dashboards branch 2 times, most recently from c42d64a to e439965 on December 2, 2025 13:26
@ndr-ds ndr-ds force-pushed the 11-25-add_otel_collector_to_our_tracing_exports_to_tempo branch 2 times, most recently from 85c2474 to c003b5b on December 3, 2025 21:30
@ndr-ds ndr-ds force-pushed the 11-25-improve_dashboards branch 2 times, most recently from ad6f9ec to 2b0654b on December 4, 2025 12:31
@ndr-ds ndr-ds force-pushed the 11-25-add_otel_collector_to_our_tracing_exports_to_tempo branch from c003b5b to 589cdd3 on December 4, 2025 12:31
@ndr-ds ndr-ds changed the base branch from 11-25-improve_dashboards to graphite-base/5049 on December 11, 2025 15:35
@ndr-ds ndr-ds force-pushed the 11-25-add_otel_collector_to_our_tracing_exports_to_tempo branch from 589cdd3 to 8911a78 on December 11, 2025 15:35
@ndr-ds ndr-ds changed the base branch from graphite-base/5049 to main on December 11, 2025 15:36
@ndr-ds ndr-ds force-pushed the 11-25-add_otel_collector_to_our_tracing_exports_to_tempo branch from 38270c4 to be3ef0b on January 7, 2026 22:26
@ndr-ds ndr-ds force-pushed the 12-03-limit_max_pending_message_bundles_in_benchmarks branch from 2681799 to 7ed6c81 on January 7, 2026 22:26
Comment on lines +138 to +147
// Configure batch processor for high-throughput scenarios
// Larger queue (16k instead of 2k default) to handle benchmark load
// Faster export (100ms instead of 5s default) to prevent queue buildup
let batch_config = opentelemetry_sdk::trace::BatchConfigBuilder::default()
.with_max_queue_size(16384) // 8x default, enough for 8 shards under load
.with_max_export_batch_size(2048) // Larger batches for efficiency
.with_scheduled_delay(std::time::Duration::from_millis(100)) // Fast export to prevent queue buildup
.build();

let batch_processor = BatchSpanProcessor::new(exporter, batch_config);
Contributor

Would it make sense to have different configs when running with the benchmark feature versus in production?
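One hedged sketch of how that could look, assuming a cargo feature named benchmark and keeping production on the SDK defaults (both assumptions, not something this PR implements):

use std::time::Duration;
use opentelemetry_sdk::trace::{BatchConfig, BatchConfigBuilder};

fn batch_config() -> BatchConfig {
    if cfg!(feature = "benchmark") {
        // High-throughput settings for benchmarks: large queue, fast export.
        BatchConfigBuilder::default()
            .with_max_queue_size(16384)
            .with_max_export_batch_size(2048)
            .with_scheduled_delay(Duration::from_millis(100))
            .build()
    } else {
        // Production: stay close to the SDK defaults (2048 queue, 5 s delay).
        BatchConfigBuilder::default().build()
    }
}

The production branch could also use its own explicit values; the point is only that the benchmark-sized queue would not apply outside benchmarks.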
