---
title: "Distributed Tracing"
description: "Monitor request flows across services with OpenTelemetry integration for performance debugging and system observability"
---

Acontext now includes comprehensive distributed tracing support through OpenTelemetry integration. This enables you to track requests as they flow through your entire system, from API endpoints through core services, database operations, and external service calls.

## Overview

Distributed tracing provides end-to-end visibility into how requests are processed across multiple services. When a request comes in, Acontext automatically creates a trace that follows the request through:

- **acontext-api**: HTTP API layer (Go service)
- **acontext-core**: Core business logic (Python service)
- **Database operations**: SQL queries and transactions
- **Cache operations**: Redis interactions
- **Storage operations**: S3 blob storage
- **Message queue**: RabbitMQ message processing
- **LLM operations**: Embedding and completion calls

<Info>
Traces are automatically collected when OpenTelemetry is enabled in your deployment. The system uses Jaeger as the trace backend for storage and visualization.
</Info>
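
If your deployment does not already run a Jaeger instance, a minimal local setup might look like the following sketch (the image tag and port mappings are illustrative; adjust them to your environment):

```bash
# Run a local Jaeger all-in-one instance with the OTLP collector enabled.
# Ports: 16686 = Jaeger UI, 4317 = OTLP gRPC (the endpoint used in the configs below).
docker run --rm -d \
  --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest
```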

## How It Works

Acontext uses OpenTelemetry to instrument both the API and Core services:

### Automatic Instrumentation

The following operations are automatically traced:

- **HTTP requests**: All API endpoints are instrumented with request/response details
- **Database queries**: SQL operations are traced with query details
- **Cache operations**: Redis get/set operations
- **Storage operations**: S3 upload/download operations
- **Message processing**: Async message queue operations
- **LLM calls**: Embedding and completion API calls

### Cross-Service Tracing

When a request flows from `acontext-api` to `acontext-core`, the trace context is automatically propagated using OpenTelemetry's trace context headers. This creates a unified trace showing the complete request flow across both services.

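
Under the hood this relies on the W3C Trace Context `traceparent` header. The sketch below is not Acontext's internal code; it only illustrates the mechanism using the OpenTelemetry Python SDK, and the endpoint URL and span names are hypothetical:

```python
# Illustration of W3C trace context propagation between two services.
# Assumes a TracerProvider is already configured (see the Configuration section).
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("propagation-example")

def call_downstream():
    """Caller side: inject the current trace context into outgoing headers."""
    headers: dict[str, str] = {}
    with tracer.start_as_current_span("call-downstream"):
        inject(headers)  # adds "traceparent: 00-<trace_id>-<span_id>-<flags>"
        requests.post("http://acontext-core:8000/example", headers=headers)  # hypothetical URL

def handle_request(incoming_headers: dict):
    """Callee side: extract the context so new spans join the caller's trace."""
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle-request", context=ctx):
        ...  # processing here shows up as a child of the caller's span
```
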
<Frame caption="Traces viewer showing distributed traces with hierarchical span visualization">
<img src="/images/dashboard/traces_viewer.png" alt="Traces viewer interface displaying traces with expandable spans, color-coded services, HTTP method badges, and duration visualization" />
</Frame>

## Viewing Traces

### Dashboard Traces Viewer

Access the traces viewer from the dashboard to see all traces in your system:

- **Time range filtering**: Filter traces by time ranges (15 minutes, 1 hour, 6 hours, 24 hours, or 7 days)
- **Auto-refresh**: Automatically refreshes every 30 seconds
- **Hierarchical visualization**: Expand traces to view nested spans showing the complete request flow
- **Service identification**: Color-coded spans distinguish between services (acontext-api in teal, acontext-core in blue)
- **HTTP method badges**: Quickly identify request types
- **Duration visualization**: Visual timeline bars show relative execution times
- **Trace ID**: Copy trace IDs to correlate with logs and metrics

<Tip>
Click the external link icon next to a trace ID to open the detailed trace view in Jaeger UI for advanced analysis.
</Tip>

### Jaeger UI

For advanced trace analysis, you can access Jaeger UI directly. The traces viewer provides a link to open each trace in Jaeger, where you can:

- View detailed span attributes and tags
- Analyze trace dependencies and service maps
- Filter and search traces by various criteria
- Compare trace performance over time

## Configuration

Tracing is configured through environment variables for the core service and through the YAML configuration file for the API service. The following settings control tracing behavior:

### Core Service (Python)

```bash
# Enable/disable tracing
TELEMETRY_ENABLED=true

# OTLP endpoint (Jaeger collector)
TELEMETRY_OTLP_ENDPOINT=http://localhost:4317

# Sampling ratio (0.0-1.0, default 1.0 = 100% sampling)
TELEMETRY_SAMPLE_RATIO=1.0

# Service name for tracing
TELEMETRY_SERVICE_NAME=acontext-core
```
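
These variables map onto standard OpenTelemetry SDK concepts. The snippet below is not Acontext's actual startup code; it is a rough sketch of equivalent SDK wiring, assuming the `opentelemetry-sdk` and OTLP gRPC exporter packages are installed:

```python
# Rough equivalent of the settings above using the OpenTelemetry Python SDK.
# This is an illustration, not Acontext's internal initialization code.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "acontext-core"}),  # TELEMETRY_SERVICE_NAME
    sampler=TraceIdRatioBased(1.0),                               # TELEMETRY_SAMPLE_RATIO
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))  # TELEMETRY_OTLP_ENDPOINT
)
trace.set_tracer_provider(provider)
```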

### API Service (Go)

```yaml
telemetry:
  enabled: true
  otlp_endpoint: "localhost:4317"
  sample_ratio: 1.0
```

<Warning>
In production environments, consider using a sampling ratio less than 1.0 (e.g., 0.1 for 10% sampling) to reduce storage costs and overhead while still capturing representative traces.
</Warning>
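
For example, a 10% sampling configuration for the core service only changes the ratio (the API service's `sample_ratio` takes the same range):

```bash
# Keep roughly 1 in 10 traces in production
TELEMETRY_SAMPLE_RATIO=0.1
```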

## Understanding Traces

### Trace Structure

Each trace consists of the following, illustrated by the example after this list:

- **Root span**: The initial request entry point (usually an HTTP endpoint)
- **Child spans**: Operations performed during request processing
- **Nested spans**: Operations that are part of larger operations
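
For example, a single API request might produce a trace shaped roughly like this (the child span names below are purely illustrative):

```text
GET /api/v1/session/:session_id/get_learning_status   [acontext-api]   <- root span
└─ process get_learning_status                         [acontext-core]  <- child span
   ├─ SELECT ... FROM sessions                         [database]       <- nested span
   └─ Redis GET                                        [cache]          <- nested span
```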

### Span Information

Each span contains:

- **Operation name**: The operation being performed (e.g., `GET /api/v1/session/:session_id/get_learning_status`)
- **Service name**: Which service performed the operation (`acontext-api` or `acontext-core`)
- **Duration**: How long the operation took
- **Tags**: Additional metadata (HTTP method, status codes, error information); a representative example follows this list
- **Timestamps**: When the operation started and ended
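
As a rough illustration, the tags on an HTTP span typically follow OpenTelemetry semantic conventions; the exact keys depend on the instrumentation version in use:

```python
# Representative tags for an HTTP root span (illustrative values only).
span_tags = {
    "http.method": "GET",
    "http.route": "/api/v1/session/:session_id/get_learning_status",
    "http.status_code": 200,
}
```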

### Service Colors

In the traces viewer, spans are color-coded by service:

- **Teal**: `acontext-api` operations
- **Blue**: `acontext-core` operations
- **Gray**: Other services or unknown operations

## Use Cases

<AccordionGroup>
<Accordion title="Performance debugging">
Identify slow operations and bottlenecks in your system by analyzing trace durations. Expand traces to see which specific operation is taking the most time.

```python
# Traces automatically show up in the dashboard
# No code changes needed - just enable tracing in your configuration
```

1. Open the traces viewer in the dashboard
2. Filter by time range to focus on recent requests
3. Look for traces with long durations
4. Expand the trace to see which span is slow
5. Check the operation name and service to identify the bottleneck
</Accordion>

<Accordion title="Error investigation">
When an error occurs, use the trace ID to correlate logs and understand the full request flow that led to the error. A sketch for writing trace IDs into your own application logs follows the steps below.

1. Find the error in your logs and note the trace ID
2. Search for the trace ID in the traces viewer
3. Expand the trace to see the complete request flow
4. Identify which service and operation failed
5. Check span tags for error details
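
If your own application code is also instrumented with OpenTelemetry, a simple way to make this correlation possible is to write the active trace ID into error logs. A minimal sketch with the Python SDK (logger name and message format are illustrative):

```python
# Sketch: include the active trace ID in application logs so they can be matched
# against the traces viewer. Names and messages here are illustrative.
import logging
from opentelemetry import trace

logger = logging.getLogger("acontext-example")

def log_error_with_trace_id(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        # Trace IDs are 128-bit integers; format as 32 hex chars to match Jaeger and the dashboard.
        logger.error("%s trace_id=%s", message, format(ctx.trace_id, "032x"))
    else:
        logger.error(message)
```
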
</Accordion>

<Accordion title="Service dependency analysis">
Understand how your services interact by analyzing trace flows. See which services call which other services and how frequently.

1. View traces in Jaeger UI for advanced analysis
2. Use Jaeger's service map view to visualize dependencies
3. Analyze trace patterns to understand service communication
</Accordion>

<Accordion title="Performance optimization">
Compare trace durations before and after optimizations to measure improvements.

1. Note trace durations for specific operations before optimization
2. Make your optimizations
3. Compare new trace durations to verify improvements
4. Use trace data to identify the next optimization target
</Accordion>
</AccordionGroup>

## Best Practices

<CardGroup cols={2}>
<Card title="Use sampling in production" icon="chart-line">
Configure a sampling ratio (e.g., 0.1 for 10%) to reduce storage costs while maintaining observability.
</Card>

<Card title="Correlate with logs" icon="link">
Use trace IDs from traces to find related log entries and get complete context for debugging.
</Card>

<Card title="Monitor trace volume" icon="eye">
Watch trace collection rates to ensure your sampling ratio is appropriate for your traffic volume.
</Card>

<Card title="Set up alerts" icon="bell">
Configure alerts based on trace durations to catch performance regressions early.
</Card>
</CardGroup>

## Next Steps

<CardGroup cols={2}>
<Card title="Dashboard" icon="chart-simple" href="/observe/dashboard">
View traces alongside other observability data in the unified dashboard.
</Card>

<Card title="Settings" icon="gear" href="/settings/runtime">
Configure tracing settings and sampling ratios for your deployment.
</Card>
</CardGroup>