feat(observability): add OTEL root and client spans for MCP flows #3872

Open
vishu-bh wants to merge 4 commits into main from otel-phase1-request-root-spans

vishu-bh (Collaborator) commented Mar 26, 2026

🔗 Related Issue

Closes #3858
Refs #3736


📝 Summary

This PR builds out the gateway-side OpenTelemetry trace tree for MCP traffic and plugin execution.

Implemented so far:

  • added OTEL request-root spans for gateway transport paths such as /rpc, /mcp, server-scoped MCP routes, and internal MCP transport hops
  • added shared W3C trace-context helpers and outbound traceparent / tracestate injection
  • added MCP client lifecycle spans on the Python gateway path:
    • mcp.client.call
    • mcp.client.initialize
    • mcp.client.request
    • mcp.client.response
  • added explicit gateway-side response tracing so upstream success is visible in the trace even before the upstream service is instrumented
  • added Python FastMCP upstream runtime support so Python MCP servers can join the distributed trace when OTEL is enabled in that server process
  • added plugin framework tracing at the shared hook dispatch layer:
    • plugin.hook.invoke for the hook chain
    • plugin.execute for each plugin execution
  • recorded plugin stop-chain behavior in trace attributes, including which plugin stopped processing
  • preserved Rust compatibility by continuing to inject W3C trace headers into Rust direct-execution plans
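
For readers unfamiliar with the header format, the outbound injection described above can be sketched as follows. This is a minimal, standard-library illustration of the W3C `traceparent` layout and is not the helper code added in this PR; the function names `format_traceparent` and `inject_trace_context` are hypothetical.

```python
from typing import Dict, Mapping, Optional


def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a W3C traceparent value: version-traceid-parentid-flags.

    Version is fixed at "00"; the trace id is 16 bytes and the span id
    8 bytes, both lowercase hex; bit 0 of the flags byte means "sampled".
    """
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def inject_trace_context(
    headers: Mapping[str, str],
    trace_id: int,
    span_id: int,
    tracestate: Optional[str] = None,
    sampled: bool = True,
) -> Dict[str, str]:
    """Return a copy of `headers` with trace-context headers added.

    Hypothetical helper: the real gateway code may take the ids from the
    active OTEL span rather than from explicit arguments.
    """
    out = dict(headers)
    out["traceparent"] = format_traceparent(trace_id, span_id, sampled)
    if tracestate:
        # tracestate is optional vendor data and is forwarded unchanged.
        out["tracestate"] = tracestate
    return out


outbound = inject_trace_context(
    {"accept": "application/json"},
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
    tracestate="vendor=value",
)
print(outbound["traceparent"])
```

The ids used here are the example values from the W3C Trace Context specification. In the gateway they would be taken from whichever span is current when the outbound MCP request is built.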

Current gateway-side trace shape for an MCP tool call is now roughly:

  • POST /rpc
  • tool.invoke
  • plugin.hook.invoke / plugin.execute for pre-invoke hooks when configured
  • mcp.client.call
  • mcp.client.initialize
  • mcp.client.request
  • mcp.client.response
  • plugin.hook.invoke / plugin.execute for post-invoke hooks when configured
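
To make the shape above concrete, here is a small simulation using a toy in-memory span recorder instead of the OTEL SDK. The parent/child nesting, the `hook`, `plugin`, and `stopped_by` attribute names, and the plugin names are assumptions for illustration only; the PR description lists the spans but does not spell out these details.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


_current: ContextVar[Optional[Span]] = ContextVar("current_span", default=None)


@contextmanager
def start_span(name: str, **attributes) -> Iterator[Span]:
    """Open a span as a child of whatever span is currently active."""
    parent = _current.get()
    span = Span(name, dict(attributes))
    if parent is not None:
        parent.children.append(span)
    token = _current.set(span)
    try:
        yield span
    finally:
        _current.reset(token)


def run_hook_chain(hook: str, plugins: List[Tuple[str, bool]]) -> None:
    """Run (plugin_name, continue_processing) pairs under one chain span."""
    with start_span("plugin.hook.invoke", hook=hook) as chain:
        for plugin_name, continue_processing in plugins:
            with start_span("plugin.execute", plugin=plugin_name):
                pass  # the plugin body would run here
            if not continue_processing:
                # Record which plugin stopped the chain (hypothetical key).
                chain.attributes["stopped_by"] = plugin_name
                break


def simulate_tool_call() -> Span:
    with start_span("POST /rpc") as root:
        with start_span("tool.invoke"):
            run_hook_chain("pre_invoke", [("example_filter", True)])
            with start_span("mcp.client.call"):
                for phase in ("initialize", "request", "response"):
                    with start_span(f"mcp.client.{phase}"):
                        pass
            run_hook_chain("post_invoke", [("example_gate", False), ("skipped", True)])
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


root = simulate_tool_call()
print("\n".join(render(root)))
```

In this sketch the post-invoke chain stops at `example_gate`, so the `skipped` plugin never gets a `plugin.execute` span and the chain span carries `stopped_by="example_gate"`.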

Upstream server spans for non-Python services such as fast-time-server are not part of this PR. Those services can join the same trace by extracting traceparent / tracestate and exporting OTEL spans from their own runtime.
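
As a rough guide for such a service, the extraction side might look like the sketch below, shown in Python for consistency with the rest of this PR even though the service itself may be written in another language. It uses only the standard library and follows the W3C `traceparent` validation rules; `RemoteContext` and `extract_trace_context` are hypothetical names, and a real service would normally use its OTEL SDK's propagator instead.

```python
import re
from typing import Mapping, NamedTuple, Optional


class RemoteContext(NamedTuple):
    trace_id: str        # 32 lowercase hex chars
    parent_span_id: str  # 16 lowercase hex chars
    sampled: bool
    tracestate: str      # forwarded unchanged; empty if absent


# version-traceid-parentid-flags, with optional trailing fields that are
# only allowed for versions newer than "00".
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})(-.*)?$"
)


def extract_trace_context(headers: Mapping[str, str]) -> Optional[RemoteContext]:
    """Parse incoming trace headers; return None if absent or invalid."""
    lowered = {key.lower(): value for key, value in headers.items()}
    match = _TRACEPARENT_RE.match(lowered.get("traceparent", "").strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.group(1, 2, 3, 4)
    if version == "ff":
        return None  # "ff" is reserved as an invalid version
    if version == "00" and match.group(5):
        return None  # version 00 has exactly four fields
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are invalid
    sampled = bool(int(flags, 16) & 0x01)
    return RemoteContext(trace_id, span_id, sampled, lowered.get("tracestate", ""))


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor=value",
}
remote = extract_trace_context(incoming)
print(remote)
```

A server span started with `remote.trace_id` as its trace id and `remote.parent_span_id` as its parent would then appear in the same trace as the gateway's client span that sent the request.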


🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

| Check | Command | Status |
|---|---|---|
| Lint suite | make lint | Not run |
| Unit tests | make test | Focused tests run |
| Coverage ≥ 80% | make coverage | Not run |

Focused verification run:

  • python -m py_compile mcpgateway/observability.py mcpgateway/main.py mcpgateway/plugins/framework/external/mcp/server/runtime.py mcpgateway/services/tool_service.py tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py tests/unit/mcpgateway/services/test_tool_service.py tests/unit/mcpgateway/plugins/framework/test_observability.py
  • pytest tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py -q
  • pytest tests/unit/mcpgateway/services/test_tool_service.py -k "streamablehttp_creates_client_lifecycle_spans or invoke_tool_mcp_streamablehttp or prepare_rust_mcp_tool_execution_injects_w3c_trace_context_into_plan_headers" -q
  • pytest tests/unit/mcpgateway/plugins/framework/test_observability.py -q

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes

  • This PR focuses on the gateway and Python-side trace tree. Upstream MCP server instrumentation can be a separate follow-up.

vishu-bh force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from e46fcc4 to fb014ce, March 27, 2026 11:20
vishu-bh marked this pull request as ready for review March 27, 2026 11:20
crivetimihai changed the title from "feat: add OTEL root and client spans for MCP flows" to "feat(observability): add OTEL root and client spans for MCP flows" Mar 29, 2026
crivetimihai added the labels enhancement (New feature or request), SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release), and observability (Observability, logging, monitoring) Mar 29, 2026
crivetimihai added this to the Release 1.1.0 milestone Mar 29, 2026
crivetimihai (Member) commented:

Thanks @vishu-bh. Comprehensive OTEL instrumentation across MCP transports — this will be very valuable for production debugging. A few notes:

  1. Please check PR #3900 (feat: integrate Langfuse LLM observability via OTEL), which implements much of the same observability surface. Please coordinate to avoid duplication and merge conflicts.
  2. Given the breadth of changes (14 files), please ensure all spans follow the naming conventions established in the existing observability middleware.
  3. DCO Signed-off-by is required on all commits.

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
vishu-bh force-pushed the otel-phase1-request-root-spans branch from fb014ce to f7bb46a, March 29, 2026 22:23
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

Development

Successfully merging this pull request may close these issues.

[TASK][OBSERVABILITY]: Add end-to-end OTEL trace trees and W3C propagation for gateway to MCP servers

2 participants