-
Notifications
You must be signed in to change notification settings - Fork 516
[Prototype] OpenTelemetry and Tokio Tracing bridge that properly activates contexts and spans #2420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
From CommunityMeeting on Dec 17th: The key idea is to use tokio::tracing::subscriber to activate/deactivate OTel Context. #2378 is a requirement, but okay to manually test with the changes brought in by hand, so not a blocker for prototype. #2438 can be still offered experimental, as this issue will fix the correlation without requiring the change from #2438 |
I've created something that I think should work a bit better here. It still needs some polishing and fixing some of the tests but the broken example should work. I did not test with multiple threads yet, that's probably the main thing that might not work now. Tracking open contexts is also very rudimentary and it won't handle overlapping contexts and out-of-order exits well, but I'll probably just copy whatever behavior is in #2378 sooner or later. If someone tried playing around with it and find issues with it, that would also help. |
Hey @mladedav thanks for sharing your code. I've taken a little different approach where I focus on letting the context propagate with the tokio spans and then OpenTelemetry Tracing can be a feature that can be turned on or off. It also has the same behavior as the |
I've got a working prototype here that runs all the existing This prototype keeps the existing |
I've looked through it and I gotta say it's cool that you were able to do this without changing the API too much (or maybe at all, I don't think I've noticed any breaking changes). I'll start with some potential issues I've noticed which I think can be fixed and at the end will discuss the conceptual differences between the three versions (including the currently released One thing I noticed is that on the first This leads to the following strange behavior where the first child is not a child of the root span because the context wasn't built yet. This seems to happen, because Broken children
#[test]
fn parent_context() {
let mut tracer = TestTracer::default();
let subscriber = tracing_subscriber::registry().with(layer().with_tracer(tracer.clone()));
tracing::subscriber::with_default(subscriber, || {
let root = trace_span!("root");
let child1 = trace_span!("child-1");
child1.set_parent(root.context());
let _enter_root = root.enter();
drop(_enter_root);
let child2 = trace_span!("child-2");
child2.set_parent(root.context());
});
// Let's check the spans
let spans = tracer.spans();
let parent = spans
.iter()
.find(|span| span.name == "root")
.unwrap();
let child1 = spans
.iter()
.find(|span| span.name == "child-1")
.unwrap();
let child2 = spans
.iter()
.find(|span| span.name == "child-2")
.unwrap();
assert_eq!(parent.parent_span_id, otel::SpanId::INVALID);
assert_eq!(child1.parent_span_id, otel::SpanId::INVALID); // This is surprising
assert_eq!(child2.parent_span_id, parent.span_context.span_id());
} I don't know how simple it would be to fix this, maybe the context can also be built on the first There's also other interaction like when the context is built, then recording fields doesn't work anymore Broken record
#[test]
fn record_after() {
let mut tracer = TestTracer::default();
let subscriber = tracing_subscriber::registry().with(layer().with_tracer(tracer.clone()));
tracing::subscriber::with_default(subscriber, || {
let root = trace_span!("root", before = "before", after = "before");
root.record("before", "after");
let _enter_root = root.enter();
drop(_enter_root);
root.record("after", "after");
});
// Let's check the spans
let spans = tracer.spans();
let parent = spans.iter().find(|span| span.name == "root").unwrap();
assert_eq!(parent.parent_span_id, otel::SpanId::INVALID);
assert!(
parent
.attributes
.iter()
.filter(|kv| kv.key.as_str() == "before")
.any(|kv| kv.value.as_str() == "after")
);
// This fails
assert!(
parent
.attributes
.iter()
.filter(|kv| kv.key.as_str() == "after")
.any(|kv| kv.value.as_str() == "after")
);
} I think this should be easy to fix since this can simply work the same way as Another example of this different behavior can be seen here, where Details
#[test]
fn parent_context_2() {
let mut tracer = TestTracer::default();
let subscriber = tracing_subscriber::registry().with(layer().with_tracer(tracer.clone()));
tracing::subscriber::with_default(subscriber, || {
let root = trace_span!("root");
_ = root.enter();
let child1 = trace_span!("child-1");
child1.set_parent(root.context());
trace_span!(parent: &child1, "child-2");
child1.set_parent(root.context());
trace_span!(parent: &child1, "child-3");
});
// Let's check the spans
let spans = tracer.spans();
let root = spans
.iter()
.find(|span| span.name == "root")
.unwrap();
let child1 = spans
.iter()
.find(|span| span.name == "child-1")
.unwrap();
let child2 = spans
.iter()
.find(|span| span.name == "child-2")
.unwrap();
let child3 = spans
.iter()
.find(|span| span.name == "child-3")
.unwrap();
assert_eq!(root.parent_span_id, otel::SpanId::INVALID);
assert_eq!(child1.parent_span_id, root.span_context.span_id());
assert_eq!(child2.parent_span_id, child1.span_context.span_id());
assert_eq!(child3.parent_span_id, root.span_context.span_id()); // This is surprising, the parent should be `child1`
} Also, On a more general note, the design I went with is in big part because I also tried to not allow users to change parent relationships after constructing a given span. The OpenTelemetry spec says that should be immutable and I think that the whole idea of calling I think this was one of the reasons the old implementation built the span only after This implementation builds the span earlier, but that leads to change in behavior of spans at a point which may be hard to track. Currently it would be either first My version just got rid of this method completely for the price of completely changing the way (remote) explicit otel parents are set, which is more restrictive and completely changes the API, but I still think that that's the right way to go in the long term. |
Thank you @mladedav for this very thorough analysis and testing of my POC. I'm going to go through your examples ASAP and see what can be done. As for where we should go with the API, my aim was to see if it would be possible to do this with minimal differences to the current |
@bantonsson @mladedav When you are referring to "API" what component are you referring to? It is not clear to me which APIs are being referred to here. |
I was (and I assume @bantonsson was) referring to the API surface of |
Oh okay! Thanks for clarifying. I wasn't aware that it exposed any API other than what is already covered by |
The main one is Complementary to that there's also There are some more methods on the extension trait but I think these two are the most important. |
Removing from 0.29 milestone. New ETA to be shared. |
Nice to see you here, @bantonsson ! |
What are the next steps here? There are two different prototypes
I guess getting either merged to Basically, I'm not really sure what's the next goal after we have these prototypes. |
Hey @mladedav I had a chat with @cijothomas and @bantonsson in the last couple days about this. Today i'll write up a "high level path to victory" and ping everyone on it to discuss, so we can try drive this forward. |
@mladedav Sorry for the lack of progress here. I'm just getting back to this. I just want to clarify that the intent of my POC is to remove all custom code in |
I didn't mean to put pressure on you or anything, I was genuinely not sure what should happen next so feel free to take your time. As to the custom code, I thought the way presampling works in |
No worries. I've been meaning to get back to this.
You are completely right, and that is why I'm starting the OpenTelemetry |
With @bantonsson's support, I spent a bunch of time last week grokking opentelemetry-tracing, and have addressed the issues identified by @mladedav (#2420 (comment)) and evaluated the performance against the existing code base(see below) . You can find the tests added and the changes on Björn's branch Fixesbroken children change parent more than once using Span hierarchy otherwise should be established in the typical tokio-tracing fashion - either implicitly via tokio-tracing context ("I created a new span within an existing active span"), or via the tokio-tracing sampling_decision_respects_new_parent ** Benchmark Results
Both of these are benchmarked from Björn's fork. I think we should mainly concern ourselves with regressions in
Next Steps
Update 15/04: Updated results table; |
This is a follow up ticket to discussions in #1571 about OpenTelemetry Tracing API vs Tokio-Tracing API.
The aim is to prototype a Tokio Tracing subscriber bridge with OpenTelemetry tracing/logging that will use the overlapping context scopes fix as a basis and keep the OpenTelemetry
Context
up to date with the Tokio Tracing notion of the activeSpan
. This bridge will activate/deactivate the OpenTelemetryContext
andSpan
on the threads at the proper points instead of only briefly activating the OpenTelemetryContext
orSpan
when the Tokio TracingSpan
is finished. More concretely there will be a stack that mimics the stack in Tokio Tracing that deactivates the right OpenTelemetryContext
corresponding to the Tokio TracingSpan
.This will not (at least not initially) provide seamless two-way interoperability as described in Option 3 here, but instead allow the OpenTelemetry
Context
to travel with Tokio TracingSpans
, so that standard OpenTelemetry API calls will find the rightContext
andSpan
The goal is to make code like this broken example work seamlessly from the OpenTelemetry point of view, allowing end user code to use OpenTelemetry API calls that will work with frameworks and libraries that are using Tokio Tracing.
I would love to get feedback on the idea and hear about other peoples experience in trying to fix the interoperability.
The text was updated successfully, but these errors were encountered: