[Bug]: Shutdown doesn't flush when used with global subscriber #1961

stepantubanov · 2024-07-24T12:23:11Z

What happened?

As mentioned in #1625 - Tracer now holds strong reference to TracerProvider.

When opentelemetry used as a layer with global tracing subscriber it is now impossible to shutdown properly (it only decrements a reference, but doesn't execute Drop).

let layer = tracing_opentelemetry::layer()
  .with_tracer(tracer_provider.tracer("app"));

opentelemetry::global::set_tracer_provider(tracer_provider);
tracing::subscriber::set_global_default(Registry::default().with(layer));

// shutdown call does not actually shutdown global tracer provider
opentelemetry::global::shutdown_tracer_provider();

As a result some spans are missing, flattened, etc.

EDIT: Possible workaround is to flush manually:

tracer_provider.force_flush();

API Version

0.24.0

SDK Version

0.24.1

What Exporter(s) are you seeing the problem on?

OTLP

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

lalitb · 2024-07-24T16:06:08Z

When opentelemetry used as a layer with global tracing subscriber it is now impossible to shutdown properly (it only decrements a reference, but doesn't execute Drop).

That's expected behavior. The global::shutdown_tracer_provider() should only decrement the reference it is holding. There shouldn't be any flush/shutdown mechanism exposed at API/global level as per the specs. The application needs to keep hold of the sdk::TracerProvider handle, and use it to invoke shutdown/flush. And during graceful shutdown of application, the shutdown would be invoked implicitly through Drop which internally does the flush before shutdown. cc @TommyCpp

stepantubanov · 2024-07-24T19:16:10Z

That's expected behavior. The global::shutdown_tracer_provider() should only decrement the reference it is holding. There shouldn't be any flush/shutdown mechanism exposed at API/global level as per the specs.

It's kind of misleading for global::shutdown_tracer_provider() to not be a shutdown mechanism for global tracer. Currently the doc reads:

Shut down the current tracer provider. This will invoke the shutdown method on all span processors. span processors should export remaining spans before return
https://docs.rs/opentelemetry/0.24.0/opentelemetry/global/fn.shutdown_tracer_provider.html

If current behavior (no shutdown if there are other references) is intentional and should remain, then I think it'd be helpful to change the documentation (explicitly state that it's not going to do anything if there's more references to the tracer provider, suggest calling flush/shutdown on the tracer provider directly).

And during graceful shutdown of application, the shutdown would be invoked implicitly through Drop which internally does the flush before shutdown.

In previous opentelemetry version it used to work like that (because Tracer had a weak reference to TracerProvider) for the given example, but with the new version it no longer works. And it was a bit surprising to debug this issue and find out call to shutdown_tracer_provider didn't do anything.

lalitb · 2024-07-24T20:10:33Z

If current behavior (no shutdown if there are other references) is intentional and should remain, then I think it'd be helpful to change the documentation (explicitly state that it's not going to do anything if there's more references to the tracer provider, suggest calling flush/shutdown on the tracer provider directly).

Agree, doc is misleading. We can use this issue to track the improvement. Also, I think global::shutdown_tracer_provider() can be named better say global::release_tracer_provider().

mzabaluev · 2024-10-15T20:51:21Z

Here's a test against opentelemetry_sdk:

#![cfg(feature = "rt-tokio")]

use futures_util::future::BoxFuture;
use opentelemetry::global as otel_global;
use opentelemetry::trace::{TracerProvider as _, Tracer as _};
use opentelemetry_sdk::{
    export::trace::{ExportResult, SpanData, SpanExporter},
    runtime,
    trace::TracerProvider,
};
use tokio::runtime::Runtime;

use std::sync::{Arc, Mutex};

#[derive(Clone, Debug, Default)]
struct TestExporter(Arc<Mutex<Vec<SpanData>>>);

impl SpanExporter for TestExporter {
    fn export(&mut self, mut batch: Vec<SpanData>) -> BoxFuture<'static, ExportResult> {
        let spans = self.0.clone();
        Box::pin(async move {
            if let Ok(mut inner) = spans.lock() {
                inner.append(&mut batch);
            }
            Ok(())
        })
    }
}

fn test_tracer(runtime: &Runtime) -> (TracerProvider, TestExporter) {
    let _guard = runtime.enter();

    let exporter = TestExporter::default();
    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter.clone(), runtime::Tokio)
        .build();

    (provider, exporter)
}

#[test]
fn shutdown_global() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    otel_global::set_tracer_provider(provider);

    let tracer = otel_global::tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }
    // drop(tracer);

    // Should flush all batched telemetry spans
    otel_global::shutdown_tracer_provider();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}

Dropping the tracer before the call to shutdown_tracer_provider makes the test succeed.

As currently implemented, the behavior of shutdown_tracer_provider is fragile against existence of remaining references to the provider, which may be still in scope, or owned by other globally set data like in tokio-rs/tracing-opentelemetry#159, or simply leaked. This is in contrast to TracerProvider::shutdown, which works as expected.

mzabaluev · 2024-10-16T09:31:25Z

This is in contrast to TracerProvider::shutdown, which works as expected.

Here's a test to exercise this. However, it does produce an error printout when the tracer instance is kept alive past the shutdown.

#[test]
fn shutdown_in_scope() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    let tracer = provider.tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }
    // drop(tracer);

    // Should flush all batched telemetry spans
    provider.shutdown().unwrap();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}

There is also a case of a lockup in the shutdown call unless it's done in a separate worker thread. I was able to reproduce and traced it to this line:

opentelemetry-rust/opentelemetry-sdk/src/trace/span_processor.rs

Line 270 in 16c0e10

futures_executor::block_on(res_receiver)

So it looks like we're not safe with the shutdown method either, but that's a subject for another bug report.

pitoniak32 · 2024-11-21T14:15:20Z

I just got bit by this too, and took me a while to find a work around, I think at least the examples / docs need to be updated.

This was my workaround
setup:

pub struct OtelGuard {
    tracer_provider: opentelemetry_sdk::trace::TracerProvider,
}

impl Drop for OtelGuard {
    fn drop(&mut self) {
        println!("Dropping OtelGuard!");
        println!("Shutting down TracerProvider!");
        self.tracer_provider.shutdown().expect("TracerProvider should shutdown properly");
    }
}

pub fn setup_tracing_subscriber() -> anyhow::Result<OtelGuard> {
    let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
        .with_batch_exporter(
            SpanExporter::builder()
                .with_tonic()
                .with_endpoint("grpc://localhost:4317")
                .build()?,
            runtime::Tokio,
        )
        .build();
    
    global::set_tracer_provider(tracer_provider.clone());
    
    tracing::subscriber::set_global_default(
        tracing_subscriber::registry()
            .with(tracing_subscriber::fmt::layer())
            .with(OpenTelemetryLayer::new(tracer_provider.tracer(TRACER_NAME))),
    )
    .unwrap();
    
    Ok(OtelGuard { tracer_provider })
}

main:

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let _otel_guard = otel::setup_tracing_subscriber()?;
    
    let root_span = info_span!("test");
    let _span_gaurd = root_span.enter();
    
    let fruit = "watermelon";
    
    tracing::debug!("Fruit was {fruit}");
    tracing::info!("Fruit was {fruit}");
    tracing::warn!("Fruit was {fruit}");
    tracing::error!("Fruit was {fruit}");
    
    Ok(())
}

lalitb · 2024-11-21T17:24:43Z

To clarify, the shutdown_tracer_provider will eventually be removed from the API surface. Alternatively, it might be moved to the SDK and renamed to release_tracer_provider, as its primary purpose is to release the provider from the global context. For all practical purposes, TracerProvider::Shutdown() should be used for shutting down.

pitoniak32 · 2024-11-22T13:21:13Z

@lalitb, @cijothomas - Would it be welcome to update some of the tracing examples to show this more clearly? I could see this being very frustrating to new comers of the crate trying to get it working from examples and traces not being sent.

To clarify, the shutdown_tracer_provider will eventually be removed from the API surface. Alternatively, it might be moved to the SDK and renamed to release_tracer_provider, as its primary purpose is to release the provider from the global context. For all practical purposes, TracerProvider::Shutdown() should be used for shutting down.

just updated my example snippet to be more accurate

cijothomas · 2024-11-22T15:21:40Z

@pitoniak32 I'd suggest to take a look at the Metrics example and see if it makes sense and is easy to use. If yes, lets replicate the same to Traces. This was always the plan, just that Metrics (and logs) made progress, and we left out tracing. (Sorry, we just didn't have the bandwidth to tackle all, but I realize now it'd have been better to not leave high level inconsistency like this :( )

The overall idea is:
Make the providers a cheaply cloneable struct pointing to same underlying provider. Lets make it clear in the doc that Provider is a handle to underlying provider only and is cheaply cloneable.
Once user creates a provider, they hold on to it, so shutdown can be called at the end. -- this is the key part.
If there is a need to set the provider globally (so that other parts of the app or libraries can retrieve it), clone and set it.

Metrics and Logs are already working like this. (Small difference: Logs does not have the need to set global)
Lets make Traces consistent with rest of the signals, and then the doc/example would naturally be consistent and won't suffer from the confusion raised in this issue.

@pitoniak32 Something you can help to make happen?
(Only one thing to note, if planning to work on this immediately: the next release 0.27.1 should not contain any breaking change - we can still mark things obsolete, and do 0.28 with breaking changes, if any needed.)

pitoniak32 · 2024-11-22T15:28:31Z

Yeah I can take a look at this! Makes sense, thank you for the outline!

Totally understand, there's a ton of stuff to maintain in here!

cijothomas · 2024-12-14T02:37:53Z

I think this issue can be closed now. @pitoniak32 do you see anything remaining to be tackled for this?

Because we're using the batch provider, and span information is sent when the span *exits*, if we just let the process exit immediately, we might lose some tracing data; The [recommended pattern](open-telemetry/opentelemetry-rust#1961 (comment)) is to hold onto the providers and shut them down manually as the process exits. This will wait for any spans to finish shipping and avoid losing data. Note, that we might want another pass at this in the future: - integrate it into the panic handler that I added in another branch - integrate something like [Tokio Graceful Shutdown](https://docs.rs/tokio-graceful-shutdown/latest/tokio_graceful_shutdown/) to intercept ctrl+C and the like - add a timeout, so that a stalled metrics writer doesn't wait forever I kept it simple for this PR, but just something we should keep in mind

cijothomas · 2025-01-24T15:53:42Z

Closing this issue. All examples are modified to show the right way to perform shutdown() now.

stepantubanov added bug Something isn't working triage:todo Needs to be traiged. labels Jul 24, 2024

stepantubanov changed the title ~~[Bug]: No way to shutdown properly when used with global subscriber~~ [Bug]: Shutdown doesn't flush when used with global subscriber Jul 24, 2024

maurocchi mentioned this issue Sep 11, 2024

[PAPAY-1395]: Explicitely shutdown the tracer provider primait/prima_tracing.rs#113

Merged

h-a-n-a mentioned this issue Oct 14, 2024

Adding tracing_opentelemetry::layer to tracing_subscriber::registry breaks span exporting to Jaeger tokio-rs/tracing-opentelemetry#159

Closed

mzabaluev mentioned this issue Oct 15, 2024

Fix the example installing OpenTelemetryLayer into a global subscriber tokio-rs/tracing-opentelemetry#175

Merged

lalitb self-assigned this Oct 15, 2024

nshalman mentioned this issue Nov 20, 2024

init-tracing-opentelemetry: Regression on flushing all traces at program termination davidB/tracing-opentelemetry-instrumentation-sdk#184

Closed

pitoniak32 mentioned this issue Nov 22, 2024

REQUEST: New membership for pitoniak32 open-telemetry/community#2452

Closed

6 tasks

nshalman mentioned this issue Nov 25, 2024

Tracking upstream bug equinix-labs/rust-otel-tools#3

Closed

pitoniak32 mentioned this issue Dec 1, 2024

chore: remove the global::shutdown_tracer_provider function #2369

Merged

4 tasks

cijothomas added the shutdown&runtime label Dec 25, 2024

cijothomas closed this as completed Jan 24, 2025

kaffarell mentioned this issue Feb 20, 2025

set_global_default does not seem to work tokio-rs/tracing#3217

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Shutdown doesn't flush when used with global subscriber #1961

[Bug]: Shutdown doesn't flush when used with global subscriber #1961

stepantubanov commented Jul 24, 2024 •

edited

Loading

lalitb commented Jul 24, 2024 •

edited

Loading

stepantubanov commented Jul 24, 2024 •

edited

Loading

lalitb commented Jul 24, 2024

mzabaluev commented Oct 15, 2024 •

edited

Loading

mzabaluev commented Oct 16, 2024 •

edited

Loading

pitoniak32 commented Nov 21, 2024 •

edited

Loading

lalitb commented Nov 21, 2024 •

edited

Loading

pitoniak32 commented Nov 22, 2024 •

edited

Loading

cijothomas commented Nov 22, 2024

pitoniak32 commented Nov 22, 2024

cijothomas commented Dec 14, 2024

cijothomas commented Jan 24, 2025

[Bug]: Shutdown doesn't flush when used with global subscriber #1961

[Bug]: Shutdown doesn't flush when used with global subscriber #1961

Comments

stepantubanov commented Jul 24, 2024 • edited Loading

What happened?

API Version

SDK Version

What Exporter(s) are you seeing the problem on?

Relevant log output

lalitb commented Jul 24, 2024 • edited Loading

stepantubanov commented Jul 24, 2024 • edited Loading

lalitb commented Jul 24, 2024

mzabaluev commented Oct 15, 2024 • edited Loading

mzabaluev commented Oct 16, 2024 • edited Loading

pitoniak32 commented Nov 21, 2024 • edited Loading

lalitb commented Nov 21, 2024 • edited Loading

pitoniak32 commented Nov 22, 2024 • edited Loading

cijothomas commented Nov 22, 2024

pitoniak32 commented Nov 22, 2024

cijothomas commented Dec 14, 2024

cijothomas commented Jan 24, 2025

stepantubanov commented Jul 24, 2024 •

edited

Loading

lalitb commented Jul 24, 2024 •

edited

Loading

stepantubanov commented Jul 24, 2024 •

edited

Loading

mzabaluev commented Oct 15, 2024 •

edited

Loading

mzabaluev commented Oct 16, 2024 •

edited

Loading

pitoniak32 commented Nov 21, 2024 •

edited

Loading

lalitb commented Nov 21, 2024 •

edited

Loading

pitoniak32 commented Nov 22, 2024 •

edited

Loading