OTel Memory Leak #6315

Open · tjsampson opened this issue Nov 8, 2024 · 14 comments
Labels: bug (Something isn't working), response needed (Waiting on user input before progress can be made)

Comments

tjsampson commented Nov 8, 2024

Description

Memory Leak in otel library code.

Environment

  • OS: Linux
  • Architecture: x86
  • Go Version: 1.23.2
  • go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
  • go.opentelemetry.io/otel v1.31.0
  • go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.31.0
  • go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.31.0
  • go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.31.0
  • go.opentelemetry.io/otel/sdk v1.31.0
  • go.opentelemetry.io/otel/trace v1.31.0

Steps To Reproduce

See Comment here: #5190 (comment)

I am pretty sure this is still an issue here, or with something else in the Go OTel ecosystem. I will get a pprof setup going, possibly tomorrow, but here is some anecdotal evidence in the meantime:

[Screenshot: memory usage graph, 2024-11-07 6:49 PM]

It's pretty easy to see from that graph when tracing was implemented. And yes, I have removed our tracing implementation and memory usage goes back to normal.

Here is a rough draft of our setup. Please let me know if I am doing anything egregiously dumb, but for the most part it's all pretty standard stuff taken from various docs:

go.mod

go 1.23.2

require (
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
	go.opentelemetry.io/otel v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.31.0
	go.opentelemetry.io/otel/sdk v1.31.0
	go.opentelemetry.io/otel/trace v1.31.0
)

We are wrapping otelhttp.NewHandler around the top-level muxer, so everything is traced. Yes, I know this is expensive, but it shouldn't leak memory. Eventually we will change this to include/exclude/drop routes, just so we aren't taking in so much volume (ping routes, health checks, etc.), and do more aggressive downsampling (see the sketch after the next snippet).

func NewApi(c config.Config) *Api {
	return &Api{
		c:           &c,
		controllers: []Controller{newControllers(c)},
		server: &http.Server{
			ReadTimeout:  c.ReadTimeout,
			WriteTimeout: c.WriteTimeout,
			IdleTimeout:  c.IdleTimeout,
			Addr:         fmt.Sprintf(":%d", c.Port),
			Handler:      otelhttp.NewHandler(chi.NewMux(), "INGRESS", otelhttp.WithFilter(traceFilter)),
		},
		done:       make(chan bool),
		sigChannel: make(chan os.Signal, 1024),
	}
}
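For context, that include/exclude/downsampling change would look roughly like the sketch below. This is not what we run today; dropHealthChecks, the path prefixes, and the 10% ratio are made up for illustration, using the public otelhttp and sdktrace APIs:

// Sketch only: drop health-check traffic at the handler and downsample the rest.
// Assumes the same otelhttp, sdktrace, chi, res and traceExporter used elsewhere
// in this issue, plus net/http and strings from the standard library.
func dropHealthChecks(req *http.Request) bool {
	// Returning false tells otelhttp not to start a span for this request.
	return !strings.HasPrefix(req.URL.Path, "/health") && !strings.HasPrefix(req.URL.Path, "/ping")
}

// Handler construction with the filter applied.
handler := otelhttp.NewHandler(chi.NewMux(), "INGRESS", otelhttp.WithFilter(dropHealthChecks))

// ParentBased + TraceIDRatioBased keeps roughly 10% of root traces while still
// honoring the sampling decision of any propagated parent span.
tracerProvider := sdktrace.NewTracerProvider(
	sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	sdktrace.WithResource(res),
	sdktrace.WithBatcher(traceExporter),
)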

Here is how we are initializing our trace and metrics providers once on boot:

// TracerProvider creates an OTLP trace exporter and configures the corresponding tracer provider.
func TracerProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	// Set up a trace exporter
	traceExporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient())
	if err != nil {
		return nil, errors.Wrap(err, "failed to create trace exporter")
	}

	// Register the trace exporter with a TracerProvider, using a batch
	// span processor to aggregate spans before export.
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(res),
		sdktrace.WithBatcher(traceExporter),
	)
	otel.SetTracerProvider(tracerProvider)
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		))

	// Shutdown will flush any remaining spans and shut down the exporter.
	return tracerProvider.Shutdown, nil
}

// MeterProvider creates an OTLP metric exporter and configures the corresponding meter provider.
func MeterProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	metricExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create metric exporter")
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
		sdkmetric.WithResource(res),
	)
	otel.SetMeterProvider(meterProvider)

	return meterProvider.Shutdown, nil
}

These are then called, and shut down, in main:

	shutDownTracer, err := traceinstrument.TracerProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Fatal("failed to create trace provider", zap.Error(err))
	}
	
	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down trace provider", zap.Error(errr))
		}
	}(shutDownTracer)

	shutdownTraceMetrics, err := traceinstrument.MeterProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Fatal("failed to create meter provider", zap.Error(err))
	}

	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down metrics provider", zap.Error(errr))
		}
	}(shutdownTraceMetrics)

Note: we are also using otelhttp.NewTransport to wrap the default transport:

http.DefaultTransport = otelhttp.NewTransport(http.DefaultTransport)

If we remove the tracing setup, memory usage goes back to normal, so the leak is definitely in our tracing setup.

Expected behavior

Memory does not continuously increase over time.

@tjsampson tjsampson added the bug Something isn't working label Nov 8, 2024
@github-project-automation github-project-automation bot moved this to Needs triage in Go: Triage Nov 9, 2024
@MrAlias MrAlias added the help wanted Extra attention is needed label Nov 21, 2024
@MrAlias MrAlias moved this from Needs triage to High priority in Go: Triage Nov 21, 2024
@MrAlias MrAlias changed the title Otel Memory Leak OTel Memory Leak Nov 21, 2024
@MrAlias MrAlias added this to the v1.33.0 milestone Nov 21, 2024
MrAlias (Contributor) commented Dec 11, 2024

@tjsampson are you using cumulative temporality (the default IIRC for the metric gRPC exporter)?

Is the cardinality of your metrics unbounded?
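If cumulative temporality does turn out to be the culprit, switching the exporter to delta would look roughly like this. A sketch only: the selector shown returns delta for every instrument kind, which is a simplification.

// Sketch: request delta temporality instead of the cumulative default.
// Assumes otlpmetricgrpc, sdkmetric (go.opentelemetry.io/otel/sdk/metric) and
// metricdata (go.opentelemetry.io/otel/sdk/metric/metricdata).
metricExporter, err := otlpmetricgrpc.New(ctx,
	otlpmetricgrpc.WithTemporalitySelector(func(sdkmetric.InstrumentKind) metricdata.Temporality {
		return metricdata.DeltaTemporality
	}),
)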

MrAlias (Contributor) commented Dec 11, 2024

What is the definition of traceFilter? Does it hold state?

MrAlias (Contributor) commented Dec 11, 2024

Do you have any pprof data for the memory usage?

@MrAlias MrAlias moved this from High priority to Needs triage in Go: Triage Dec 11, 2024
@MrAlias MrAlias added response needed Waiting on user input before progress can be made and removed help wanted Extra attention is needed labels Dec 11, 2024
@MrAlias MrAlias moved this from Needs triage to Triaging in Go: Triage Dec 11, 2024
@MrAlias MrAlias removed this from the v1.33.0 milestone Dec 11, 2024
tjsampson (Author) commented Dec 15, 2024

What is the definition of traceFilter? Does it hold state?

func traceFilter(req *http.Request) bool {
	skipTraceAgents := []string{"kube-probe", "ELB-HealthChecker"}
	ua := req.UserAgent()

	for _, skipAgent := range skipTraceAgents {
		if strings.Contains(ua, skipAgent) {
			return false
		}
	}
	return true
}

Possibly the issue?

tjsampson (Author):

@tjsampson are you using cumulative temporality (the default IIRC for the metric gRPC exporter)?

Yes. We are using the default.

tjsampson (Author):

Is the cardinality of your metrics unbounded?

No, they aren't unbounded. We aren't actually using the OTel meter provider for any custom metrics (we're still using Prometheus for that). We are only using it for the default metrics so that we can link metrics to traces inside Grafana. So, unless the default metrics are unbounded, we should be safe (no custom metrics).
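If the default otelhttp metrics ever did become a cardinality problem, one option would be to allow-list attribute keys with a metric View, roughly like the sketch below. The instrument name pattern and attribute keys depend on the otelhttp/semconv version in use, so treat them as placeholders:

// Sketch: bound attribute cardinality on the default HTTP server metrics.
// Assumes attribute (go.opentelemetry.io/otel/attribute) and sdkmetric
// (go.opentelemetry.io/otel/sdk/metric); metricExporter and res are as above.
limitHTTPAttrs := sdkmetric.NewView(
	sdkmetric.Instrument{Name: "http.server.*"}, // wildcard match on the otelhttp instruments
	sdkmetric.Stream{AttributeFilter: attribute.NewAllowKeysFilter(
		"http.request.method", "http.response.status_code",
	)},
)

meterProvider := sdkmetric.NewMeterProvider(
	sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
	sdkmetric.WithResource(res),
	sdkmetric.WithView(limitHTTPAttrs),
)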

MrAlias (Contributor) commented Dec 18, 2024

What is the definition of traceFilter? Does it hold state?

func traceFilter(req *http.Request) bool {
	skipTraceAgents := []string{"kube-probe", "ELB-HealthChecker"}
	ua := req.UserAgent()

	for _, skipAgent := range skipTraceAgents {
		if strings.Contains(ua, skipAgent) {
			return false
		}
	}
	return true
}

Possibly the issue?

I don't see a way for this to grow. Guessing this isn't the issue.

MrAlias (Contributor) commented Dec 18, 2024

Are you able to collect pprof memory data? It's hard to say where the allocations are going at this point without it.
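A minimal way to get it, assuming nothing is exposed yet, is the standard net/http/pprof handler on a loopback side port; nothing here is OTel-specific:

// Sketch: serve /debug/pprof/ on a loopback-only side port.
// Requires a blank import of net/http/pprof, which registers the handlers on
// http.DefaultServeMux; the logger is whatever the service already uses.
go func() {
	if err := http.ListenAndServe("localhost:6060", nil); err != nil {
		log.Logger.Error("pprof listener stopped", zap.Error(err))
	}
}()

Heap snapshots can then be pulled with go tool pprof http://localhost:6060/debug/pprof/heap a few hours apart, and two saved profiles compared with go tool pprof -base old.pb.gz new.pb.gz to see which allocations are growing.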

tjsampson (Author) commented Dec 18, 2024

[Screenshot: memory usage graph, 2024-12-18 11:05 AM]

@MrAlias The leak is pretty slow. We just deployed the service a couple of days ago, as seen in the graph, but it's steadily climbing. I've been making tweaks/changes to that code, so I will post what we are currently running, just for posterity.

tjsampson (Author):

main.go

	var (
		ctx, cancel = context.WithCancel(context.Background())
		cfg         = config.Boot()
		err         error
	)
	defer cancel()

	traceRes, err := traceinstrument.TraceResource(ctx)
	if err != nil {
		log.Logger.Panic("failed to create trace resource", zap.Error(err))
	}

	shutDownTracer, err := traceinstrument.TracerProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Panic("failed to create trace provider", zap.Error(err))
	}

	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down trace provider", zap.Error(errr))
		}
	}(shutDownTracer)

	shutdownTraceMetrics, err := traceinstrument.MeterProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Panic("failed to create meter provider", zap.Error(err))
	}

	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down metrics provider", zap.Error(errr))
		}
	}(shutdownTraceMetrics)

	// ... do other stuff ...

instrument.go

func TraceResource(ctx context.Context) (*resource.Resource, error) {
	var (
		ciEnv            = os.Getenv("CI_ENVIRONMENT") 
		cloudEnvironment = os.Getenv("CLOUD_ENVIRONMENT") 
		attribs          = []attribute.KeyValue{serviceName, serviceVersion}
	)
	if ciEnv != "" {
		attribs = append(attribs, attribute.String("environment.ci", ciEnv))
	}
	if cloudEnvironment != "" {
		attribs = append(attribs, attribute.String("environment.cloud", cloudEnvironment))
	}

	return resource.New(ctx, resource.WithAttributes(attribs...))
}

// TracerProvider creates an OTLP trace exporter and configures the corresponding tracer provider.
func TracerProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	// If not enabled, use a no-op tracer provider.
	if !tracingEnabled() {
		log.Logger.Warn("ENABLE_TRACING false, using noop tracer provider")
		tp := traceNoop.NewTracerProvider()
		otel.SetTracerProvider(tp)
		return func(ctx context.Context) error {
			return nil
		}, nil
	}

	// Set up a trace exporter
	traceExporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient())
	if err != nil {
		return nil, errors.Wrap(err, "failed to create trace exporter")
	}

	// Register the trace exporter with a TracerProvider, using a batch
	// span processor to aggregate spans before export.
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(res),
		sdktrace.WithBatcher(traceExporter),
	)
	otel.SetTracerProvider(tracerProvider)
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		))


	// Shutdown will flush any remaining spans and shut down the exporter.
	return tracerProvider.Shutdown, nil
}

// MeterProvider creates an OTLP metric exporter and configures the corresponding meter provider.
func MeterProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	// If not enabled, use a no-op meter provider.
	if !tracingEnabled() {
		log.Logger.Warn("ENABLE_TRACING false, using noop meter provider")
		mp := metricNoop.NewMeterProvider()
		otel.SetMeterProvider(mp)
		return func(ctx context.Context) error {
			return nil
		}, nil
	}

	metricExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create metric exporter")
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
		sdkmetric.WithResource(res),
	)
	otel.SetMeterProvider(meterProvider)

	return meterProvider.Shutdown, nil
}

func tracingEnabled() bool {
	traceEnabled := os.Getenv("ENABLE_TRACING")
	enabled, err := strconv.ParseBool(traceEnabled)
	if err != nil {
		return false
	}
	return enabled
}

func CloudCustomerSpanEvent(ctx context.Context, evt string) {
	span := trace.SpanFromContext(ctx)
	bag := baggage.FromContext(ctx)
	tc := attribute.Key("customer")
	cust := bag.Member("customer")
	span.AddEvent(evt, trace.WithAttributes(tc.String(cust.Value())))
}

server.go

	server := &http.Server{
		ReadHeaderTimeout: time.Second * 5,
		ReadTimeout:       c.ReadTimeout,
		WriteTimeout:      c.WriteTimeout,
		IdleTimeout:       c.IdleTimeout,
		Addr:              fmt.Sprintf(":%d", c.Port),
		Handler:           otelhttp.NewHandler(chi.NewMux(), "INGRESS"),
	}

go.mod

	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.58.0
	go.opentelemetry.io/contrib/propagators/b3 v1.32.0
	go.opentelemetry.io/otel v1.32.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.32.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.32.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.32.0
	go.opentelemetry.io/otel/metric v1.32.0
	go.opentelemetry.io/otel/sdk v1.32.0
	go.opentelemetry.io/otel/sdk/metric v1.32.0
	go.opentelemetry.io/otel/trace v1.32.0

tjsampson (Author):

I've got this hooked up to Grafana Pyroscope and am doing some continuous profiling. It typically takes a few days for the leak to really show itself, just because of how slow it is. From what I can tell early on, it seems to be around these calls:

- go.opentelemetry.io/otel/sdk/metric/exemplar.newStorage
- google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool
- go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor

I am going to try to get some heap dumps periodically over the next few days. However, given that it's the holidays, I am not sure the lower levels of volume/traffic in these test/dev environments will produce the same effect, so it might have to wait until after the new year.
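The periodic heap dumps will probably look something like this (a sketch using only the standard library; the interval and file path are arbitrary):

// Sketch: write a heap profile to disk every hour so growth can be diffed later.
// Uses fmt, os, time, runtime and runtime/pprof from the standard library.
go func() {
	for range time.Tick(time.Hour) {
		f, err := os.Create(fmt.Sprintf("/tmp/heap-%d.pb.gz", time.Now().Unix()))
		if err != nil {
			continue
		}
		runtime.GC() // run a GC first so the profile reflects current live objects
		_ = pprof.WriteHeapProfile(f)
		_ = f.Close()
	}
}()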

boscar commented Dec 30, 2024

Experiencing a similar issue. From analysing pprof heap files, it is possibly connected to metric.PeriodicReader and the compute-aggregation path in metric.pipeline.

Not sure if this is the same issue that @tjsampson is having, but the symptoms are similar (slow memory leak that can be observed over days).

[Screenshot: pprof heap profile, 2024-12-30 11:27]
