OTel Memory Leak #6315
Comments
@tjsampson are you using cumulative temporality (the default IIRC for the metric gRPC exporter)? Is the cardinality of your metrics unbounded? |
What is the definition of |
Do you have any pprof data for the memory usage? |
// traceFilter reports whether a request should be traced; requests from
// health-check user agents are skipped.
func traceFilter(req *http.Request) bool {
skipTraceAgents := []string{"kube-probe", "ELB-HealthChecker"}
ua := req.UserAgent()
for _, skipAgent := range skipTraceAgents {
if strings.Contains(ua, skipAgent) {
return false
}
}
return true
}

Possibly the issue? |
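For context, a filter like this only takes effect if it is actually passed to the otelhttp handler. Below is a minimal sketch of that wiring, assuming chi v5 and the filter shown above; the operation name and option placement are illustrative, not taken from this issue.

// Hypothetical wiring: otelhttp only consults the filter if it is
// supplied via the WithFilter option when the handler is built.
import (
	"net/http"

	"github.com/go-chi/chi/v5"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func newIngressHandler() http.Handler {
	return otelhttp.NewHandler(
		chi.NewMux(),
		"INGRESS",
		// traceFilter is the function shown above; requests it rejects
		// are passed through without being traced.
		otelhttp.WithFilter(traceFilter),
	)
}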
Yes. We are using the default. |
No, they aren't unbounded. We aren't actually using the otel meter provider for any custom metrics (still using prometheus for that). We are using it for the default metrics so that we can link metrics to traces inside Grafana. So, unless the default metrics are unbounded, we should be safe (no custom metrics). |
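Worth noting, since cumulative temporality came up: with cumulative aggregation the SDK keeps per-attribute-set state for the life of the process, so memory only stabilises once the set of attribute combinations stops growing. If that ever turns out to be the problem, one possible knob (a sketch only, not something from this thread) is to ask the gRPC exporter for delta temporality, which can reduce what the SDK retains between exports (exact behaviour depends on the SDK version):

// Sketch: build the OTLP gRPC metric exporter with a selector that
// prefers delta temporality instead of the cumulative default.
import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func newDeltaMetricExporter(ctx context.Context) (sdkmetric.Exporter, error) {
	return otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithTemporalitySelector(
			func(sdkmetric.InstrumentKind) metricdata.Temporality {
				return metricdata.DeltaTemporality
			},
		),
	)
}

A blanket delta selector like this is only to illustrate the option; in practice UpDownCounters are usually left cumulative.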
I don't see a way for this to grow. Guessing this isn't the issue. |
Are you able to collect pprof memory data? It is hard to say where the allocations are going at this point without it. |
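For anyone hitting this later: the easiest way to get that data from a long-running service is usually the standard library's net/http/pprof handlers on a loopback-only port. A minimal sketch (the port is arbitrary):

// Sketch: expose /debug/pprof/* on localhost so heap profiles can be
// pulled from a running instance without a redeploy.
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func startPprofServer() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

A heap snapshot can then be captured with go tool pprof http://localhost:6060/debug/pprof/heap.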
@MrAlias The leak is pretty slow. We just deployed the service a couple of days ago, as seen in the graph, but it's steadily climbing. I've been making tweaks/changes to that code. I will post what we are currently running, just for posterity. |
main.go
var (
ctx, cancel = context.WithCancel(context.Background())
cfg = config.Boot()
err error
)
defer cancel()
traceRes, err := traceinstrument.TraceResource(ctx)
if err != nil {
log.Logger.Panic("failed to create trace resource", zap.Error(err))
}
shutDownTracer, err := traceinstrument.TracerProvider(ctx, traceRes)
if err != nil {
log.Logger.Panic("failed to create trace provider", zap.Error(err))
}
defer func(onShutdown func(ctx context.Context) error) {
if errr := onShutdown(ctx); errr != nil {
log.Logger.Error("error shutting down trace provider", zap.Error(errr))
}
}(shutDownTracer)
shutdownTraceMetrics, err := traceinstrument.MeterProvider(ctx, traceRes)
if err != nil {
log.Logger.Panic("failed to create meter provider", zap.Error(err))
}
defer func(onShutdown func(ctx context.Context) error) {
if errr := onShutdown(ctx); errr != nil {
log.Logger.Error("error shutting down metrics provider", zap.Error(errr))
}
}(shutdownTraceMetrics)
.... do other stuff .....

instrument.go
func TraceResource(ctx context.Context) (*resource.Resource, error) {
var (
ciEnv = os.Getenv("CI_ENVIRONMENT")
cloudEnvironment = os.Getenv("CLOUD_ENVIRONMENT")
attribs = []attribute.KeyValue{serviceName, serviceVersion}
)
if ciEnv != "" {
attribs = append(attribs, attribute.String("environment.ci", ciEnv))
}
if cloudEnvironment != "" {
attribs = append(attribs, attribute.String("environment.cloud", cloudEnvironment))
}
return resource.New(ctx, resource.WithAttributes(attribs...))
}
// TracerProvider creates an OTLP exporter and configures the corresponding trace provider.
func TracerProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
// If not enabled, use a no-op tracer provider.
if !tracingEnabled() {
log.Logger.Warn("ENABLE_TRACING false, using noop tracer provider")
tp := traceNoop.NewTracerProvider()
otel.SetTracerProvider(tp)
return func(ctx context.Context) error {
return nil
}, nil
}
// Set up a trace exporter
traceExporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient())
if err != nil {
return nil, errors.Wrap(err, "failed to create trace exporter")
}
// Register the trace exporter with a TracerProvider, using a batch
// span processor to aggregate spans before export.
tracerProvider := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()),
sdktrace.WithResource(res),
sdktrace.WithBatcher(traceExporter),
)
otel.SetTracerProvider(tracerProvider)
otel.SetTextMapPropagator(
propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
// Shutdown will flush any remaining spans and shut down the exporter.
return tracerProvider.Shutdown, nil
}
// MeterProvider creates an OTLP exporter and configures the corresponding meter provider.
func MeterProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
// If not enabled, use a no-op meter provider.
if !tracingEnabled() {
log.Logger.Warn("ENABLE_TRACING false, using noop meter provider")
mp := metricNoop.NewMeterProvider()
otel.SetMeterProvider(mp)
return func(ctx context.Context) error {
return nil
}, nil
}
metricExporter, err := otlpmetricgrpc.New(ctx)
if err != nil {
return nil, errors.Wrap(err, "failed to create metric exporter")
}
meterProvider := sdkmetric.NewMeterProvider(
sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
sdkmetric.WithResource(res),
)
otel.SetMeterProvider(meterProvider)
return meterProvider.Shutdown, nil
}
func tracingEnabled() bool {
traceEnabled := os.Getenv("ENABLE_TRACING")
enabled, err := strconv.ParseBool(traceEnabled)
if err != nil {
return false
}
return enabled
}
func CloudCustomerSpanEvent(ctx context.Context, evt string) {
span := trace.SpanFromContext(ctx)
bag := baggage.FromContext(ctx)
tc := attribute.Key("customer")
cust := bag.Member("customer")
span.AddEvent(evt, trace.WithAttributes(tc.String(cust.Value())))
}

server.go
server := &http.Server{
ReadHeaderTimeout: time.Second * 5,
ReadTimeout: c.ReadTimeout,
WriteTimeout: c.WriteTimeout,
IdleTimeout: c.IdleTimeout,
Addr: fmt.Sprintf(":%d", c.Port),
Handler: otelhttp.NewHandler(chi.NewMux(), "INGRESS"),
},

go.mod
|
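If the cumulative-temporality/cardinality angle turns out to matter, one hedged mitigation is to cap which attributes the SDK keeps for the otelhttp server instruments by registering a View on the MeterProvider from instrument.go. The wildcard instrument name and allowed keys below are assumptions for illustration (they depend on the semconv version otelhttp emits), and the snippet additionally needs the go.opentelemetry.io/otel/attribute import:

// Sketch: a View that drops all but a few low-cardinality attributes
// from the "http.server.*" instruments, keeping the number of distinct
// metric streams bounded.
meterProvider := sdkmetric.NewMeterProvider(
	sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
	sdkmetric.WithResource(res),
	sdkmetric.WithView(sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.*"},
		sdkmetric.Stream{
			AttributeFilter: attribute.NewAllowKeysFilter(
				"http.method", "http.status_code", "http.route",
			),
		},
	)),
)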
I've got this hooked up to Grafana Pyroscope and am doing some continuous profiling. It typically takes a few days for the leak to really show itself, just because of how slow it is. From what I can tell early on, it seems to be around these calls:
I am going to try and get some heap dumps periodically over the next few days. However, given that it's the holidays, I am not sure if the lower levels of volume/traffic in these test/dev environments will produce the same effect, so I might have to wait until after the new year. |
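A sketch of one way to automate those periodic dumps (not necessarily how this service does it) is to write a heap profile to disk on a timer with runtime/pprof; the interval and path below are placeholders:

// Sketch: write a heap profile every hour so snapshots from different
// points in the leak can later be compared, e.g. with
// `go tool pprof -diff_base heap-0.pb.gz heap-5.pb.gz`.
import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func dumpHeapPeriodically(dir string) {
	go func() {
		for i := 0; ; i++ {
			time.Sleep(time.Hour)
			f, err := os.Create(fmt.Sprintf("%s/heap-%d.pb.gz", dir, i))
			if err != nil {
				continue
			}
			runtime.GC() // flush up-to-date heap statistics before profiling
			_ = pprof.WriteHeapProfile(f)
			_ = f.Close()
		}
	}()
}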
Experiencing a similar issue. From analysing pprof heap files, it is possibly connected to […]. Not sure if this is the same issue that @tjsampson is having, but the symptoms are similar (a slow memory leak that can be observed over days). |
Description
Memory Leak in otel library code.
Environment
- go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
Steps To Reproduce
See Comment here: #5190 (comment)
Expected behavior
Memory does not continuously increase over time.