-
Notifications
You must be signed in to change notification settings - Fork 828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
performance: high latency when using otel libs inside common endpoints #4741
Comments
Hi thanks for the detailed report! I've had a quick look at the benchmark code, and it looks like the Consider using the |
Thank you for your response @pichlermarc I added Isolated OTEL JS BenchmarksGraphQL Debugger BenchmarksSummaryThe benchmarks indicate that using the BatchProcessor is significantly more performant than the SimpleProcessor. Thank you again for pointing this out. I have some questions: Our benchmarks only produce a single span. Given there is nothing to batch, how does changing this processor affect the performance? Does this mean the SimpleProcessor is exporting the spans inline synchronously? Is this even possible? Are the differences between the two processors documented anywhere? Could the logic be wrapped in a setTimeout to prevent blocking the immediate loop? Would it be wise to add a log in the console when users use SimpleProcessor? |
Glad to hear that this improved things. 🙂 There's still some performance overhead, but we're hoping to gradually improve this in the future by optimizing the SDK and Exporters.
It batches spans over multiple requests, once the batch is large enough it will then export the spans.
It does some work synchronously, namely serializing the OTLP message and starting the http request it seems to be enough to block the execution.
It is documented on opentelemetry.io, but I think it would also make sense to document this in the JSDoc for the processor.
I'd go with a JSDoc addition first, then re-evaluate if we need the log. 🙂 |
Hi @pichlermarc.
I agree on the overhead. Even with using the Batch processor, in the screenshot, from our benchmarks, it shows that adding a few lines of otel js, using the batch processor, increases latency from Due to these findings, I made a PR to move the benchmarks I am using into this repo: I have added these benchmarks to the source code of OpenTelemetry as they provide transparent insight into performance plus allow the community to iterate on performance improvements.
Once we have some benchmarks in place we could experiment with things like this, as introducing a timeout and injecting in the current timestamp should not affect people's applications.
Added here: |
It looks like at this point this became a documentation issue which has been updated (thank you!). We'll close this out and leave the other PR open for the benchmarks as that is still being looked into. |
What happened?
I am writing this issue to express the performance issues I've encountered using your libraries within common Node.js APIs such as HTTP, Express, and GraphQL. I have some links, benchmarks and examples to share with you below.
Related:
Background
We are building an observability platform leveraging OTEL JS on the users’ client side. It was reported to us that using our libraries with their APIs resulted in massive latency issues. We did some investigation broken down in full detail here:
Our implementation essentially wraps a GraphQL resolver, picking up on values in the tree info argument, capturing attributes, and starting a span. As described in our adjacent issue, we took several necessary steps to ensure it wasn’t an issue with our implementation. Additionally, we created a few benchmarks with and without GraphQL usage to compare performance impacts.
Hypothesis
We see that the code added in the following diff is causing a huge increase in latency to this endpoint.
Alongside our benchmarks, as detailed below, we saw the following as confirmation of our hypothesis:
Benchmarks
When we ran with our hypothesis, we found a bunch of benchmarks like opentelemetry-java/benchmarks/ and #4171. We found these benchmarks to be not that helpful, and we were unable to compare the two languages or understand how OTEL JS affects performance inside an endpoint.
From this, we created a few types of benchmarks to test our findings and improve our performance.
Isolated OTEL JS Benchmarks
These benchmarks demonstrate using OTEL inside a basic endpoint. Our results show that adding just a few lines of OTEL code to your handler will result in a significant reduction in the performance of your API. For example, a basic http endpoint operating at
6.26ms
latency more than triples the average time to22.03ms
when OTEL is added, rendering it unusable for any production setting.GraphQL Debugger Benchmarks
GraphQL Specific fork of isolated benchmarks:
The text was updated successfully, but these errors were encountered: