Capture GenAI prompts and completions as events or attributes #2010
Proposal: We don't capture prompt/completion contents by default - that's already the case, so we don't need to pick a new default. If the user explicitly enables content capture, let's stamp it on spans as attributes - we should define the format and record the contents as JSON strings. We'll still need to address the large/sensitive-content-by-reference problem in the future:
When that comes along, we'll provide a new way to opt in to a new solution, which might replace attributes or coexist with them. Stamping contents as attributes now would give us a simple solution for existing tooling and for less mature applications, and would not block progress on the proper solution to the large-content problem.
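To make the attribute idea concrete, here is a minimal sketch of stamping JSON-string content on a span. The attribute names (`gen_ai.input.messages`, `gen_ai.output.messages`) and the helpers are illustrative assumptions, not the final semconv, and the span is mocked as a plain dict of attributes:

```python
import json

def serialize_messages(messages):
    # One JSON blob per direction keeps the whole conversation readable
    # in any backend that can display span attributes.
    return json.dumps(messages, ensure_ascii=False)

def record_content(span_attributes, inputs, outputs, capture_content=False):
    # Content capture stays opt-in: by default nothing is recorded,
    # which matches the current behavior described above.
    if not capture_content:
        return span_attributes
    # Hypothetical attribute names - the semconv would define the real ones.
    span_attributes["gen_ai.input.messages"] = serialize_messages(inputs)
    span_attributes["gen_ai.output.messages"] = serialize_messages(outputs)
    return span_attributes

attrs = record_content(
    {"gen_ai.operation.name": "chat"},
    inputs=[{"role": "user", "content": "Hello"}],
    outputs=[{"role": "assistant", "content": "Hi there!"}],
    capture_content=True,
)
```

In a real instrumentation the dict would be replaced by `span.set_attribute(...)` calls on the active span.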
So happy to see this proposal!
Another concern is about the long-term memory costs for in-proc telemetry tools.
Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural for the evaluation result to be easily linked to a span or trace, because that's the real shape of a complete GenAI session.
Yes! In fact, this is what I think is ideal. But I believe we still have a lot of work to do before that day arrives:
Another option is to store a preview of the prompts/completions - say, the first 1000 tokens - which makes troubleshooting easier for the user; at least they know what the prompts/completions are about.
Another way to solve this is to keep the original data and use the OTel Collector to remove the sensitive content if the user wants to.
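A sketch of the preview idea, assuming a character-based cut for simplicity (an exact 1000-token preview would need a tokenizer):

```python
def content_preview(text, max_chars=4000):
    # Keep only the head of the content so operators can see what a
    # prompt/completion was about without storing the full payload.
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "...[truncated]"
```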
I like this proposal! Coming from the experience with Spring AI, I see the value in having prompts and completions contextually available as part of a span (it's also the current behavior in that framework). It's a bit unfortunate that span events have been deprecated, as they would have been my primary suggestion instead of span attributes.
I agree this is a really nice idea, but it is very forward-thinking with respect to data portability. Do we have any good signal that this is something that will eventually happen?
👍 I do want to add a few concerns I don't think we've discussed yet
Thank you for being open to revisiting this decision. The initial experiment with log events was indeed both unpopular and expensive. Span attributes are the way to meet the ecosystem where it is, and they let us focus on the problems of GenAI instead of obscure infrastructure. They also allow not just the existing vendor ecosystem but also systems like Jaeger to be first-class spec implementations again. More details below, since not everyone has had as long a history with the prior state as I have. I hope it helps:

The High Cost of Log Events

The current approach to events has demanded significant effort with limited payoff. Elastic alone has invested close to - or possibly more than - a person-year on this topic. This effort spanned:
UX made more difficult

I've personally worked with projects like otel-tui and Zipkin to add log support specifically for this spec. The experience involved more navigation than before, with no benefit. Since OTel has only a few GenAI instrumentations, you end up relying on third parties like langtrace, openinference, or openllmetry to fill the gaps. Most use span attributes, so the full trace gets very confusing: some parts are in attributes, while others require clicking or typing around to figure out which logs are attached to what.

Focus imperative

I'm not alone in needing a couple of hours a day just to keep up with GenAI ecosystem change. We have to make decisions that protect the focus that's important. A deeply troubled technical choice hurts our ability to stay relevant, as these problems steal time from what matters. This is a primary reason so few vendors adopted the log events. In personal comments to me, many said they simply cannot afford to redo their architecture and UX just to satisfy the log events. It would come at the cost of time spent in service of customers, so it just didn't happen.

We have options, but far fewer with log events

Since this started, we have looked at many ways to get things to work. While most didn't pan out and/or were theoretical (a collector processor can in theory do anything), we have a lot of options if we flip the tables back as suggested:
We don't have to boil the spec ocean with this decision

This decision is about chat completions and similar APIs such as responses. It does not have to be the same decision as for an as-yet unspecified semantic mapping for realtime APIs. We shouldn't hold this choice hostage to unexplored areas that would vary wildly per vendor. Chat completions is a high-leverage API, and many inference platforms support it the same way, by emulating OpenAI. Let's not get distracted by potential other APIs that might not fit.

Conclusion

In summary, the experiment with events, especially logs, taught us valuable lessons but proved too costly and unpopular to sustain. By focusing on span attributes, we can reduce complexity, improve UX, and align with the ecosystem's strengths - paving the way for a spec the community will embrace. I'm excited to see this revisited and look forward to refining OTel together, as a group, not just those who have log ingress like Elastic.
Just popping in here to say that I think this is the right proposal. Whether it's the ideal way or not, most tools for evals (and arguably production monitoring of AI behavior) treat traces as the source of truth instead of connective tissue between other things.
If storing large and sensitive content on the telemetry is a problem (it is) we'll have to solve it. Companies that work with enterprise customers would need to find a solution to this regardless of the signal large and sensitive data is recorded on.
Our current events don't work for realtime API and don't even attempt to cover multi-modal content. If we need to record point-in-time telemetry, the events would be the right choice.
OTel assumes you always create client and server-side spans for each network call. We don't document how to model GenAI proxy server spans and events, and I don't believe they should repeat the client ones anyway. The TL;DR:
I've been thinking about these two extreme cases, and a spectrum of options between them:
Given that we're not dealing with real events (the data is neither point-in-time nor a signal independent from spans), it's weird to make the local and early-days development experience harder - installing extra components, applying extra config. So I consider prompts and completions (also tool definitions) to be verbose attributes.
Let's also consider the different verbosity profiles GenAI data could have (in ascending order):
We can provide two config options:
It should be possible to configure them independently (e.g. no full content, but an event per chunk)
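A sketch of reading two such knobs independently; the environment variable names here are made up for illustration, not spec'd:

```python
import os

def read_genai_capture_config(env=None):
    # Two independent switches, per the comment above:
    #   - whether full content is captured at all
    #   - whether an event is emitted per streamed chunk
    # Hypothetical variable names - the spec would define the real ones.
    env = os.environ if env is None else env
    return {
        "capture_content": env.get("OTEL_GENAI_CAPTURE_CONTENT", "false").lower() == "true",
        "event_per_chunk": env.get("OTEL_GENAI_EVENT_PER_CHUNK", "false").lower() == "true",
    }
```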
Hear me out: span events. Yes, really, seriously, honestly, genuinely, sincerely. We've dismissed them because they're being deprecated, which isn't necessarily true. open-telemetry/opentelemetry-specification#4430 says that the plan is to deprecate the span events API, but to allow emitting span events through the logging API. It gives this motivation:
which is exactly the situation here. So we have an OTel-approved way that can make everyone happy. Specifically, I propose:
Benefits:
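A sketch of the span-event shape this comment argues for. The span is mocked as a plain class so the example is self-contained; in a real instrumentation this would be the tracing API's `span.add_event(name, attributes=...)`, and the event names below are illustrative:

```python
import time

class FakeSpan:
    # Stand-in for an OTel span so the example runs without the SDK.
    def __init__(self):
        self.events = []

    def add_event(self, name, attributes=None):
        # Each message becomes a timestamped event attached to the span,
        # so content travels with the trace instead of as separate logs.
        self.events.append(
            {"name": name, "time_unix_nano": time.time_ns(), "attributes": attributes or {}}
        )

span = FakeSpan()
span.add_event("gen_ai.user.message", {"content": "Hello"})
span.add_event("gen_ai.assistant.message", {"content": "Hi there!"})
```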
Something that's tickling the back of my mind here is that I think we're dancing around some patterns and use cases but not mapping them out explicitly. The field is emerging, so we can't really be exhaustive, but laying out some of the needs for how you accomplish particular things seems important. Some stuff that comes to mind which implies some pretty different needs related to representing, emitting, storing, and querying data:
Each is also different in terms of its prominence and support in the broader tooling ecosystem. I would propose, for example, that far more organizations are doing fairly basic RAG that produces a structured JSON object for B2B SaaS than are doing realtime AI streaming audio responses for consumer applications. The former group certainly has more tools available to help improve quality. Anyways, that screed aside, it's hard for me to prefer one existing data format over another because I don't yet have a good sense of which would be good/bad/terrible for each of the things I listed above. Really the only guiding principle I have is that a lot of existing tools just put prompts and responses on spans and call it a day, so it's helpful to generally align with that for now.
Great idea to list different areas of interest that we'd like to cover. Some thoughts:
Based on today's spec and logs SIG discussions:
I'm going to come up with a more specific proposal along the lines of #1428
First of all, thanks for putting this proposal together. In my humble opinion, this is a very reasonable medium term approach. Separately, I think pass by reference to an external datastore + expanding OTEL to provide a side-channel for this reference will eventually be needed. I say this because when we are dealing with the likes of base64 encoded audio bytes for TTS/STT models or OpenAI's realtime API, we will need a way forward to accommodate this data format for users who choose to trace them as part of their systems.
In the medium term, I fully agree with this approach and think it is the only way to deal with how fast the space is evolving.
What do you mean by opt-in? Just that capturing content needs to be enabled?
Thanks @lmolkova for taking this, and I'm sorry that I couldn't get to it before.
Here's a more detailed proposal - Modeling GenAI calls on telemetry - and the prototype on the OpenAI instrumentation. TL;DR: The proposal:
So if you want this data on your telemetry, pick attributes, events, or even both - they are independent. Backends would recommend a configuration that works for them. As your app matures, volume grows, and compliance becomes important, forward this content to external storage. Check out the prototype - open-telemetry/opentelemetry-python-contrib#3397 - it's not hard to implement.
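A sketch of the "attributes, events, or both" dispatch the proposal describes; the mode names and the attribute/event names are assumptions for illustration:

```python
import json

def emit_content(span_attributes, events, messages, mode):
    # mode is one of "none", "attributes", "events", "both" - the two
    # channels are independent, as the proposal states.
    if mode in ("attributes", "both"):
        # Attributes channel: whole history as one JSON string.
        span_attributes["gen_ai.input.messages"] = json.dumps(messages)
    if mode in ("events", "both"):
        # Events channel: one event per message in the history.
        for message in messages:
            events.append({"name": "gen_ai.content", "attributes": message})
    return span_attributes, events
```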
@alexmojaki speaking about span events - they have the same fundamental problems as attributes.
They allow expressing heterogeneous arrays and one-level-deep maps better than span attributes do, but that can't be the only reason to keep span events alive. Let's keep pushing for structured attributes on spans - open-telemetry/opentelemetry-specification#4468 - we're going to discuss it on the spec call on Tue 4/7
These are not what I consider the problems of span attributes. I have listed other ways in which span events work better than span attributes.
If you set
It's not. Span events aren't being deprecated. What's the problem?
Happy to have that too, but like external storage, that's a big long-term project.
On the deprecation of span events: https://github.com/open-telemetry/opentelemetry-specification/pull/4430/files#r2044286664
@lmolkova I'd like to try advancing the discussion here instead of relying only on the SIG call, which didn't work out last week, and I'm not sure I can attend tomorrow. Please can you explain why you don't think span events allow everyone to be happy? Even if open-telemetry/opentelemetry-specification#4430 is merged as is and all my comments are ignored, it still implies that there will be a standard way to convert log events to span events. That means that even if the GenAI spec stuck to log events as the only option for prompts/completions, there would still be a way for users to configure their SDK to put everything into the span in a way that backends can be expected to understand. And IMO this would work better than JSON span attributes, and maybe even better than complex span attributes.
I believe what you're saying in your comment #2010 (comment) is that from a semconv perspective you'd prefer the following solution:
If I understand correctly, then span events are the backend's choice - not the contract we document in the semconv. You can create them from attributes or from log-based events if you want to. The problems I want to solve with this proposal:
So, if we pick log-based events per message as the only option, we make 1) the local/basic experience more difficult and 2) the fully secure prod experience harder (writing/reading N tiny blobs with the messages in the history is harder than working with one larger blob, and you'd need to analyze it as a whole anyway). I feel it's important to mention that span events won't solve the structured or multimodal binary content problems - GenAI data is complex and deeply nested. We'll need to JSON-stringify binary payloads, tool calls, tool arguments, and other things. I'd love to jump on a call to discuss the details. We have a GenAI SIG APAC call at 11pm PT today (8am CET tomorrow); I'll be there if you can join, or feel free to ping me on Slack.
A few more reasons that made me come up with this proposal are:
The GenAI SIG has been discussing how to capture prompts and completions for a while, and there are several issues that are blocked on this discussion (#1913, #1883, #1556)
What we have today on OTel semconv is a set of structured events exported separately from spans. This was implemented in #980 and discussed in #834 and #829. The motivation to use events was to
Turns out that:
So, the GenAI SIG is re-litigating this decision taking into account backends' feedback and other relevant issues: #1621, #1912, open-telemetry/opentelemetry-specification#4414
The fundamental question is whether this data belongs on the span or is a separate thing useful without a span.
How it can be useful without a span:
To be useful without a span, events should probably duplicate some of the span attributes - endpoint, model used, input parameters, etc. - which is not the case today
Are prompts/completions point-in-time telemetry?
Arguably, from what we've seen so far, GenAI prompts and completions are used along with the spans, and there is no strong use case for standalone events
Another fundamental question is how and if to capture unbounded (text, video, audio, etc) data on telemetry
It's problematic because of:
Imagine we had a solution that allowed us to store chat history somewhere and added a deep link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely have added this link as an attribute on the span.
Arguably, the long term solution to this problem is having this data stored separately from telemetry, but recorded by reference (e.g. URL on span that points to the chat history)
TL;DR: