Capture GenAI prompts and completions as events or attributes #2010
Proposal: We don't capture prompt/completion contents by default - that's already the case, so we don't need to pick a new default. If the user explicitly enables content capture, let's stamp it on spans as attributes - we should define the format and record the contents as JSON strings. We'll still need to address the large/sensitive-content-by-reference problem in the future:
When that comes along, we'll provide a new way to opt in to a new solution, which might replace attributes or coexist with them. Stamping contents as attributes now would give us a simple solution for existing tooling and for less mature applications, and would not block progress on the proper solution to the large-content problem.
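To make the attribute idea concrete, here is a minimal sketch of stamping JSON-string content on a span. The attribute names (`gen_ai.input.messages`, `gen_ai.output.messages`) and the helpers are illustrative assumptions, not the final semconv, and the span is mocked as a plain dict of attributes:

```python
import json

def serialize_messages(messages):
    # One JSON blob per direction keeps the whole conversation readable
    # in any backend that can display span attributes.
    return json.dumps(messages, ensure_ascii=False)

def record_content(span_attributes, inputs, outputs, capture_content=False):
    # Content capture stays opt-in: by default nothing is recorded,
    # which matches the current behavior described above.
    if not capture_content:
        return span_attributes
    # Hypothetical attribute names - the semconv would define the real ones.
    span_attributes["gen_ai.input.messages"] = serialize_messages(inputs)
    span_attributes["gen_ai.output.messages"] = serialize_messages(outputs)
    return span_attributes

attrs = record_content(
    {"gen_ai.operation.name": "chat"},
    inputs=[{"role": "user", "content": "Hello"}],
    outputs=[{"role": "assistant", "content": "Hi there!"}],
    capture_content=True,
)
```

In a real instrumentation the dict would be replaced by `span.set_attribute(...)` calls on the active span.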
So happy to see this proposal!
Another concern is about the long-term memory costs for in-proc telemetry tools.
Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural for the evaluation result to be easily linked to a span or trace, because that's the real shape of a complete GenAI session.
Yes! In fact, this is what I think is ideal. But I believe we still have a lot of work to do before that day arrives:
Another option is to store a preview of the prompts/completions - say, the first 1000 tokens - which makes troubleshooting easier for the user; at least they know what the prompts/completions are about.
Another way to solve this is to keep the original data and use the OTel Collector to remove the sensitive content if the user wants to.
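A sketch of the preview idea, assuming a character-based cut for simplicity (an exact 1000-token preview would need a tokenizer):

```python
def content_preview(text, max_chars=4000):
    # Keep only the head of the content so operators can see what a
    # prompt/completion was about without storing the full payload.
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "...[truncated]"
```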
I like this proposal! Coming from the experience with Spring AI, I see the value in having prompts and completions contextually available as part of a span (it's also the current behavior in that framework). It's a bit unfortunate that span events have been deprecated, as they would have been my primary suggestion instead of span attributes.
I agree this is a really nice idea, but it is very forward-thinking with respect to data portability. Do we have any good signal that this is something that will eventually happen?
👍 I do want to add a few concerns I don't think we've discussed yet
Thank you for being open to revisiting this decision. The initial experiment with log events was indeed both unpopular and expensive. Span attributes are the way to meet the ecosystem where it is, and they let us focus on the problems of GenAI instead of obscure infrastructure. They also allow not just the existing vendor ecosystem but also systems like Jaeger to be first-class spec implementations again. More details below, since not everyone has had as long a history with the prior state as I have. I hope it helps:

The High Cost of Log Events

The current approach to events has demanded significant effort with limited payoff. Elastic alone has invested close to - or possibly more than - a person-year on this topic. This effort spanned:
UX made more difficult

I've personally worked with projects like otel-tui and Zipkin to add log support specifically for this spec. The experience involved more navigation than before, with no benefit. Since OTel has only a few GenAI instrumentations, you end up relying on third parties like langtrace, openinference, or openllmetry to fill the gaps. Most use span attributes, so the full trace gets very confusing: some parts are in attributes, while others require clicking or typing around to figure out which logs are attached to what.

Focus imperative

I'm not alone in needing a couple of hours a day just to keep up with GenAI ecosystem change. We have to make decisions that protect the focus that's important. A deeply troubled technical choice hurts our ability to stay relevant, as these problems steal time from what matters. This is a primary reason so few vendors adopted the log events. In personal comments to me, many said they simply cannot afford to redo their architecture and UX just to satisfy the log events. It would come at the cost of time spent in service of customers, so it just didn't happen.

We have options, but far fewer with log events

Since this started, we have looked at many ways to get things to work. While most didn't pan out and/or were theoretical (a collector processor can in theory do anything), we have a lot of options if we flip the tables back as suggested:
We don't have to boil the spec ocean with this decision

This decision is about chat completions and similar APIs such as responses. It does not have to be the same decision as for an as-yet unspecified semantic mapping for realtime APIs. We shouldn't hold this choice hostage to unexplored areas that would vary wildly per vendor. Chat completions is a high-leverage API, and many inference platforms support it the same way, by emulating OpenAI. Let's not get distracted by potential other APIs that might not fit.

Conclusion

In summary, the experiment with events, especially logs, taught us valuable lessons but proved too costly and unpopular to sustain. By focusing on span attributes, we can reduce complexity, improve UX, and align with the ecosystem's strengths - paving the way for a spec the community will embrace. I'm excited to see this revisited and look forward to refining OTel together, as a group, not just those who have log ingress like Elastic.
Just popping in here to say that I think this is the right proposal. Whether it's the ideal way or not, most tools for evals (and arguably production monitoring of AI behavior) treat traces as the source of truth instead of connective tissue between other things.
If storing large and sensitive content on the telemetry is a problem (it is) we'll have to solve it. Companies that work with enterprise customers would need to find a solution to this regardless of the signal large and sensitive data is recorded on.
Our current events don't work for realtime API and don't even attempt to cover multi-modal content. If we need to record point-in-time telemetry, the events would be the right choice.
OTel assumes you always create client and server-side spans for each network call. We don't document how to model GenAI proxy server spans and events, and I don't believe they should repeat the client ones anyway. The TL;DR:
I've been thinking about these two extreme cases, and a spectrum of options between them:
Given that we're not dealing with real events (the data is neither point-in-time nor a signal independent from spans), it's weird to make the local and early-days development experience harder - installing extra components, applying extra config. So I consider prompts and completions (also tool definitions) to be verbose attributes.
Let's also consider the different verbosity profiles GenAI data could have (in ascending order):
We can provide two config options:
It should be possible to configure them independently (e.g. no full content, but an event per chunk)
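A sketch of reading two such knobs independently; the environment variable names here are made up for illustration, not spec'd:

```python
import os

def read_genai_capture_config(env=None):
    # Two independent switches, per the comment above:
    #   - whether full content is captured at all
    #   - whether an event is emitted per streamed chunk
    # Hypothetical variable names - the spec would define the real ones.
    env = os.environ if env is None else env
    return {
        "capture_content": env.get("OTEL_GENAI_CAPTURE_CONTENT", "false").lower() == "true",
        "event_per_chunk": env.get("OTEL_GENAI_EVENT_PER_CHUNK", "false").lower() == "true",
    }
```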
Hear me out: span events. Yes, really, seriously, honestly, genuinely, sincerely. We've dismissed them because they're being deprecated, which isn't necessarily true. open-telemetry/opentelemetry-specification#4430 says that the plan is to deprecate the span events API, but to allow emitting span events through the logging API. It gives this motivation:
which is exactly the situation here. So we have an OTel-approved way that can make everyone happy. Specifically, I propose:
Benefits:
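A sketch of the span-event shape this comment argues for. The span is mocked as a plain class so the example is self-contained; in a real instrumentation this would be the tracing API's `span.add_event(name, attributes=...)`, and the event names below are illustrative:

```python
import time

class FakeSpan:
    # Stand-in for an OTel span so the example runs without the SDK.
    def __init__(self):
        self.events = []

    def add_event(self, name, attributes=None):
        # Each message becomes a timestamped event attached to the span,
        # so content travels with the trace instead of as separate logs.
        self.events.append(
            {"name": name, "time_unix_nano": time.time_ns(), "attributes": attributes or {}}
        )

span = FakeSpan()
span.add_event("gen_ai.user.message", {"content": "Hello"})
span.add_event("gen_ai.assistant.message", {"content": "Hi there!"})
```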
Something that's tickling the back of my mind here is that I think we're dancing around some patterns and use cases but not mapping them out explicitly. The field is emerging, so we can't really be exhaustive, but laying out some of the needs for how you accomplish particular things seems important. Some stuff that comes to mind which implies some pretty different needs related to representing, emitting, storing, and querying data:
Each is also different in terms of its prominence and support in the broader tooling ecosystem. I would propose, for example, that far more organizations are doing fairly basic RAG that produces a structured JSON object for B2B SaaS than are doing realtime AI streaming audio responses for consumer applications. The former group certainly has more tools available to help improve quality. Anyways, that screed aside, it's hard for me to prefer one existing data format over another because I don't yet have a good sense of which would be good/bad/terrible for each of the things I listed above. Really the only guiding principle I have is that a lot of existing tools just put prompts and responses on spans and call it a day, so it's helpful to generally align with that for now.
Great idea to list different areas of interest that we'd like to cover. Some thoughts:
Based on today's spec and logs SIG discussions:
I'm going to come up with a more specific proposal along the lines of #1428
First of all, thanks for putting this proposal together. In my humble opinion, this is a very reasonable medium term approach. Separately, I think pass by reference to an external datastore + expanding OTEL to provide a side-channel for this reference will eventually be needed. I say this because when we are dealing with the likes of base64 encoded audio bytes for TTS/STT models or OpenAI's realtime API, we will need a way forward to accommodate this data format for users who choose to trace them as part of their systems.
In the medium term, I fully agree with this approach and think it is the only way to deal with how fast the space is evolving.
What do you mean by opt-in? Just that capturing content needs to be enabled?
Thanks @lmolkova for taking this, and I'm sorry that I couldn't get to it before.
Here's a more detailed proposal - Modeling GenAI calls on telemetry - and the prototype on the OpenAI instrumentation. TL;DR: The proposal:
So if you want this data on your telemetry, pick attributes, events, or even both - they are independent. Backends would recommend a configuration that works for them. As your app matures, volume grows, and compliance becomes important, forward this content to external storage. Check out the prototype - open-telemetry/opentelemetry-python-contrib#3397 - it's not hard to implement.
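A sketch of the "attributes, events, or both" dispatch the proposal describes; the mode names and the attribute/event names are assumptions for illustration:

```python
import json

def emit_content(span_attributes, events, messages, mode):
    # mode is one of "none", "attributes", "events", "both" - the two
    # channels are independent, as the proposal states.
    if mode in ("attributes", "both"):
        # Attributes channel: whole history as one JSON string.
        span_attributes["gen_ai.input.messages"] = json.dumps(messages)
    if mode in ("events", "both"):
        # Events channel: one event per message in the history.
        for message in messages:
            events.append({"name": "gen_ai.content", "attributes": message})
    return span_attributes, events
```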
@alexmojaki speaking about span events - they have the same fundamental problems as attributes.
They allow expressing heterogeneous arrays and one-level-deep maps better than span attributes do, but that can't be the only reason to keep span events alive. Let's keep pushing for structured attributes on spans - open-telemetry/opentelemetry-specification#4468 - we're going to discuss it on the spec call on Tue 4/7
These are not what I consider the problems of span attributes. I have listed other ways in which span events work better than span attributes.
If you set
It's not. Span events aren't being deprecated. What's the problem?
Happy to have that too, but like external storage, that's a big long-term project.
On the deprecation of span events: https://github.com/open-telemetry/opentelemetry-specification/pull/4430/files#r2044286664
@lmolkova I'd like to try advancing the discussion here instead of relying only on the SIG call, which didn't work out last week, and I'm not sure I can attend tomorrow. Please can you explain why you don't think span events allow everyone to be happy? Even if open-telemetry/opentelemetry-specification#4430 is merged as is and all my comments are ignored, it still implies that there will be a standard way to convert log events to span events. That means that even if the GenAI spec stuck to log events as the only option for prompts/completions, there would still be a way for users to configure their SDK to put everything into the span in a way that backends can be expected to understand. And IMO this would work better than JSON span attributes, and maybe even better than complex span attributes.
I believe what you're saying in your comment #2010 (comment) is that from a semconv perspective you'd prefer the following solution:
If I understand correctly, then span events are the backend's choice - not the contract we document in the semconv. You can create them from attributes or from log-based events if you want to. The problems I want to solve with this proposal:
So, if we pick log-based events per message as the only option, we make 1) the local/basic experience more difficult and 2) the fully secure prod experience harder (writing/reading N tiny blobs with the messages in the history is harder than working with one larger blob, and you'd need to analyze it as a whole anyway). I feel it's important to mention that span events won't solve the structured or multimodal binary content problems - GenAI data is complex and deeply nested. We'll need to JSON-stringify binary payloads, tool calls, tool arguments, and other things. I'd love to jump on a call to discuss the details. We have a GenAI SIG APAC call at 11pm PT today (8am CET tomorrow); I'll be there if you can join, or feel free to ping me on Slack.
A few more reasons that made me come up with this proposal are:
The GenAI SIG has been discussing how to capture prompts and completions for a while, and there are several issues that are blocked on this discussion (#1913, #1883, #1556)
What we have today on OTel semconv is a set of structured events exported separately from spans. This was implemented in #980 and discussed in #834 and #829. The motivation to use events was to
Turns out that:
So, the GenAI SIG is re-litigating this decision taking into account backends' feedback and other relevant issues: #1621, #1912, open-telemetry/opentelemetry-specification#4414
The fundamental question is whether this data belongs on the span or is a separate thing useful without a span.
How it can be useful without a span:
To be useful without a span, events should probably duplicate some of the span attributes - endpoint, model used, input parameters, etc. - which is not the case today
Are prompts/completions point-in-time telemetry?
Arguably, from what we've seen so far, GenAI prompts and completions are used along with the spans, and there is no strong use case for standalone events
Another fundamental question is how and if to capture unbounded (text, video, audio, etc) data on telemetry
It's problematic because of:
Imagine we had a solution that allowed us to store chat history somewhere and added a deep link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely have added this link as an attribute on the span.
Arguably, the long term solution to this problem is having this data stored separately from telemetry, but recorded by reference (e.g. URL on span that points to the chat history)
TL;DR: