Capture GenAI prompts and completions as events or attributes #2010

Open
lmolkova opened this issue Mar 19, 2025 · 25 comments
@lmolkova

lmolkova commented Mar 19, 2025

The GenAI SIG has been discussing how to capture prompts and completions for a while, and several issues are blocked on this discussion (#1913, #1883, #1556).

What we have today in OTel semconv is a set of structured events exported separately from spans. This was implemented in #980 and discussed in #834 and #829. The motivation for using events was to:

  • overcome size limits on attribute values by using event body
  • use a signal that supports structured body and attributes
  • have a clear 1:1 relationship between event name and structure (as opposed to polymorphic types or arrays of heterogeneous objects)
  • make it possible and easy to consume individual events and prompts/completions without spans
  • have verbosity controls

It turns out that:

  • after ~9 months, events have still not been adopted by GenAI-focused tracing tools and their external instrumentation libs, including Arize, Traceloop, and Langtrace - all of these providers use span attributes to capture prompts and completions
  • these backends consume prompts and completions along with spans and don't envision separating them - they store and visualize this data together

So, the GenAI SIG is re-litigating this decision, taking into account backends' feedback and other relevant issues: #1621, #1912, open-telemetry/opentelemetry-specification#4414


The fundamental question is whether this data belongs on the span or is a separate thing useful without a span.

How it can be useful without a span:

To be useful without a span, events would probably need to duplicate some of the span attributes - endpoint, model used, input parameters, etc. - which is not the case today

Are prompts/completions point-in-time telemetry?

Arguably, from what we've seen so far, GenAI prompts and completions are used along with spans, and there is no great use case for standalone events


Another fundamental question is how, and whether, to capture unbounded data (text, video, audio, etc.) on telemetry

It's problematic because of:

  • privacy - prompts can contain health concerns, SSNs, addresses, names, etc. Apps that must remain compliant with various regulations would have a problem sharing this data with a broad audience of DevOps humans. The data should be accessible for evaluations and audit, but access should be restricted
  • size - backends that aren't GenAI-specific are not optimized for this, and it's expensive to store such data in hot storage.

Imagine we had a solution that let us store chat history somewhere and add a deep link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely add the link as an attribute on the span.

Arguably, the long-term solution to this problem is having this data stored separately from telemetry but recorded by reference (e.g. a URL on the span that points to the chat history)
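
As a rough sketch of that by-reference idea (the attribute names and the upload target below are hypothetical, not agreed-upon semconv):

```python
# Sketch only: upload the chat history somewhere access-controlled and stamp
# just a deep link on the span. Attribute names below are hypothetical.

def upload_chat_history(messages) -> str:
    # Placeholder for a blob-store / conversation-service upload; a real
    # implementation would return a stable, access-controlled URL.
    return "https://chat-store.example/conversations/abc123"

def reference_attributes(messages) -> dict:
    """Attributes to set on the span instead of the raw content."""
    return {
        "gen_ai.input.messages.ref": upload_chat_history(messages),
        "gen_ai.input.messages.count": len(messages),
    }

attrs = reference_attributes([{"role": "user", "content": "hi"}])
# In real instrumentation: span.set_attribute(k, v) for each pair.
```

The backend can then deep-link from the span to the stored conversation, while the telemetry stream itself stays small and free of sensitive content.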


TL;DR:

  • The current approach doesn't work; we're blocked and need to find a path forward.
  • GenAI-focused backends, inner-loop scenarios, and non-production apps would benefit from having prompts/completions stamped on spans directly.
  • General-purpose observability backends and high-scale applications will have a problem with sensitive/large/binary data coming from end users on telemetry either way.
@lmolkova

lmolkova commented Mar 19, 2025

Proposal

We don't capture prompts/completion contents by default - that's already the case, so we don't need to pick a new default.

If the user explicitly enables contents, let's stamp them on spans as attributes - let's define the format for them and put them in as JSON strings. Maybe we need gen_ai.inputs, gen_ai.outputs, gen_ai.tool.calls, gen_ai.tool.definitions, etc. - we'll figure it out.
(Related: open-telemetry/opentelemetry-specification#4446)
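
A minimal sketch of what "JSON strings on span attributes" could look like - the gen_ai.inputs / gen_ai.outputs names are the placeholders from this proposal, not finalized semconv:

```python
import json

# Hypothetical attribute names from the proposal above; the exact format is
# still to be defined by the SIG.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is OpenTelemetry?"},
]
completion = [{"role": "assistant", "content": "An observability framework."}]

span_attributes = {
    "gen_ai.inputs": json.dumps(messages),
    "gen_ai.outputs": json.dumps(completion),
}
# In real instrumentation: span.set_attribute(key, value) for each pair.
# Backends can parse the JSON string back into structured messages:
parsed = json.loads(span_attributes["gen_ai.inputs"])
```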

We'll still need to address the large|sensitive-content-by-reference in the future:

  • it will involve a separate set of attributes to record references
  • the content will likely be stored separately from spans or logs

When this comes along, we'll provide a new way to opt into the new solution, which might replace or coexist with attributes.

Stamping them as attributes now would give us a simple solution for existing tooling and some less mature applications, and would not block us from making progress on the proper solution to the large-content problem.

@Cirilla-zmh

Cirilla-zmh commented Mar 20, 2025

So happy to see this proposal!

Streaming chunks, if captured at all, would have timestamps (#1964)

Another concern is about the long-term memory costs for in-proc telemetry tools.

Arguably, from what we've seen so far, GenAI prompts and completions are used along with spans, and there is no great use case for standalone events.

Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural if the evaluation result could be easily linked to a span or trace, because that's what a complete GenAI session looks like in the real world.

Imagine we had a solution that let us store chat history somewhere and add a deep link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely add the link as an attribute on the span.

Yes! In fact, this is what I think is ideal. But I believe we still have a lot of work to do before that day arrives:

  • We still need to standardize the data format to prevent it from becoming heterogeneous.
  • We may also need to provide some reference implementations/best practices showing how observability backends/evaluators can consume this data properly.

@ralf0131

Proposal

We don't capture prompts/completion contents by default - that's already the case, so we don't need to pick a new default.

Another option is to store a preview of the prompt/completion - say, the first 1000 tokens - which would make troubleshooting easier for the user; at least they'd know what the prompt/completion is about.
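
A sketch of that preview idea, using a character limit as a stand-in for "1000 tokens" (a real implementation would need a tokenizer):

```python
PREVIEW_LIMIT = 1000  # characters here; the comment above suggests tokens

def content_preview(text: str, limit: int = PREVIEW_LIMIT) -> str:
    """Keep enough of the prompt/completion for troubleshooting."""
    if len(text) <= limit:
        return text
    return text[:limit] + "...[truncated]"

short_preview = content_preview("What is the capital of France?")
long_preview = content_preview("x" * 5000)  # cut down to the limit
```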

If the user explicitly enables contents, let's stamp them on spans as attributes - let's define the format for them and put them in as JSON strings. Maybe we need gen_ai.inputs, gen_ai.outputs, gen_ai.tool.calls, gen_ai.tool.definitions, etc. - we'll figure it out. (Related: open-telemetry/opentelemetry-specification#4446)

We'll still need to address the large|sensitive-content-by-reference in the future:

  • it will involve a separate set of attributes to record references
  • the content will likely be stored separately from spans or logs

Another way to solve this is to keep the original data and use the OTel Collector to remove sensitive content if the user wants to.
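
Collector-side scrubbing would typically be done with a transform/redaction processor; as a rough in-process illustration of the same idea (the regex and attribute names are just examples, and the same logic could live in a SpanProcessor before export):

```python
import re

# Example only: strip US-SSN-shaped strings from string-valued attributes.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_attributes(attributes: dict) -> dict:
    # Non-string values (counts, flags, etc.) pass through untouched.
    return {
        key: SSN_PATTERN.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in attributes.items()
    }

clean = redact_attributes({
    "gen_ai.inputs": "my ssn is 123-45-6789",
    "retry.count": 2,
})
```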

@ThomasVitale

I like this proposal! Coming from the experience with Spring AI, I see the value in having prompts and completions contextually available as part of a span (this is also the current behavior in the framework). It's a bit unfortunate that span events have been deprecated, as they would have been my primary suggestion instead of span attributes.

@aabmass

aabmass commented Mar 20, 2025

Arguably, the long-term solution to this problem is having this data stored separately from telemetry but recorded by reference (e.g. a URL on the span that points to the chat history)

I agree this is a really nice idea, but it's very forward-thinking with respect to data portability. Do we have any good signal that this is something that will eventually happen?

Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural if the evaluation result could be easily linked to a span or trace, because that's what a complete GenAI session looks like in the real world.

👍

I do want to add a few concerns I don't think we've discussed yet

  1. Bidi streaming completions. Both OpenAI Realtime API and VertexAI Live API support this, where the user and model communicate over an ordered channel like a WebSocket. It's not clear how to model this as spans but events seem like a more obvious fit. I imagine this will eventually be in scope.
  2. Server side prompt/response capturing. A proxy (like TraceLoop Hub and others) or the inference engine itself (like self hosted vLLM or an LLM vendor) could emit prompt/responses. Since you can't modify remote spans, they would need to create server side spans as well to stamp the data onto. Vs with events, you can attach a log to any span ID. Maybe the "by reference" thing would solve it.

@codefromthecrypt

codefromthecrypt commented Mar 21, 2025

Thank you for being open to revisiting this decision. The initial experiment with log events was indeed both unpopular and expensive. Span attributes are the way to meet the ecosystem where it is, and they let us focus on the problems of GenAI instead of obscure infrastructure. This also allows not just the existing vendor ecosystem, but also systems like Jaeger, to be first-class spec implementations again.

More details below, since not everyone has had as long a history with the prior state as I have. I hope it helps:


The High Cost of Log Events

The current approach to events has demanded significant effort with limited payoff. Elastic alone has invested close to—or possibly more than—a person-year on this topic. This effort spanned:

  • Debates: high-volume discussions, made longer by being about an unimplemented feature of OTel
  • Community struggles: angst created by committing a spec change no eval company could adopt
  • Implementation challenges: implementing anyway, then hitting version lock-up or a lack of feature clarity per language
  • Infrastructure challenges: finding out that technology like OTTL can't bridge log events back to the span
  • Portability challenges: knowingly limiting systems like Jaeger from being full-featured due to the required log support

UX made more difficult

I've personally worked with projects like otel-tui and Zipkin to add log support specifically for this spec. The experience involved more navigating around than before, with no benefit. Since OTel has only a few GenAI instrumentations, you end up relying on third parties like Langtrace, OpenInference, or OpenLLMetry to fill in the gaps. Most use span attributes, so the full trace gets very confusing when some parts are in attributes and others require clicking or typing around to figure out which logs are attached to what.

Focus imperative

I'm not alone in needing a couple of hours a day just to keep up with GenAI ecosystem change. We have to make decisions that preserve that focus. A deeply troubled technical choice hurts our ability to stay relevant, as these problems steal time from what matters. This is a primary reason so few vendors adopted the log events. In personal comments to me, many said they simply cannot afford to redo their architecture and UX just to satisfy the log events. It would come at the cost of time spent in service to customers, so it just didn't happen.

We have options, but there are far fewer with log events.

Since this started, we have looked at many ways to get things to work. While most didn't pan out and/or were theoretical (a collector processor can in theory do anything), we have a lot of options if we flip the tables back as suggested:

  • language SDKs can provide hooks to control data policy and mapping (e.g. to span events)
  • OTTL can do the same when everything is in the same span
  • There's a blob-uploader API in progress, which could optionally be used by sites with size thresholds where links do more good than harm.

We don't have to boil the spec ocean in this decision

This decision is about Chat Completions and similar APIs such as Responses. It does not have to be the same decision as for an as-yet-unspecified semantic mapping for realtime APIs. We shouldn't hold this choice hostage to unexplored areas that would vary wildly per vendor. Chat Completions is a high-leverage API, and many inference platforms support it the same way, by emulating OpenAI. Let's not get distracted by potential other APIs that might not fit.

Conclusion

In summary, the experiment with events, especially logs, taught us valuable lessons but proved too costly and unpopular to sustain. By focusing on span attributes, we can reduce complexity, improve UX, and align with the ecosystem’s strengths—paving the way for a spec the community will embrace. I’m excited to see this revisited and look forward to refining OTel together, as a group, not just those who have log ingress like Elastic.

@cartermp

cartermp commented Mar 21, 2025

Just popping in here to say that I think this is the right proposal. Whether it's the ideal way or not, most tools for evals (and arguably for production monitoring of AI behavior) treat traces as the source of truth rather than as connective tissue between other things.

@lmolkova

lmolkova commented Mar 22, 2025

@aabmass

I agree this is a really nice idea, but it's very forward-thinking with respect to data portability. Do we have any good signal that this is something that will eventually happen?

If storing large and sensitive content on the telemetry is a problem (it is) we'll have to solve it. Companies that work with enterprise customers would need to find a solution to this regardless of the signal large and sensitive data is recorded on.

I do want to add a few concerns I don't think we've discussed yet

  1. Bidi streaming completions. Both OpenAI Realtime API and VertexAI Live API support this, where the user and model communicate over an ordered channel like a WebSocket. It's not clear how to model this as spans but events seem like a more obvious fit. I imagine this will eventually be in scope.

Our current events don't work for the Realtime API and don't even attempt to cover multi-modal content. If we need to record point-in-time telemetry, events would be the right choice.
So streaming chunks are likely good candidates for events. But prompts and buffered completions are not point-in-time - they don't happen at a specific time that isn't already covered by the span's start/end timestamps.

  2. Server side prompt/response capturing. A proxy (like TraceLoop Hub and others) or the inference engine itself (like self hosted vLLM or an LLM vendor) could emit prompt/responses. Since you can't modify remote spans, they would need to create server side spans as well to stamp the data onto. Vs with events, you can attach a log to any span ID. Maybe the "by reference" thing would solve it.

OTel assumes you always create client- and server-side spans for each network call. We don't document how to model GenAI proxy server spans and events, and I don't believe they should repeat the client ones anyway.

The TL;DR:
This proposal affects the current events that describe prompts and fully buffered completions. If we came up with criteria for what should be reported as an event, we'd say it should have a meaningful timestamp and be useful without a span. Neither seems true for the events we have today.
That doesn't mean we should always use span attributes for every future piece of GenAI content - absolutely not.

@lmolkova

I've been thinking about these two extreme cases, and a spectrum of options between them:

  1. Inner-loop/local experience, non-production applications, or applications that can get their production observability needs satisfied with existing GenAI-focused backends/tooling. Their needs:

    • easy setup, easy to use on any backend and in local observability tools (local Jaeger, otel-tui, Aspire, etc)
    • verbose data, telemetry volume is not a huge concern
    • don't require compliance with regulators

    Prompt and buffered-completion content passed as span attributes fits nicely:

    • prompts and full completions don't have a timestamp
    • easy OTel setup
  2. Enterprise applications:

    • need different access permissions for prompts/completions vs regular performance/usage telemetry
      • sensitive data should be annotated and potentially forwarded to a separate storage/tenant
    • telemetry pipelines need to be tuned for long chats and multi-modal content
      • it's not typical to use otel pipelines with large data, need different batching strategies
      • congestion caused by large content may affect other data and the basic monitoring capabilities
    • audit/compliance logs
      • pipelines that can provide necessary delivery guarantees for a subset of events and/or spans

    These apps can tolerate additional configuration - the OTel setup is usually complicated enough.

    Regardless of how prompts and full completions are recorded (events or attributes), they need some special handling around privacy and size. Events don't solve this on their own, and GenAI telemetry still needs a lot of special processing to meet enterprise app needs.

    Content stamped by reference on spans, and potentially uploaded via a special channel for such data, would satisfy most of these needs without requiring changes to spans, logs, or backends that don't like large content.
    And of course we can also explore allowing content to be passed by value.

Given that we're not dealing with real events (this data is not point-in-time, nor a signal independent of spans), it's weird to make the local and early-days development experience harder - installing extra components, applying extra config. So I consider prompts and completions (and tool definitions) to be verbose attributes.

@lmolkova

lmolkova commented Mar 22, 2025

Let's also consider the different verbosity profiles GenAI data could have (in ascending order):

  1. [Default] spans, but no contents
    • Best for performance, the most frugal in terms of telemetry volume, no sensitive data on telemetry
  2. Spans have reference to the full contents, upload it somewhere accessible to the tooling
    • Traditional telemetry pipelines and volume are not affected. Sensitive data is reported on a different channel.
    • This could be a safe-ish default if we set aside the perf impact
  3. Spans/events contain full (buffered) content, unified across models/providers
    • High span/event volume, sensitive data in the general telemetry stream
  4. Spans/events contain full content that captures the model request and response as-is (useful for audit/compliance logs, record/replay features)
  5. Events with streaming chunks and their content: the maximum level of observability and volume. Note: the basic event envelope may be 100x bigger than the actual content it carries.

We can provide two config options:

  1. choose how to report full content:
    • don't
    • report by value (attributes)
    • report by reference (export/upload via a separate channel)
  2. opt into low-level data:
    • exact model request and response (audit and replay) - it should go on the channel that handles sensitive/large content
    • per-chunk events

It should be possible to configure them independently (e.g. no full content, but an event per chunk).
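
As a sketch, the two options above could map to configuration like this (the names are illustrative only, not proposed environment variables):

```python
import os
from enum import Enum

class ContentMode(Enum):
    NONE = "none"        # default: no prompt/completion content
    VALUE = "value"      # stamp content as span attributes
    REFERENCE = "ref"    # upload content, record only a link

def read_genai_config(env=None):
    """Reads the two independent knobs described above."""
    env = os.environ if env is None else env
    mode = ContentMode(env.get("GENAI_CONTENT_MODE", "none"))
    chunk_events = env.get("GENAI_CHUNK_EVENTS", "false").lower() == "true"
    return mode, chunk_events

# Independently configurable: no full content, but an event per chunk.
mode, chunk_events = read_genai_config({"GENAI_CHUNK_EVENTS": "true"})
```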

@alexmojaki

Hear me out: span events. Yes, really, seriously, honestly, genuinely, sincerely.

We've dismissed them because they're being deprecated, which isn't necessarily true. open-telemetry/opentelemetry-specification#4430 says that the plan is to deprecate the span events API but allow emitting them through the logging API. It gives this motivation:

supporting use cases that rely on Span Events being emitted in the same proto envelope as their containing span.

which is exactly the situation here. So we have an OTel-approved way that can make everyone happy. Specifically, I propose:

  1. Stick to using log-events.
  2. Make it easy to attach those log-events to spans as span events. I would even suggest that this should be the default. Specifically, the instrumentation configuration should accept an optional event logger provider like any instrumentation would, but instead of defaulting to the global provider, default to a provider that attaches emitted events to the current span. This way the instrumentation works even if logging isn't configured. The default behaviour shouldn't depend on whether logging has been configured, it would be surprising for the data to move from spans to logs when users start globally configuring logging for unrelated reasons.
  3. Use the flat event-per-part structure described in Model AI chat events as a list of request/response messages, with each message containing a list of parts #1913 (comment).

Benefits:

  1. Almost all attributes (except tool call arguments and responses) can have primitive types and don't need JSON or anything special. This specifically requires the flat event structure. To do this with span attributes would require a hack like parallel arrays or keys containing numbers.
  2. Existing SDK span attribute limits generally aren't precise, they just apply to all attributes, certainly not specific parts inside specific JSON attributes. A single span attribute containing a big JSON array is more likely to get truncated than multiple smaller attributes. Truncating JSON will often make it unparseable. Truncating individual span event (or log event) attributes means that long prompts will be sensibly truncated while the rest of the data is intact.
  3. Users and backends that prefer log-events (e.g. when a long chat history would make a single envelope too big) can choose to use that.
  4. Underlying configuration mechanism is generically reusable for other instrumentations, even outside GenAI.
  5. SDKs that don't support logs at all can still implement instrumentations by just using span events directly. The instrumentations can smoothly transition when the SDK adds logs support.
  6. SDK configuration for limiting the number of span events per span can be used, although it should probably drop events from the middle to be useful.
  7. Log record processors can be used to filter events or specific event attributes before attaching them to a span, whereas using span processors to tweak attributes or span events is often more difficult/hacky/unsupported.
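
A tiny sketch of point 2 in the proposal above - an event logger that attaches emitted events to the current span instead of a log pipeline. Class and method names are illustrative; a real version would resolve the span via opentelemetry.trace.get_current_span:

```python
# Illustrative only: "emit" attaches the event to whatever span the supplied
# callable returns, so instrumentation works even when logging isn't configured.
class SpanAttachingEventLogger:
    def __init__(self, get_current_span):
        self._get_current_span = get_current_span

    def emit(self, name: str, attributes: dict) -> None:
        span = self._get_current_span()
        if span is not None:
            span.add_event(name, attributes=attributes)

class FakeSpan:
    """Stand-in for a recording span."""
    def __init__(self):
        self.events = []

    def add_event(self, name, attributes=None):
        self.events.append((name, attributes or {}))

span = FakeSpan()
logger = SpanAttachingEventLogger(lambda: span)
logger.emit("gen_ai.user.message", {"content": "hi"})
```

The instrumentation would accept such a provider where it currently accepts an event logger provider, defaulting to the span-attaching one.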

@cartermp

Something that's tickling the back of my mind here is that we're dancing around some patterns and use cases without mapping them out explicitly. The field is emerging, so we can't really be exhaustive, but laying out some of the needs for how you accomplish particular things seems important.

Some stuff that comes to mind which implies some pretty different needs related to representing, emitting, storing, and querying data:

  • Zero-shot request-responses (e.g., RAG + single prompt --> JSON object)
  • Short chat sessions (similar to above, but the user iterates a few times with the AI)
  • Cached long context + system prompt with much smaller request-responses thereafter (e.g., Contextual Retrieval)
  • Multimodal inputs and outputs
  • Realtime AI (typically audio today)
  • Streaming responses so users get more immediate feedback (chat, realtime, etc.)
  • Workflows with sub-operations where you don't stream the response into the rest of the system
  • Agent runs that perform N steps, where N is not knowable up front

Each is also different in terms of its prominence and support in the broader tooling ecosystem. I would propose, for example, that far more organizations are doing fairly basic RAG that produces a structured JSON object for B2B SaaS than doing realtime AI streaming audio responses for consumer applications. The former group certainly has more tools available to help improve quality.

Anyway, that screed aside, it's hard for me to prefer one existing data format over another because I don't yet have a good sense of which would be good/bad/terrible for each of the things I listed above. Really, the only guiding principle I have is that a lot of existing tools just put prompts and responses on spans and call it a day, so it's helpful to generally align with that for now.

@lmolkova

Great idea to list different areas of interest that we'd like to cover.

Some thoughts:

  1. Managed agents and cached content add a whole new level to this discussion:

    • you don't have full context on each agent step - instead you have a conversation/thread id or similar
    • you don't have access to the data retrieved (by the managed agent) - you might have a reference to the files/docs used, if the agent is kind enough to share intermediate steps with the caller.

    I.e. it won't be possible to run evaluations based on telemetry for those cases. And it may be more efficient to run evaluations based on the cached/stored context.

  2. The Realtime/Live APIs: how much do we want to (and can we) unify? Given how fast the space evolves, we'd need to play a whack-a-mole game and would always be months behind what model providers offer. I'd rather encourage the content for all streaming events to follow the original format as-is. We can still focus on a few areas we want to unify, and that can grow over time.

@lmolkova

lmolkova commented Mar 25, 2025

Based on today's spec and logs SIG discussions:

  • in a perfect world, unbounded and sensitive content doesn't belong on either spans or events (or span events)
  • stamping a reference on spans or events to such content is a good strategy
    • expanding OTel to provide a side-channel for this content is a huge project
    • relying on a custom solution to upload/export content is reasonable
  • having an opt-in 'pass by value' mode is reasonable too

I'm going to come up with a more specific proposal along the lines of #1428

@karthikscale3

First of all, thanks for putting this proposal together. In my humble opinion, this is a very reasonable medium term approach.

Separately, I think pass-by-reference to an external datastore, plus expanding OTel to provide a side channel for this reference, will eventually be needed. I say this because when we're dealing with the likes of base64-encoded audio bytes for TTS/STT models or OpenAI's Realtime API, we'll need a way to accommodate this data format for users who choose to trace them as part of their systems.

I'd rather encourage the content for all streaming events to follow the original format as-is. We can still focus on a few areas we want to unify, and that can grow over time.

In the medium term, I fully agree with this approach and think this is the only way to deal with how fast the space is evolving.

@alexmojaki

  • having an opt-in 'pass by value' mode is reasonable too

What do you mean by opt-in? Just that capturing content needs to be enabled?

@nirga

nirga commented Mar 26, 2025

Thanks @lmolkova for taking this on, and I'm sorry that I couldn't get to it before.
I'm a strong supporter of using span attributes as the solution for prompts. I agree that size can be an issue - both in the client SDK and on the server side - and we're starting to see this happening (see traceloop/openllmetry#2790). We'll need to figure out a way to offload larger payloads (like images) together with this proposal.

@lmolkova

lmolkova commented Mar 31, 2025

Here's a more detailed proposal - Modeling GenAI calls on telemetry - and a prototype on the OpenAI instrumentation.

TL;DR:
I see separate storage for large/sensitive content as a long-term goal.
Events don't solve the sensitivity problem and make size only a bit less problematic. They also complicate the experience for non-production/low-scale/etc. apps.

The proposal:

So if you want this data on your telemetry - pick attributes, events, or even both - they are independent. Backends would recommend a configuration that works for them.

As your app matures, volume grows, and compliance becomes important, forward this content to external storage.

Check out the prototype - open-telemetry/opentelemetry-python-contrib#3397 - it's not hard to implement.

@lmolkova

lmolkova commented Mar 31, 2025

@alexmojaki speaking about span events - they have the same fundamental problems as attributes.

  • you can't record them if the span is sampled out
  • they are exported along with the span, i.e. everything is buffered in memory
  • they have standard attributes that don't work with large/binary content

They allow heterogeneous arrays and one-level-deep maps to be expressed better than span attributes do, but that can't be the only reason to keep span events alive. Let's keep pushing for structured attributes on spans - open-telemetry/opentelemetry-specification#4468 - we're going to discuss it on the spec call on Tue 4/7.

@alexmojaki
@alexmojaki speaking about span events - they have the same fundamental problems as attributes.

These are not what I consider the problems of span attributes. I have listed other ways in which span events work better than span attributes.

Events don't solve sensitivity problem and make size just a bit less problematic.

If you set OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT, then span events and log-events still work reasonably well, while a giant span attribute will tend to fail miserably.
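
A quick illustration of that point (the limit value is arbitrary):

```python
import json

ATTRIBUTE_LIMIT = 40  # stand-in for OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT

# One big JSON span attribute: truncation breaks parsing entirely.
big_attribute = json.dumps([{"role": "user", "content": "x" * 100}])
truncated = big_attribute[:ATTRIBUTE_LIMIT]
try:
    json.loads(truncated)
    still_parseable = True
except json.JSONDecodeError:
    still_parseable = False

# Per-event primitive attributes: only the long value is cut; the rest of
# the structure (role, other fields) survives intact.
event_attributes = {"role": "user", "content": ("x" * 100)[:ATTRIBUTE_LIMIT]}
```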

They allow heterogeneous arrays and one-level-deep maps to be expressed better than span attributes do, but that can't be the only reason to keep span events alive.

It's not. Span events aren't being deprecated. What's the problem?

Let's keep pushing for structured attributes on spans - open-telemetry/opentelemetry-specification#4468 - we're going to discuss it on the spec call on Tue 4/7

Happy to have that too, but like external storage, that's a big long-term project.

@alexmojaki

On the deprecation of span events: https://github.com/open-telemetry/opentelemetry-specification/pull/4430/files#r2044286664

@alexmojaki

@lmolkova I'd like to try advancing the discussion here instead of relying only on the SIG call - that didn't work last week, and I'm not sure I can attend tomorrow. Can you please explain why you don't think span events allow everyone to be happy? Even if open-telemetry/opentelemetry-specification#4430 is merged as-is and all my comments are ignored, it still implies that there will be a standard way to convert log events to span events. That means that even if the GenAI spec stuck to log events as the only option for prompts/completions, there would still be a way for users to configure their SDK to put everything into the span in a way that backends can be expected to understand. And IMO this would work better than JSON span attributes, and maybe even better than complex span attributes.

@lmolkova

lmolkova commented Apr 16, 2025

@alexmojaki

I believe what you're saying in your comment #2010 (comment) is that, from a semconv perspective, you'd prefer the following solution:

  • record each individual message as an independent log-based event
  • backends that want them on spans would transform them to span events via the common processor
  • backends that like log-based events would keep them as such.

If I understand correctly, then span events are the backend's choice - not the contract we document in the semconv. You can create them from attributes or from log-based events if you want to.

The problems I want to solve with this proposal:

  • an easy way to get verbose GenAI telemetry for local/test development that could also work for low-scale/non-compliant/etc. apps in production. Attributes work best; a log-events -> span-events pipeline will be harder with vanilla OTel

  • a fully compliant/high-privacy-requirements/high-scale mode where the chat history is stored separately from telemetry (and this storage can be managed by the app or by a telemetry system), so telemetry only contains a reference to this data. Having N log-based or span events with just references to content stored separately results in a terrible user experience. There is a live demo of this in the Modeling GenAI calls on telemetry doc

So, if we pick log-based events per message as the only option, we make 1) the local/basic experience more difficult and 2) the fully secure prod experience harder (writing/reading N tiny blobs with the messages in the history is harder than working with one larger blob, and you'd need to analyze it as a whole anyway).

I feel it's important to mention that span events won't solve the structured or multimodal binary content problems - GenAI data is complex and deeply nested. We'll need to JSON-stringify binary payloads, tool calls, tool arguments, and other things.

I'd love to jump on a call to discuss the details. We have a GenAI SIG APAC call at 11pm PT today (8am CET tomorrow), I'll be there if you can join or feel free to ping me on slack.

@lmolkova

lmolkova commented Apr 16, 2025

A few more reasons that made me come up with this proposal are:

  1. Tool call definitions or response JSON schema descriptions. How would we report them? They are input parameters, but complex ones. Is that a sufficient reason to report them as events? Would we report an event per tool definition?

    My position is no - the structured nature of something does not make it an event.

  2. Size limitations. Yes, events have higher limits on body size and would work better for text than attributes do. But that still doesn't solve the problem of arbitrarily large and sensitive content. I believe @cartermp mentioned that Honeycomb supports 100 KB (I might be wrong) for the event body. But we're talking MBs-GBs of multimedia content.

    My position is that nobody should store multimedia data in span attributes, span events, log attributes, or body.

@alexmojaki

Update: I had a call with @lmolkova and @trask and I'm satisfied with letting span events go.

Projects
Status: How to model prompts and completions

10 participants