fix(openai-responses): emit reasoning blocks from responses stream #1156
base: main
Conversation
Summary by CodeRabbit
Walkthrough

Parses OpenAI Responses SSE reasoning/thinking events into ThinkingBlock content by accumulating reasoning deltas (including the reasoning summary), emitting thinking blocks on done events or at stream end, and flushing reasoning before emitting usage metadata. Adds a Vitest suite that simulates SSE streams to validate interleaving with text/tool calls and edge cases.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant OpenAI_SSE as "OpenAI SSE"
    participant Parser as "parseResponsesStream"
    participant Consumer as "History / Consumer"
    Client->>OpenAI_SSE: request /responses stream
    OpenAI_SSE->>Parser: stream events (output_text.delta, reasoning_text.delta, reasoning_summary_text.delta, output_item.*, response.completed, ...)
    alt reasoning delta events
        Parser->>Parser: accumulate reasoningText / reasoningSummaryText
    end
    alt output_text or tool call events
        Parser->>Consumer: emit TextBlock / ToolCall
    end
    alt reasoning done or stream end
        Parser->>Consumer: emit ThinkingBlock (flush accumulated reasoning/summary)
    end
    Parser->>Consumer: emit Usage metadata (after flushing reasoning)
```
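To make the walkthrough concrete, here is a minimal TypeScript sketch of that accumulate-and-flush pattern. The event type strings match those listed in the diagram; `ThinkingBlock` and the event shape are simplified stand-ins for the repository's real types, not the actual implementation.

```typescript
interface ThinkingBlock {
  type: 'thinking';
  thought: string;
}

type StreamEvent = { type: string; delta?: string; text?: string };

async function* parseReasoningSketch(
  events: AsyncIterable<StreamEvent>,
): AsyncGenerator<ThinkingBlock> {
  let reasoningText = '';

  for await (const event of events) {
    switch (event.type) {
      case 'response.reasoning_text.delta':
      case 'response.reasoning_summary_text.delta':
        // Accumulate fragments instead of emitting one block per delta.
        reasoningText += event.delta ?? '';
        break;
      case 'response.reasoning_text.done':
      case 'response.reasoning_summary_text.done': {
        // Prefer the API's final text when present, else the accumulator.
        const finalText = event.text ?? reasoningText;
        if (finalText.trim()) {
          yield { type: 'thinking', thought: finalText };
        }
        reasoningText = '';
        break;
      }
      case 'response.completed': {
        // Flush anything still buffered before the stream ends
        // (the Codex 'response.done' variant would take the same path).
        if (reasoningText.trim()) {
          yield { type: 'thinking', thought: reasoningText };
          reasoningText = '';
        }
        break;
      }
    }
  }
}
```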
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 inconclusive
Issue Alignment: Let me analyze this PR for the code review.

Understanding the PR

This PR #1156 fixes issue #922 - "OpenAI Responses streaming drops reasoning/thinking (no ThinkingBlocks)". The PR:

Key Changes

parseResponsesStream.ts:

Test file (new):

Analysis

Let me analyze:
I'll examine the code's robustness and test coverage. The implementation demonstrates careful handling of reasoning content with multiple safeguards. State management looks clean. The approach shows strong attention to edge cases:
Test coverage appears comprehensive, addressing various scenarios like partial streams, interleaved content, and metadata interactions. The test suite validates the new reasoning parsing mechanism across multiple input configurations. Overall, the implementation provides a robust solution for capturing reasoning content during response streaming, with minimal side effects and thorough validation.

File 1:

Let me analyze the PR changes to provide a comprehensive code review. From the diff, I can see:
The changes implement:
Let me review the issue requirements vs the implementation:

Issue #922 Requirements:
Implementation Analysis:
Code Quality Issues I should flag:

I notice potential robustness concerns in the reasoning content handling. The current implementation might struggle with complex reasoning stream scenarios, potentially losing nuanced reasoning content. Specifically, the filtering mechanism for whitespace and empty content could inadvertently discard meaningful reasoning segments. The lack of explicit error handling for parsing or processing events introduces potential reliability risks in the reasoning stream processing pipeline. The type inference for delta events suggests some type-safety considerations that might require additional scrutiny to ensure comprehensive event type coverage and prevent potential runtime type-related issues.

I see a critical state management problem where message order isn't guaranteed. The current implementation yields thinking blocks asynchronously, which could lead to unexpected message sequencing. The reasoning text isn't properly cleared on stream interruption, creating potential data leakage between requests (a sketch of the usual guard appears at the end of this analysis). The whitespace filtering needs refinement to prevent yielding partial or meaningless thinking blocks.

The test suite demonstrates robust coverage across multiple scenarios, including delta accumulation, interleaved content, empty reasoning, and summary text events. These tests validate the core logic of thinking block generation and the handling of complex stream scenarios.

The proposed changes introduce a targeted fix for stream parsing, with careful attention to incremental content processing and state management. The modification allows tracking reasoning content across multiple stream events while providing flexibility for incomplete or interrupted streams. The side-effect analysis reveals nuanced changes to message emission behavior, potentially impacting downstream consumers of the stream parsing system. This requires careful consideration of backward compatibility and existing code assumptions about stream processing.

Key concerns emerge around incremental reasoning block generation, which could subtly alter the expected message sequence for existing integrations. The implementation strategically yields reasoning blocks only when complete or upon stream finalization, maintaining a balance between real-time processing and comprehensive result capture. Performance considerations suggest minimal overhead, with the accumulation approach enabling efficient chunk processing. The approach preserves the core functionality while introducing more granular streaming capabilities.

Let me complete the todo list and write the final review verdict.
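Regarding the state-leak concern flagged above, the usual guard is per-invocation scoping: keep the accumulator local to each parse call so nothing survives between streams. A minimal illustrative sketch, not the PR's code:

```typescript
type ReasoningEvent = { type: string; delta?: string };

// Because reasoningText is local to each invocation, an interrupted stream
// cannot leak buffered reasoning into a subsequent request.
async function* parseWithLocalState(
  events: AsyncIterable<ReasoningEvent>,
): AsyncGenerator<string> {
  let reasoningText = ''; // per-call state; discarded on early throw/return

  for await (const event of events) {
    if (event.type === 'response.reasoning_text.delta') {
      reasoningText += event.delta ?? '';
    }
  }
  // Flush once at the end of a complete stream.
  if (reasoningText.trim()) {
    yield reasoningText;
  }
}
```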
Issue #922 Problem:

Let me analyze PR #1156, which fixes issue #922 about OpenAI Responses streaming dropping reasoning/thinking blocks. Based on the diff provided:

Let me evaluate:

Issue Alignment: ✅ The implementation adds proper buffering and emission of ThinkingBlocks.

Code Quality Analysis: The code strategically buffers reasoning deltas, allowing accumulation of fragmented reasoning content before emitting a complete ThinkingBlock. Error handling appears thoughtful, with explicit conditions to prevent empty or whitespace-only reasoning blocks from being processed. The implementation maintains clean separation of reasoning and reasoning-summary events, supporting flexible content generation scenarios.

The developer has implemented defensive programming techniques, including careful state management and robust filtering of potentially meaningless content. The code demonstrates careful attention to edge cases in streaming data processing. Type definitions and interfaces suggest a structured approach to handling complex streaming responses, with explicit tracking of sequence numbers and content states. The implementation carefully manages partial data transmission through incremental event processing, ensuring comprehensive content capture across multiple delta events.

Test coverage validates critical streaming scenarios, including whitespace handling and multi-delta block accumulation. However, integration testing against the real OpenAI Responses API stream remains a potential improvement area. Performance considerations indicate efficient string management, with reasonable memory usage anticipated for typical reasoning content generation, and Unicode handling in the stream decoding appears sound.

The implementation demonstrates solid defensive programming techniques, proactively addressing potential edge cases in streaming response parsing. The approach appears comprehensive, with minimal performance overhead expected from incremental string concatenation.

LLxprt PR Review – PR #1156

Issue Alignment
Side Effects
Code Quality
Tests & Coverage
Verdict
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/core/src/providers/openai/parseResponsesStream.ts`:
- Line 58: The code currently only handles "response.reasoning_text.delta" and
"response.reasoning_text.done" but misses
"response.reasoning_summary_text.delta"; update the event handling inside
parseResponsesStream (where reasoningText is declared and where events 95-117
are processed) to treat "response.reasoning_summary_text.delta" the same as
"response.reasoning_text.delta" by appending its payload to the existing
reasoningText buffer, and ensure any corresponding "done" handling merges or
finalizes reasoningText as done; reference the reasoningText variable and the
response event switch/if blocks to add the new branch for
"response.reasoning_summary_text.delta" so all reasoning variants are captured.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts
packages/core/src/providers/openai/parseResponsesStream.ts
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: e2720pjk
Repo: vybestack/llxprt-code PR: 583
File: packages/core/src/providers/openai/OpenAIProvider.ts:935-959
Timestamp: 2025-11-16T22:51:26.374Z
Learning: In the llxprt-code codebase (packages/core/src/providers/openai/OpenAIProvider.ts), tools like `run_shell_command` use internal streaming only for real-time UI updates during execution, but each tool execution produces exactly ONE final `ToolResponseBlock` containing the full result. The streaming chunks are never sent to the LLM and are not converted into multiple tool messages. The OpenAI Chat Completions API requires that each tool call (tool_call_id) corresponds to exactly one message with role 'tool', so duplicate tool response detection that removes subsequent tool messages with the same tool_call_id is correct and necessary for API compliance.
Learnt from: acoliver
Repo: vybestack/llxprt-code PR: 0
File: :0-0
Timestamp: 2025-12-18T14:06:22.557Z
Learning: OpenAIResponsesProvider (packages/core/src/providers/openai-responses/OpenAIResponsesProvider.ts) currently bypasses the ephemeral truncation system by using direct `JSON.stringify(toolResponseBlock.result)` and needs to be updated to support ephemeral settings like the other providers.
📚 Learning: 2025-12-18T14:06:22.557Z
Learnt from: acoliver
Repo: vybestack/llxprt-code PR: 0
File: :0-0
Timestamp: 2025-12-18T14:06:22.557Z
Learning: OpenAIResponsesProvider (packages/core/src/providers/openai-responses/OpenAIResponsesProvider.ts) currently bypasses the ephemeral truncation system by using direct `JSON.stringify(toolResponseBlock.result)` and needs to be updated to support ephemeral settings like the other providers.
Applied to files:
packages/core/src/providers/openai/parseResponsesStream.ts
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts
📚 Learning: 2025-11-16T22:51:26.374Z
Learnt from: e2720pjk
Repo: vybestack/llxprt-code PR: 583
File: packages/core/src/providers/openai/OpenAIProvider.ts:935-959
Timestamp: 2025-11-16T22:51:26.374Z
Learning: In the llxprt-code codebase (packages/core/src/providers/openai/OpenAIProvider.ts), tools like `run_shell_command` use internal streaming only for real-time UI updates during execution, but each tool execution produces exactly ONE final `ToolResponseBlock` containing the full result. The streaming chunks are never sent to the LLM and are not converted into multiple tool messages. The OpenAI Chat Completions API requires that each tool call (tool_call_id) corresponds to exactly one message with role 'tool', so duplicate tool response detection that removes subsequent tool messages with the same tool_call_id is correct and necessary for API compliance.
Applied to files:
packages/core/src/providers/openai/parseResponsesStream.ts
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts
🧬 Code graph analysis (1)
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts (1)
packages/core/src/providers/openai/parseResponsesStream.ts (1)
parseResponsesStream(51-237)
⏰ Context from checks skipped due to timeout of 270000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test (ubuntu-latest, 24.x)
- GitHub Check: Test (macos-latest, 24.x)
- GitHub Check: E2E Test (Linux) - sandbox:none
- GitHub Check: E2E Test (Linux) - sandbox:docker
- GitHub Check: E2E Test (macOS)
- GitHub Check: Slow E2E - Win
🔇 Additional comments (4)
packages/core/src/providers/openai/parseResponsesStream.ts (2)
2-4: Doc update matches new behavior. Clear summary of the added reasoning/thinking handling.
187-202: Reasoning flush before usage looks correct. Emitting the thinking block prior to usage metadata aligns with the stated ordering requirement.
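As an illustration of the ordering this approves, a simplified sketch — the message and usage shapes here are hypothetical simplifications, not the repository's types:

```typescript
type ThinkingBlock = { type: 'thinking'; thought: string };
type UsageInfo = { promptTokens: number; completionTokens: number };
type Emitted = ThinkingBlock | { type: 'usage'; usage: UsageInfo };

// On stream completion: flush buffered reasoning first, then usage metadata,
// so consumers always see the ThinkingBlock before token counts.
function* flushOnCompleted(
  reasoningText: string,
  usage: UsageInfo | undefined,
): Generator<Emitted> {
  if (reasoningText.trim()) {
    yield { type: 'thinking', thought: reasoningText };
  }
  if (usage) {
    yield { type: 'usage', usage };
  }
}
```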
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts (2)
4-17: SSE stream helper is clean and deterministic. The helper makes chunked SSE tests readable and reliable.
20-196: Test coverage is thorough. Covers reasoning-only, interleaving with text/tool calls, whitespace suppression, accumulation, usage metadata, and ordering.
Code Coverage Summary

CLI Package - Full Text Report
Core Package - Full Text Report

For detailed HTML reports, please see the 'coverage-reports-24.x-ubuntu-latest' artifact from the main CI run.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/core/src/providers/openai/parseResponsesStream.ts`:
- Around line 96-120: The code merges both response.reasoning_text.delta and
response.reasoning_summary_text.delta into a single buffer (reasoningText)
causing raw reasoning and summarized reasoning to be combined; update the
handler in parseResponsesStream (the switch cases for
'response.reasoning_text.delta', 'response.reasoning_summary_text.delta',
'response.reasoning_text.done', and 'response.reasoning_summary_text.done') to
maintain two separate accumulators (e.g., reasoningText and
reasoningSummaryText) and on their respective *.done events yield distinct
thinking blocks using the appropriate buffer (or event.text fallback) before
clearing each buffer.
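A sketch of the two-accumulator shape this comment asks for, under the assumption that each `*.done` event should flush only its own buffer. Names follow the comment; the surrounding parser is omitted:

```typescript
type ReasoningEvent = { type: string; delta?: string; text?: string };

// Keep raw reasoning and summarized reasoning in separate buffers and flush
// each only on its own done event, returning finished text when available.
class ReasoningAccumulator {
  private reasoningText = '';
  private reasoningSummaryText = '';

  handle(event: ReasoningEvent): string | undefined {
    switch (event.type) {
      case 'response.reasoning_text.delta':
        this.reasoningText += event.delta ?? '';
        return undefined;
      case 'response.reasoning_summary_text.delta':
        this.reasoningSummaryText += event.delta ?? '';
        return undefined;
      case 'response.reasoning_text.done': {
        const text = event.text ?? this.reasoningText; // event.text fallback
        this.reasoningText = '';
        return text.trim() ? text : undefined;
      }
      case 'response.reasoning_summary_text.done': {
        const text = event.text ?? this.reasoningSummaryText;
        this.reasoningSummaryText = '';
        return text.trim() ? text : undefined;
      }
      default:
        return undefined;
    }
  }
}
```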
🧹 Nitpick comments (1)
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts (1)
20-219: Good test coverage overall. The test suite comprehensively covers the acceptance criteria from issue #922:
- Reasoning-only streams ✓
- Interleaved reasoning + text + tool calls ✓
- Empty/whitespace reasoning chunks ✓
- Reasoning emitted before usage metadata ✓
Consider adding a test for response.done (the Codex variant) to explicitly verify parity with response.completed, though they share the same code path.
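A sketch of what that parity test could look like, assuming (as this suite appears to) that the parser consumes a `Response` whose body is a byte stream of SSE lines; the chunk contents, message shape, and assertion are illustrative, not copied from the suite:

```typescript
import { expect, it } from 'vitest';
import { parseResponsesStream } from './parseResponsesStream.js';

// Replay SSE chunks through a Response body, one chunk per pull.
function sseResponse(chunks: string[]): Response {
  const encoder = new TextEncoder();
  let i = 0;
  const body = new ReadableStream<Uint8Array>({
    pull(controller) {
      if (i < chunks.length) controller.enqueue(encoder.encode(chunks[i++]));
      else controller.close();
    },
  });
  return new Response(body);
}

it('flushes reasoning on response.done (Codex variant) like response.completed', async () => {
  const response = sseResponse([
    'data: {"type":"response.reasoning_text.delta","delta":"thinking..."}\n\n',
    'data: {"type":"response.done"}\n\n',
  ]);

  let sawThinking = false;
  for await (const message of parseResponsesStream(response)) {
    // Shape assumption: emitted messages carry content blocks with a `type`.
    const blocks =
      (message as { blocks?: Array<{ type: string }> }).blocks ?? [];
    if (blocks.some((b) => b.type === 'thinking')) sawThinking = true;
  }
  expect(sawThinking).toBe(true);
});
```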
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts
packages/core/src/providers/openai/parseResponsesStream.ts
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: e2720pjk
Repo: vybestack/llxprt-code PR: 583
File: packages/core/src/providers/openai/OpenAIProvider.ts:935-959
Timestamp: 2025-11-16T22:51:26.374Z
Learning: In the llxprt-code codebase (packages/core/src/providers/openai/OpenAIProvider.ts), tools like `run_shell_command` use internal streaming only for real-time UI updates during execution, but each tool execution produces exactly ONE final `ToolResponseBlock` containing the full result. The streaming chunks are never sent to the LLM and are not converted into multiple tool messages. The OpenAI Chat Completions API requires that each tool call (tool_call_id) corresponds to exactly one message with role 'tool', so duplicate tool response detection that removes subsequent tool messages with the same tool_call_id is correct and necessary for API compliance.
Learnt from: acoliver
Repo: vybestack/llxprt-code PR: 0
File: :0-0
Timestamp: 2025-12-18T14:06:22.557Z
Learning: OpenAIResponsesProvider (packages/core/src/providers/openai-responses/OpenAIResponsesProvider.ts) currently bypasses the ephemeral truncation system by using direct `JSON.stringify(toolResponseBlock.result)` and needs to be updated to support ephemeral settings like the other providers.
📚 Learning: 2025-12-18T14:06:22.557Z
Learnt from: acoliver
Repo: vybestack/llxprt-code PR: 0
File: :0-0
Timestamp: 2025-12-18T14:06:22.557Z
Learning: OpenAIResponsesProvider (packages/core/src/providers/openai-responses/OpenAIResponsesProvider.ts) currently bypasses the ephemeral truncation system by using direct `JSON.stringify(toolResponseBlock.result)` and needs to be updated to support ephemeral settings like the other providers.
Applied to files:
packages/core/src/providers/openai/parseResponsesStream.ts
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts
📚 Learning: 2025-11-16T22:51:26.374Z
Learnt from: e2720pjk
Repo: vybestack/llxprt-code PR: 583
File: packages/core/src/providers/openai/OpenAIProvider.ts:935-959
Timestamp: 2025-11-16T22:51:26.374Z
Learning: In the llxprt-code codebase (packages/core/src/providers/openai/OpenAIProvider.ts), tools like `run_shell_command` use internal streaming only for real-time UI updates during execution, but each tool execution produces exactly ONE final `ToolResponseBlock` containing the full result. The streaming chunks are never sent to the LLM and are not converted into multiple tool messages. The OpenAI Chat Completions API requires that each tool call (tool_call_id) corresponds to exactly one message with role 'tool', so duplicate tool response detection that removes subsequent tool messages with the same tool_call_id is correct and necessary for API compliance.
Applied to files:
packages/core/src/providers/openai/parseResponsesStream.ts
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts
🧬 Code graph analysis (1)
packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts (1)
packages/core/src/providers/openai/parseResponsesStream.ts (1)
parseResponsesStream(52-240)
⏰ Context from checks skipped due to timeout of 270000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Test (macos-latest, 24.x)
- GitHub Check: Test (windows-latest, 24.x)
- GitHub Check: Test (ubuntu-latest, 24.x)
- GitHub Check: E2E Test (Linux) - sandbox:docker
- GitHub Check: E2E Test (Linux) - sandbox:none
- GitHub Check: Slow E2E - Win
- GitHub Check: E2E Test (macOS)
🔇 Additional comments (10)
packages/core/src/providers/openai/parseResponsesStream.ts (2)
104-120: LGTM! The done event handling correctly:

- Uses `event.text` as an override when provided (the final complete text from the API), falling back to accumulated `reasoningText`
- Applies a `.trim()` check to avoid emitting empty/whitespace-only thinking blocks
- Properly resets the accumulator after yielding
190-205: LGTM! Good defensive handling that flushes any pending reasoning content before emitting usage metadata. This correctly handles edge cases where reasoning delta events arrive but no explicit `.done` event follows before stream completion.

packages/core/src/providers/openai/parseResponsesStream.reasoning.test.ts (8)
4-18: LGTM! Clean and reusable test helper for simulating SSE streams. The pull-based approach correctly mimics how chunks would arrive from a real network stream.
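To illustrate why a pull-based helper matters, here is a sketch (hypothetical name and split points; the suite's actual helper may differ) that splits one SSE event across two chunks — each consumer read receives exactly one fragment, exercising the parser's internal buffering:

```typescript
// One logical SSE event deliberately split mid-JSON across two pulls;
// a correct parser must buffer until the blank-line terminator arrives.
const chunks = [
  'data: {"type":"response.reasoning_',
  'text.delta","delta":"partial"}\n\n',
];

function makeSseStream(parts: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let index = 0;
  return new ReadableStream<Uint8Array>({
    pull(controller) {
      if (index < parts.length) {
        controller.enqueue(encoder.encode(parts[index]));
        index += 1;
      } else {
        controller.close();
      }
    },
  });
}

// Each reader.read() resolves with exactly one of the two fragments.
const reader = makeSseStream(chunks).getReader();
```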
21-47: LGTM! Comprehensive test for the basic reasoning flow. Good assertions on the accumulated thought content and `sourceField`.
49-88: LGTM! Good integration test covering the interleaving scenario with reasoning, text output, and tool calls. The assertions verify all three content types are correctly parsed.
90-112: LGTM! Good edge case coverage for empty/whitespace-only reasoning. The JSON-escaped `\\n\\t` correctly becomes actual whitespace characters after parsing.
114-141: LGTM! Good test verifying that multiple reasoning deltas are correctly accumulated into a single thinking block rather than emitting multiple blocks.
143-172: LGTM! Good integration test covering the full flow with reasoning, text, and usage metadata. Also verifies the `cachedTokens` default behavior.
174-196: LGTM! Good test coverage for the `reasoning_summary_text` event type, including verification that the `text` field from the done event is correctly used.
198-219: LGTM! Critical test verifying the flush behavior when reasoning deltas arrive without an explicit done event before stream completion. This ensures the PR objective of emitting reasoning before usage metadata is met.
…xt per CodeRabbit
## Summary

- parse Responses API reasoning SSE events into ThinkingBlocks
- buffer reasoning deltas and emit before usage metadata
- add unit coverage for reasoning-only, interleaved, and edge cases

## Testing

- npm run format
- npm run lint
- npm run typecheck
- npm run test
- npm run build
- node scripts/start.js --profile-load synthetic --prompt "write me a haiku"

fixes #922