
Coalesce text diffs in streaming requests. #4923


Open · pathorn wants to merge 1 commit into main from coalesce-streaming-diff

Conversation

@pathorn (Contributor) commented Jun 5, 2025

fix/hack: Coalesce text diffs in streaming requests.

Description

Sometimes, streaming output from openai_server emits the same message twice and skips another message.
For example, here is a packet capture from a bad request:

138\r\ndata: {"id":"cmpl-ef21ea5a4fe640f1bea24729b7a0b07d","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":"1. ","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":160}}\n\n\r\n",
138\r\ndata: {"id":"cmpl-6b5546146cdf41f7a0b64b2de0309288","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":" Clarify","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":161}}\n\n\r\n",
...
"139\r\ndata: {"id":"cmpl-1ff642a3baa14acf980a8632e744f46c","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-5fc9048a4a9f424b9518c976083a369f","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-5782671ed8a54e4383c0ddbac6a82b68","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n139\r\ndata: {"id":"cmpl-cd64d34f1091405797f53f370064e63b","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}

In particular, completion_tokens is calculated from the current value of output.length. Because output is the same object on every iteration, the reported count is always the latest output length, which is why it always reads 172 after the lag spike in the example above.
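
A minimal, self-contained illustration of that failure mode, using hypothetical names rather than the server's actual classes: anything computed lazily from the shared output object at serialization time reflects the latest state, not the state at the step that produced the diff.

```python
class Output:
    def __init__(self):
        self.token_ids = []

output = Output()
pending = []  # one queued response chunk per engine step


def on_new_tokens(new_ids):
    output.token_ids.extend(new_ids)
    # Only a reference to the shared object is queued; completion_tokens is
    # computed later, when the chunk is serialized for the stream.
    pending.append(output)


def serialize(out):
    return {"completion_tokens": len(out.token_ids)}


# Two engine steps land before the server drains the queue (the "lag spike"):
on_new_tokens([101])
on_new_tokens([102])
print([serialize(o) for o in pending])
# [{'completion_tokens': 2}, {'completion_tokens': 2}]  <- both read the latest length
```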

As for text_diff, it uses an internal property of the GenerationResultBase that stores the last position. My suspicion is that this last position advances even when the generator is not being consumed, which is why there is data loss and duplicate tokens. To solve this, I keep track of the last text position in the params object, which is local to the request, and pass it into the GenerationResultBase getter. On its own this would produce one packet containing the entire text diff followed by a run of empty updates, so I added a hacky check to drop the empty updates.
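
A rough sketch of the per-request tracking, again with hypothetical names (the actual change passes the position into the GenerationResultBase getter rather than slicing the text here):

```python
def stream_diffs(outputs, params):
    """Yield only the text generated since this request last looked, and
    drop updates that become empty after coalescing."""
    for output in outputs:
        # Track the last-seen position on the per-request params object instead
        # of on the shared generation result, so a skipped poll cannot lose text.
        last_pos = getattr(params, "last_text_pos", 0)
        text_diff = output.text[last_pos:]
        params.last_text_pos = len(output.text)

        if not text_diff:
            # Coalescing means an earlier packet already carried this text;
            # skip the now-empty update instead of emitting a blank chunk.
            continue
        yield text_diff
```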

This change is admittedly a hack. It doesn't address the root cause (state updates being interleaved with streaming generator polling), but it does prevent the missed packets from producing corrupted output in the frontend. I would prefer a better change, but this is what I was able to come up with.

Test Coverage

1. Run DeepSeek-R1-FP4.
2. Run several extremely large prefills while performing streaming requests, to create high load. If you time it right, without this patch the server will skip one packet and emit duplicates that carry the same completion_tokens.

(Also, for some reason, I was unable to reproduce the issue with curl, only with Python aiohttp. I don't know why.)
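
A minimal aiohttp client along these lines can flag the duplicate packets automatically (the endpoint URL, model name, and payload fields are assumptions; adjust them for your deployment):

```python
import asyncio
import json

import aiohttp


async def main():
    url = "http://localhost:8000/v1/completions"  # assumed trtllm-serve endpoint
    payload = {
        "model": "nvidia/DeepSeek-R1-FP4",
        "prompt": "Hello",
        "max_tokens": 256,
        "stream": True,
    }
    seen = set()
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            async for raw in resp.content:
                line = raw.decode().strip()
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                chunk = json.loads(line[len("data: "):])
                n = chunk["usage"]["completion_tokens"]
                if n in seen:
                    print("duplicate completion_tokens:", n)  # symptom of the bug
                seen.add(n)


asyncio.run(main())
```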

@poweiw added labels on Jun 5, 2025: triaged (Issue has been triaged by maintainers), OpenAI API (trtllm-serve's OpenAI-compatible API: endpoint behavior, req/resp formats, feature parity).
@poweiw added label on Jun 5, 2025: Community want to contribute (PRs initiated from Community).
@pathorn force-pushed the coalesce-streaming-diff branch from 940b2f5 to 057bf0f on June 5, 2025 22:35
@pathorn force-pushed the coalesce-streaming-diff branch from 057bf0f to 3bc85f8 on June 5, 2025 22:36
@LinPoly (Collaborator) commented Jun 11, 2025

@pathorn Thanks for your contribution. Did you see this token-repetition issue with the chat API as well? I think we can try to figure out the root cause within a time window; if we cannot fix it before the next release, we can merge this PR as a WAR. Thanks again for reporting the issue and providing a solution.

@Shang-Pin commented
@LinPoly We observed this issue with both the chat and completion APIs. We suspect the output object is being updated even while the post_processor is running, because we are seeing create_logprobs_completion crash when the lengths do not match.

@LinPoly (Collaborator) commented Jun 25, 2025

@Shang-Pin Do you happen to have a reproduction script for the chat or completion endpoint? Thanks!
