Coalesce text diffs in streaming requests. #4923
fix/hack: Coalesce text diffs in streaming requests.
Description
Sometimes, streaming output from openai_server will produce the same message twice and skip another message.
For example, here is a packet capture from one such bad request:
138\r\ndata: {"id":"cmpl-ef21ea5a4fe640f1bea24729b7a0b07d","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":"1. ","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":160}}\n\n\r\n
138\r\ndata: {"id":"cmpl-6b5546146cdf41f7a0b64b2de0309288","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":" Clarify","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":161}}\n\n\r\n
...
139\r\ndata: {"id":"cmpl-1ff642a3baa14acf980a8632e744f46c","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n
139\r\ndata: {"id":"cmpl-5fc9048a4a9f424b9518c976083a369f","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n
139\r\ndata: {"id":"cmpl-5782671ed8a54e4383c0ddbac6a82b68","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n
139\r\ndata: {"id":"cmpl-cd64d34f1091405797f53f370064e63b","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}
In particular, completion_tokens is calculated from the current value of output.length, and since output is the same object at each iteration, this is always the current output length, which is why it always reads 172 after the lag spike in the example above.

As for text_diff, it uses an internal property of the GenerationResultBase that stores the last position. My suspicion is that this last position advances even when the generator is not being consumed, which is why there is data loss and duplicate tokens. To solve this, I keep track of the last text position in the params object, which is local to the request, and pass it into the GenerationResultBase getter. That alone would produce one packet carrying the entire coalesced text diff followed by a run of empty updates, so I also added a hacky check to drop the empty updates. A sketch of the idea is shown below.

This change is a bit of a hack. It doesn't address the root cause of state updates being interleaved with streaming generator polling, but it mostly prevents these missed packets from producing corrupted output in the frontend. I would prefer a better fix, but this is what I was able to come up with.
Test Coverage
Run DeepSeek-R1-FP4
Run several extremely large prefills while performing streaming requests, to create high load. If you time it right, without this patch the stream will skip one packet and emit duplicates that each report the same completion_tokens.
(Also, for some reason I could only reproduce the issue with Python aiohttp, not with curl; I don't know why. A rough client sketch follows below.)
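Something along these lines reproduces the load pattern; the endpoint, port, prompt, and request sizes are placeholders rather than the exact setup used for this PR:

```python
import asyncio
import aiohttp


async def stream_completion(session: aiohttp.ClientSession, prompt: str) -> None:
    # Placeholder endpoint/port; adjust to wherever openai_server is listening.
    url = "http://localhost:8000/v1/completions"
    payload = {
        "model": "nvidia/DeepSeek-R1-FP4",
        "prompt": prompt,
        "max_tokens": 256,
        "stream": True,
    }
    async with session.post(url, json=payload) as resp:
        async for raw_line in resp.content:
            line = raw_line.decode("utf-8", errors="replace").strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            # Watch for consecutive chunks with identical text and completion_tokens.
            print(line)


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Several concurrent streams with large prompts to create load.
        await asyncio.gather(
            *(stream_completion(session, "long prompt " * 2000) for _ in range(4))
        )


asyncio.run(main())
```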