Coalesce text diffs in streaming requests. #4923
fix/hack: Coalesce text diffs in streaming requests.
Description
Sometimes, streaming output from openai_server will produce the same message twice and skip another message.
For example, here is a packet capture from one such bad request:
138\r\ndata: {"id":"cmpl-ef21ea5a4fe640f1bea24729b7a0b07d","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":"1. ","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":160}}\n\n\r\n
138\r\ndata: {"id":"cmpl-6b5546146cdf41f7a0b64b2de0309288","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":" Clarify","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":305,"completion_tokens":161}}\n\n\r\n
...
139\r\ndata: {"id":"cmpl-1ff642a3baa14acf980a8632e744f46c","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n
139\r\ndata: {"id":"cmpl-5fc9048a4a9f424b9518c976083a369f","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n
139\r\ndata: {"id":"cmpl-5782671ed8a54e4383c0ddbac6a82b68","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}\n\n\r\n
139\r\ndata: {"id":"cmpl-cd64d34f1091405797f53f370064e63b","object":"text_completion","created":1748849646,"model":"nvidia/DeepSeek-R1-FP4","choices":[{"index":0,"text":", there","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":144,"total_tokens":316,"completion_tokens":172}}
In particular, completion_tokens is calculated from the current value of output.length, and since output is the same object at each iteration, this is always the current output length, which is why it always reads 172 after the lag spike in the example above.

As for text_diff, it uses an internal property of the GenerationResultBase that stores the last position. My suspicion is that this last position advances even when the generator is not being consumed, which is why there is data loss and duplicate tokens. To solve this, I keep track of the last text position in the params object, which is local to the request, and pass it into the GenerationResultBase getter. That alone would produce one packet carrying the entire coalesced text diff followed by a run of empty updates, so I also added a hacky check to drop the empty updates. A sketch of the idea is shown below.

This change is a bit of a hack. It doesn't address the root cause of state updates being interleaved with streaming generator polling, but it mostly prevents these missed packets from producing corrupted output in the frontend. I would prefer a better fix, but this is what I was able to come up with.
Test Coverage
Run DeepSeek-R1-FP4
Run several extremely large prefills while performing streaming requests, to create high load. If you time it right, without this patch the stream will skip one packet and emit duplicates that each report the same completion_tokens.
(Also, for some reason I could only reproduce the issue with Python aiohttp, not with curl; I don't know why. A rough client sketch follows below.)
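Something along these lines reproduces the load pattern; the endpoint, port, prompt, and request sizes are placeholders rather than the exact setup used for this PR:

```python
import asyncio
import aiohttp


async def stream_completion(session: aiohttp.ClientSession, prompt: str) -> None:
    # Placeholder endpoint/port; adjust to wherever openai_server is listening.
    url = "http://localhost:8000/v1/completions"
    payload = {
        "model": "nvidia/DeepSeek-R1-FP4",
        "prompt": prompt,
        "max_tokens": 256,
        "stream": True,
    }
    async with session.post(url, json=payload) as resp:
        async for raw_line in resp.content:
            line = raw_line.decode("utf-8", errors="replace").strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            # Watch for consecutive chunks with identical text and completion_tokens.
            print(line)


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Several concurrent streams with large prompts to create load.
        await asyncio.gather(
            *(stream_completion(session, "long prompt " * 2000) for _ in range(4))
        )


asyncio.run(main())
```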