
[Bug] Remove compulsory include_usage when stream=true in gateway #757

Closed

Conversation

gau-nernst (Author)

Pull Request Description

When stream=true, the OpenAI API does not require stream_options to be specified. This request works:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
  "model": "gpt-4o",
  "stream": true,
  "messages": [{"role": "user", "content": "help me write a random generator in python"}]
}'

However, when stream=true, the AIBrix gateway currently requires stream_options={"include_usage":true} to be present. This PR simply removes that check.
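
For comparison, a minimal sketch of the same request via the OpenAI Python client, but with the stream_options field that the current gateway check insists on. The base_url and api_key are placeholders for a local gateway deployment, not values from this PR:

# Same streaming request, plus the stream_options field the gateway currently requires.
# base_url and api_key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="sk-placeholder")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "help me write a random generator in python"}],
    stream=True,
    stream_options={"include_usage": True},  # what the gateway currently mandates
)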

Note from @Jeffwan

Some features, such as the heterogeneous GPU feature, rely on usage being reported. We probably need doc changes on the feature page stating that include_usage is required.

Related Issues

Resolves: #[Insert issue number(s)]


Jeffwan requested a review from varungup90 — February 27, 2025 06:01
Jeffwan (Collaborator) commented Feb 27, 2025

/assign @varungup90

gau-nernst force-pushed the gateway_stream_include_usage branch from 8413b57 to 4db03cb — February 27, 2025 06:07
varungup90 (Collaborator) commented

If the user has enabled rpm/tpm validation, then we need include_usage. Making include_usage optional will require a check on whether the user has enabled the rpm/tpm limit check.

Jeffwan (Collaborator) commented Feb 27, 2025

For features that rely on usage statistics, can we ask users in the documentation to enable it explicitly? The heterogeneous feature needs it as well. By default, the behavior should be clean.

gau-nernst (Author) commented

@varungup90 @Jeffwan Let me know how you want me to add the checks and how to test them. I'm eager to contribute, but if it's too complicated, I can close this PR and you can open your own.

Another question: when include_usage is required, is it possible to send include_usage=true to the inference pods, but have the gateway post-process the response so it complies with include_usage=false if the request asked for that? What I'm seeing is that if AIBrix users enable features that require include_usage (rpm/tpm validation and heterogeneous GPUs), the server is not exactly OpenAI-compatible.
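
A rough sketch of that post-processing idea, in Python for illustration only. The actual gateway is written in Go, and filter_sse_line / record_usage_for_limits are hypothetical names, not AIBrix code:

import json

def record_usage_for_limits(usage):
    # Hypothetical hook: where the gateway would account tokens for rpm/tpm limits.
    pass

def filter_sse_line(line, client_wants_usage):
    """Return the SSE data line to forward to the client, or None to drop it."""
    if not line.startswith("data: ") or line.strip() == "data: [DONE]":
        return line
    chunk = json.loads(line[len("data: "):])
    if client_wants_usage:
        return line  # the client asked for usage, forward everything unchanged
    if chunk.get("usage") is not None and not chunk.get("choices"):
        # Extra final chunk that only carries usage: keep it for accounting,
        # but hide it from a client that did not set include_usage.
        record_usage_for_limits(chunk["usage"])
        return None
    # Other chunks carry "usage": null when include_usage was forced upstream;
    # drop the field so the response looks like a plain stream=true reply.
    chunk.pop("usage", None)
    return "data: " + json.dumps(chunk)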

Jeffwan (Collaborator) commented Feb 28, 2025

@varungup90 could you give more suggestions on the tpm check? Let's get @gau-nernst onboard.

varungup90 (Collaborator) commented

  1. I want to understand where the blocker is if we mandate including stream usage. For clients that do not want to consume the usage report, it is still fine to include it in the request.

  2. For the implementation, there are two alternatives. The first is to add another header, similar to "routing-strategy", which I feel would make the input request bulky or complicated. The second is to mandate the stream_options usage check only if the user has enabled rpm/tpm validation or request tracing (see the sketch after this list).

  3. If we decide to make include_usage optional, the major changes will be in HandleResponseBody to adjust the rpm/tpm check and the heterogeneous tracing feature. Given the current lack of e2e tests, the implementation needs to be done carefully.
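
To illustrate the second option, a minimal sketch in Python pseudocode. This is not the actual Go gateway code; validate_stream_options and its flags are hypothetical names:

def validate_stream_options(body, rpm_tpm_enabled, tracing_enabled):
    """Return an error message if include_usage must be set, else None."""
    if not body.get("stream"):
        return None  # non-streaming requests are unaffected
    needs_usage = rpm_tpm_enabled or tracing_enabled
    include_usage = (body.get("stream_options") or {}).get("include_usage", False)
    if needs_usage and not include_usage:
        return 'stream_options={"include_usage": true} is required when rpm/tpm limits or request tracing are enabled'
    return None  # include_usage stays optional otherwise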

gau-nernst (Author) commented

I want to understand where the blocker is if we mandate including stream usage.

I think the biggest issue is that it's not 100% OpenAI-compatible. Client code that does not expect include_usage=true might not work: from what I understand, there will be an extra last chunk with empty choices and non-null usage, and if client code does not handle this, it may break. I actually discovered this issue when trying to use SGLang's sglang.bench_serving to benchmark AIBrix. Of course I could modify SGLang's code, but the issue with general client code remains, and sometimes it's not possible to modify the client at all.

From the OpenAI docs (https://platform.openai.com/docs/api-reference/chat/create):

If set, an additional chunk will be streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value.
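
Client code that guards against an empty choices list handles this extra chunk cleanly. A minimal sketch with the OpenAI Python client; base_url and api_key are placeholders for a gateway deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="sk-placeholder")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:  # normal content chunks
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage is not None:  # extra final chunk: empty choices, non-null usage
        print(f"\nprompt_tokens={chunk.usage.prompt_tokens}, completion_tokens={chunk.usage.completion_tokens}")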

Perhaps another option is to always send include_usage=true to the inference pods (vLLM), but have the gateway skip the last usage-statistics chunk if the client did not request it?

varungup90 (Collaborator) commented Mar 4, 2025

I have started a PR to make include_usage an optional param by default. If the user's TPM limit is configured, then include_usage is required.

The heterogeneous use case is not supported with streaming right now. Once that feature is added, include_usage should be enabled as well.

gau-nernst closed this Mar 4, 2025
gau-nernst deleted the gateway_stream_include_usage branch March 4, 2025 00:33