
Conversation

@yannicks1 (Collaborator) commented Oct 10, 2025

Set new_tokens to the maximum possible value given the Spyre constraints in platform.get_max_output_tokens() for continuous batching: max_model_len - padded_prompt_len.
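For illustration, a minimal sketch of that computation, assuming prompts are padded up to a 64-token block boundary (the granularity suggested by the test below); the helper name and constants are hypothetical, not the actual vllm-spyre code:

import math

BLOCK_SIZE = 64  # assumed Spyre prompt padding granularity

def max_output_tokens(prompt_len: int, max_model_len: int) -> int:
    # The prompt is padded up to the next block boundary, so the budget
    # left for newly generated tokens is measured from the padded length.
    padded_prompt_len = math.ceil(prompt_len / BLOCK_SIZE) * BLOCK_SIZE
    return max_model_len - padded_prompt_len

# Example: with a 2048-token context and a 70-token prompt (padded to 128),
# max_output_tokens(70, 2048) returns 1920.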


👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@joerunde (Collaborator) left a comment


Nice!

Can you confirm that this allows you to send a /v1/chat/completions request without setting max_tokens?
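For reference, a sketch of such a request using the OpenAI Python client with max_tokens omitted; the server URL and model name are assumptions taken from the test setup described below:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# max_tokens is intentionally omitted; with this change the server should
# fill it in from platform.get_max_output_tokens().
resp = client.chat.completions.create(
    model="ibm-ai-platform/micro-g3.3-8b-instruct-1b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)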

@tjohnson31415 (Collaborator) left a comment


I ran a few tests with this change. I used chat_template to have full control over the input tokens with the chat endpoint:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d'{
        "messages":[{"role":"user","content":""}],
        "model":"ibm-ai-platform/micro-g3.3-8b-instruct-1b",
        "chat_template": "{{ \"A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A \" }}"
    }'

Before this change, the prompt had to be 63 or 64 tokens to be accepted (it is interesting that 63 also worked 🤔)

After this change, that restriction no longer exists and I can freely send requests without worrying about the page boundaries 🚀

A potential source of confusion: this silently overrides the max_tokens value set on the request. If the request sets min_tokens and max_tokens to the same value and that value is higher than would fit, max_tokens gets overridden and the request fails with a "min_tokens must be less than or equal to max_tokens" error.
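A rough sketch of that interaction, assuming the same 64-token padding; the function and variable names here are hypothetical, not the actual vLLM validation code:

def clamp_and_validate(min_tokens: int, max_tokens: int,
                       prompt_len: int, max_model_len: int) -> int:
    padded_prompt_len = -(-prompt_len // 64) * 64   # pad prompt to a 64-token block
    fit = max_model_len - padded_prompt_len         # output tokens that actually fit
    max_tokens = min(max_tokens, fit)               # the silent override
    if min_tokens > max_tokens:
        # the error seen when the request asked for min_tokens == max_tokens
        # but that value no longer fits in the remaining context
        raise ValueError("min_tokens must be less than or equal to max_tokens")
    return max_tokens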

@joerunde (Collaborator) commented:

Nice!
Gonna go ahead and merge, because I am of course trying to slide in a little release on a Friday afternoon

@joerunde merged commit 00ec338 into main on Oct 10, 2025 (20 checks passed).
@joerunde deleted the ysc-fix-max-tokens-chat branch on October 10, 2025 at 23:13.