
Conversation

@gkumbhat (Collaborator)

Description

  • Replace the FP8 model-name check with a quantization check. A user could rename their model, which would leave things in an inconsistent state, and if they are trying to use a pre-compiled model it would trigger recompilation. (A sketch of the idea follows below.)
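
A minimal sketch of the kind of change described, assuming vLLM's model config exposes the quantization method as model_config.quantization; the exact attribute and surrounding code in vllm-spyre may differ:

# Before: brittle check on the model name, which a user can change
is_fp8 = 'FP8' in self.model_config.model

# After (assumed attribute): check the quantization method reported by the
# model config, which does not depend on what the model is called
is_fp8 = self.model_config.quantization == "fp8"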

Related Issues

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@gkumbhat marked this pull request as ready for review on October 16, 2025 at 21:33
@prashantgupta24 (Collaborator) left a comment:

lgtm


Inline review comment on this part of the diff (excerpt, truncated as shown on the page):

# TODO: we need 2 requests for warmup on FP8+CB
is_fp8_plus_cb = 'FP8' in self.model_config.model and \
# Check if model is quantized
Collaborator:

What about other quantizations like 4 bit or 8 bit int?

There is some more code from @joerunde and some I recently wrote to do this kind of check, which could be adapted and used here:

Collaborator:

We'll have to refactor all these checks once we support more quantization methods; today we only support fp8. I wouldn't mind a refactor to at least pull all of these fp8 checks into one helper instance method in the model class, but we don't need to block this fix on it.

I do think this is a separate problem from figuring out which model we're serving, though, because any fp8 model has to be handled separately here, not just granite specifically.
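
A rough sketch of the helper-method idea, purely illustrative: the attribute name model_config.quantization and the standalone-function shape are assumptions, not taken from the actual diff.

def is_fp8_quantized(model_config) -> bool:
    # Hypothetical helper consolidating the FP8 checks; in the real code
    # this would likely be an instance method on the model/runner class,
    # as suggested in review.
    quant = getattr(model_config, "quantization", None)
    return quant is not None and str(quant).lower() == "fp8"

A call site such as the FP8+CB warmup check could then derive its condition from this helper instead of repeating the string comparison, keeping the quantization logic in one place.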

Collaborator:

Created a follow-up issue: #537

@joerunde (Collaborator) left a comment:

Lgtm too

@joerunde (Collaborator):

bot:test
MARKERS="spyre and quantized and not multi"

@joerunde (Collaborator):

I don't believe in the bot tests any more, but we might as well try to kick one off to make sure nothing barfs in a new and unusual way.

@joerunde merged commit 3e35a3a into vllm-project:main on Oct 17, 2025
21 checks passed