Enable Nvidia's ModelOpt fp8 quantized models #2535
Conversation
@Edwardf0t1 Please help resolve the conflicts.
Force-pushed from d95ae5e to 1b98f9a.
Done.

Force-pushed from 84aee77 to e847fac.
Also, please fix the CI errors.
```diff
@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
     "psutil", "pydantic", "python-multipart",
     "pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
     "xgrammar>=0.1.6"]
-srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10"]
+srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10", "nvidia-modelopt"]
```
can we make this an optional dependency?
IIUC this is already under `[project.optional-dependencies]`. Also see this comment from @zhyncs: #2535 (comment)
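(For context: the `srt` group shown in the diff is itself an extra, so a plain `pip install sglang` does not pull in ModelOpt. If a more granular split were wanted, a dedicated extra could look like the sketch below; the extra name `modelopt` is hypothetical and not part of this PR.)

```toml
# Hypothetical pyproject.toml sketch -- not part of this PR.
[project.optional-dependencies]
# Keep the ModelOpt dependency behind its own extra, so that
# `pip install "sglang[modelopt]"` opts in while `sglang[srt]`
# alone would not require it.
modelopt = ["nvidia-modelopt"]
```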
Force-pushed from 687ae9b to 4cecb9c.
Hi @merrymercy, I left a comment on your recently merged PR; I found it could cause issues in my test when run …
I see. What is the correct value of …
I can use …
Motivation

As discussed in our sync meeting @merrymercy @Ying1123, we aim to contribute to SGLang by integrating NVIDIA's TensorRT Model Optimizer (ModelOpt) with optimized and quantized models, fostering collaboration to enhance the open-source inference ecosystem.

Modifications
This PR serves as an initial step toward adding support for ModelOpt-quantized models in SGLang, starting with FP8 LLaMA 3.1 model inference. A basic test can be executed using the script provided below.
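As a rough illustration of such a test, here is a minimal sketch that queries a running SGLang server over its native `/generate` endpoint. The launch command in the comment, the `--quantization modelopt` flag, and the checkpoint path are assumptions about how this integration is exposed, not details taken from this PR.

```python
# Minimal smoke test for an FP8 ModelOpt checkpoint served by SGLang.
# Assumes the server was started separately, e.g. (flags and path are
# illustrative, not taken from this PR):
#   python -m sglang.launch_server \
#       --model-path <modelopt-fp8-llama-3.1-checkpoint> \
#       --quantization modelopt --port 30000
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text"])  # generated continuation
```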