
Enable Nvidia's ModelOpt fp8 quantized models #2535

Open · Edwardf0t1 wants to merge 10 commits into main from zhiyu/enable-modelopt-fp8

Conversation

Edwardf0t1 (Contributor) commented on Dec 21, 2024:

Motivation

As discussed in our sync meeting with @merrymercy and @Ying1123, we aim to contribute to SGLang by integrating NVIDIA's TensorRT Model Optimizer (ModelOpt) and its optimized, quantized models, fostering collaboration to strengthen the open-source inference ecosystem.

Modifications

This PR serves as an initial step toward adding support for ModelOpt quantized models in SGLang, starting with FP8 LLaMA 3.1 model inference. A basic test can be executed using the script provided below.

import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Load the ModelOpt FP8-quantized Llama 3.1 checkpoint for offline inference.
    llm = sgl.Engine(
        model_path="nvidia/Llama-3.1-8B-Instruct-FP8",
        quantization="modelopt",
    )

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

python/sglang/srt/layers/modelopt_quant.py — 3 review threads (outdated, resolved)
zhyncs (Member) commented on Dec 21, 2024:

@Edwardf0t1 Please help resolve the conflicts

Edwardf0t1 (Contributor, Author) replied:

> @Edwardf0t1 Please help resolve the conflicts

Done

python/pyproject.toml — review thread (outdated, resolved)
zhyncs added the high priority and quant (LLM Quantization) labels and removed the await-response label on Dec 31, 2024
Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 84aee77 to e847fac on December 31, 2024 18:11
merrymercy (Contributor) left a review comment:

Also, please fix the CI errors.

@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
     "psutil", "pydantic", "python-multipart",
     "pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
     "xgrammar>=0.1.6"]
-srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10"]
+srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10", "nvidia-modelopt"]
A contributor commented on this change:

can we make this an optional dependency?

Edwardf0t1 (Contributor, Author) replied:

IIUC this is already under [project.optional-dependencies]. Also see this comment from @zhyncs: #2535 (comment)
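For context, a minimal sketch of how an optional-dependency guard for this extra could look on the SGLang side; it only assumes that the nvidia-modelopt wheel exposes a modelopt module, and the actual check in this PR may differ:

# Hypothetical import guard for the optional nvidia-modelopt dependency.
try:
    import modelopt  # installed via the srt extra or `pip install nvidia-modelopt`

    HAS_MODELOPT = True
except ImportError:
    HAS_MODELOPT = False


def require_modelopt() -> None:
    # Hypothetical helper: fail early with a clear message when the optional
    # package is missing but quantization="modelopt" was requested.
    if not HAS_MODELOPT:
        raise ImportError(
            'quantization="modelopt" requires the optional nvidia-modelopt package'
        )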

Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 687ae9b to 4cecb9c on January 3, 2025 00:38
Edwardf0t1 (Contributor, Author) commented:

Hi @merrymercy, I left a comment on your recently merged PR: I found it can cause issues in my test when running llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt").

merrymercy (Contributor) commented on Jan 3, 2025:

I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

Edwardf0t1 (Contributor, Author) replied:

> I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

I can use the ens8np0 or enp2s0 interface for GLOO_SOCKET_IFNAME, depending on the system.
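For readers hitting the same issue, a minimal sketch of pinning the Gloo interface before engine startup; the interface name used here (ens8np0, one of the two mentioned above) is host-specific, so substitute whatever `ip link` reports on your machine:

import os

# Host-specific assumption: choose the NIC that Gloo should bind to
# (ens8np0 and enp2s0 were the examples given above).
os.environ["GLOO_SOCKET_IFNAME"] = "ens8np0"

import sglang as sgl

llm = sgl.Engine(
    model_path="nvidia/Llama-3.1-8B-Instruct-FP8",
    quantization="modelopt",
)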

Labels: high priority, quant (LLM Quantization)
Projects: none yet
4 participants