
Enable Nvidia's ModelOpt fp8 quantized models #2535

Open · Edwardf0t1 wants to merge 10 commits into main from zhiyu/enable-modelopt-fp8

Conversation

Edwardf0t1 (Contributor) commented on Dec 21, 2024:

Motivation

As discussed in our sync meeting with @merrymercy and @Ying1123, we aim to contribute to SGLang by integrating NVIDIA's TensorRT Model Optimizer (ModelOpt) and its optimized, quantized models, fostering collaboration to strengthen the open-source inference ecosystem.

Modifications

This PR serves as an initial step toward adding support for ModelOpt quantized models in SGLang, starting with FP8 LLaMA 3.1 model inference. A basic test can be executed using the script provided below.

import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Load the ModelOpt FP8-quantized Llama 3.1 checkpoint for offline inference.
    llm = sgl.Engine(
        model_path="nvidia/Llama-3.1-8B-Instruct-FP8",
        quantization="modelopt",
    )

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

python/sglang/srt/layers/modelopt_quant.py — 3 review threads (outdated, resolved)
zhyncs (Member) commented on Dec 21, 2024:

@Edwardf0t1 Please help resolve the conflicts

Edwardf0t1 (Contributor, Author) replied:

> @Edwardf0t1 Please help resolve the conflicts

Done

python/pyproject.toml — review thread (outdated, resolved)
zhyncs added the high priority and quant (LLM Quantization) labels and removed the await-response label on Dec 31, 2024
Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 84aee77 to e847fac on December 31, 2024 18:11
merrymercy (Contributor) left a review comment:

Also, please fix the CI errors.

@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
     "psutil", "pydantic", "python-multipart",
     "pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
     "xgrammar>=0.1.6"]
-srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10"]
+srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10", "nvidia-modelopt"]
A contributor commented on this change:

can we make this an optional dependency?

Edwardf0t1 (Contributor, Author) replied:

IIUC this is already under [project.optional-dependencies]. Also see this comment from @zhyncs: #2535 (comment)
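For context, a minimal sketch of how an optional-dependency guard for this extra could look on the SGLang side; it only assumes that the nvidia-modelopt wheel exposes a modelopt module, and the actual check in this PR may differ:

# Hypothetical import guard for the optional nvidia-modelopt dependency.
try:
    import modelopt  # installed via the srt extra or `pip install nvidia-modelopt`

    HAS_MODELOPT = True
except ImportError:
    HAS_MODELOPT = False


def require_modelopt() -> None:
    # Hypothetical helper: fail early with a clear message when the optional
    # package is missing but quantization="modelopt" was requested.
    if not HAS_MODELOPT:
        raise ImportError(
            'quantization="modelopt" requires the optional nvidia-modelopt package'
        )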

Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 687ae9b to 4cecb9c on January 3, 2025 00:38
Edwardf0t1 (Contributor, Author) commented:

Hi @merrymercy, I left a comment on your recently merged PR: I found it can cause issues in my test when running llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt").

merrymercy (Contributor) commented on Jan 3, 2025:

I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

Edwardf0t1 (Contributor, Author) replied:

> I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

I can use the ens8np0 or enp2s0 interface for GLOO_SOCKET_IFNAME, depending on the system.
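For readers hitting the same issue, a minimal sketch of pinning the Gloo interface before engine startup; the interface name used here (ens8np0, one of the two mentioned above) is host-specific, so substitute whatever `ip link` reports on your machine:

import os

# Host-specific assumption: choose the NIC that Gloo should bind to
# (ens8np0 and enp2s0 were the examples given above).
os.environ["GLOO_SOCKET_IFNAME"] = "ens8np0"

import sglang as sgl

llm = sgl.Engine(
    model_path="nvidia/Llama-3.1-8B-Instruct-FP8",
    quantization="modelopt",
)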

Labels: high priority, quant (LLM Quantization)
Projects: none yet
4 participants