[Feature] support TransformerEngine to enable communication overlap #2627

Draft: wants to merge 32 commits into base: main

@Zhuohao-Li commented Dec 28, 2024

Motivation

  • Support the NVIDIA TransformerEngine (TE) library in sglang

Modifications

  • Add TE support for llama models (entry class TELlamaForCausalLM)
  • Add an enable-te option to the server args
  • Add benchmark results for TELlama models

TODO

  • Evaluate and tune performance across various configs and demonstrate the performance gain
  • Support more models with unified configs
  • Add te.fp8 support

How to run

(for more details, see /comm-overlapp/README)

build env:

docker pull zhuohaol/sglang-te:latest

docker run -it --shm-size 32g --gpus all -p 30001:30001 --ipc=host --rm zhuohaol/sglang-te:latest

git clone -b zhuohaol-comm-overlap https://github.com/Zhuohao-Li/sglang.git

cd sglang/python

huggingface-cli login

eval:

python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 1 16 64 128 --input-len 256 512 --output-len 32 256 --run-name test_run --tp 4 --enable-te
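The eval command passes several values to --batch, --input-len, and --output-len. The sketch below enumerates the resulting configurations, assuming the benchmark sweeps the Cartesian product of the three lists with batch size varying slowest. That ordering is an assumption, not something stated in this PR, although it is consistent with the 16 benchmark runs in the log. The helper name `sweep_configs` is hypothetical.

```python
from itertools import product

# Values taken from the eval command in this PR description.
batch_sizes = [1, 16, 64, 128]   # --batch
input_lens = [256, 512]          # --input-len
output_lens = [32, 256]          # --output-len


def sweep_configs(batch_sizes, input_lens, output_lens):
    """Return every (batch, input_len, output_len) combination.

    Assumption: the benchmark iterates the Cartesian product with
    batch size as the outermost loop.
    """
    return list(product(batch_sizes, input_lens, output_lens))


configs = sweep_configs(batch_sizes, input_lens, output_lens)
for bs, il, ol in configs:
    # bs * il is the number of tokens processed in the prefill step.
    print(f"batch={bs:<4} input_len={il:<4} output_len={ol:<4} prefill_tokens={bs * il}")
```

Under this assumption, 4 batch sizes, 2 input lengths, and 2 output lengths give 16 configurations per sweep.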

launch server:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0 --tp 4 --enable-te 
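Once the server is running, it can be queried over HTTP. The following is a minimal sketch using only the Python standard library against sglang's native /generate endpoint. The helper names `build_generate_request` and `generate`, the default host, and the sampling values are illustrative assumptions. Note that the docker run command above publishes port 30001 while the server listens on 30000, so the port may need adjusting when calling from outside the container.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=32, host="127.0.0.1", port=30000):
    """Build a POST request for the /generate endpoint (hypothetical helper)."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())


# Usage (requires a running server):
#   result = generate("The capital of France is", max_new_tokens=8)
#   print(result)
```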

benchmark results:

max_total_num_tokens=2006969
Warmup ...
Prefill. latency: 1.05858 s, throughput:    241.83 token/s
Decode.  latency: 0.00876 s, throughput:    114.21 token/s
Decode.  latency: 0.00390 s, throughput:    256.56 token/s
Decode.  latency: 0.00385 s, throughput:    259.68 token/s
Decode.  latency: 0.00382 s, throughput:    261.54 token/s
Decode.  latency: 0.00391 s, throughput:    255.49 token/s
Decode.  median latency: 0.00385 s, median throughput:    259.68 token/s
Total. latency:  1.090 s, throughput:    242.10 token/s
Benchmark ...
Prefill. latency: 0.02279 s, throughput:  11232.08 token/s
Decode.  latency: 0.00397 s, throughput:    252.00 token/s
Decode.  latency: 0.00385 s, throughput:    259.44 token/s
Decode.  latency: 0.00383 s, throughput:    261.34 token/s
Decode.  latency: 0.00381 s, throughput:    262.21 token/s
Decode.  latency: 0.00382 s, throughput:    261.78 token/s
Decode.  median latency: 0.00383 s, median throughput:    261.38 token/s
Total. latency:  0.141 s, throughput:   2035.53 token/s
Prefill. latency: 0.02370 s, throughput:  10801.25 token/s
Decode.  latency: 0.00396 s, throughput:    252.79 token/s
Decode.  latency: 0.00383 s, throughput:    261.02 token/s
Decode.  latency: 0.00383 s, throughput:    261.25 token/s
Decode.  latency: 0.00382 s, throughput:    261.49 token/s
Decode.  latency: 0.00380 s, throughput:    263.08 token/s
Decode.  median latency: 0.00381 s, median throughput:    262.41 token/s
Total. latency:  1.002 s, throughput:    511.06 token/s
Prefill. latency: 0.02941 s, throughput:  17410.81 token/s
Decode.  latency: 0.00401 s, throughput:    249.60 token/s
Decode.  latency: 0.00381 s, throughput:    262.64 token/s
Decode.  latency: 0.00382 s, throughput:    262.05 token/s
Decode.  latency: 0.00380 s, throughput:    263.48 token/s
Decode.  latency: 0.00384 s, throughput:    260.56 token/s
Decode.  median latency: 0.00382 s, median throughput:    262.11 token/s
Total. latency:  0.148 s, throughput:   3678.89 token/s
Prefill. latency: 0.02453 s, throughput:  20871.44 token/s
Decode.  latency: 0.00391 s, throughput:    255.59 token/s
Decode.  latency: 0.00381 s, throughput:    262.67 token/s
Decode.  latency: 0.00384 s, throughput:    260.29 token/s
Decode.  latency: 0.00382 s, throughput:    261.52 token/s
Decode.  latency: 0.00384 s, throughput:    260.66 token/s
Decode.  median latency: 0.00381 s, median throughput:    262.13 token/s
Total. latency:  0.999 s, throughput:    768.66 token/s
Prefill. latency: 0.35192 s, throughput:  11638.97 token/s
Decode.  latency: 0.00738 s, throughput:   2167.53 token/s
Decode.  latency: 0.00432 s, throughput:   3702.76 token/s
Decode.  latency: 0.00432 s, throughput:   3704.19 token/s
Decode.  latency: 0.00430 s, throughput:   3718.15 token/s
Decode.  latency: 0.00431 s, throughput:   3711.16 token/s
Decode.  median latency: 0.00431 s, median throughput:   3712.80 token/s
Total. latency:  0.489 s, throughput:   9429.91 token/s
Prefill. latency: 0.04450 s, throughput:  92039.78 token/s
Decode.  latency: 0.00431 s, throughput:   3714.03 token/s
Decode.  latency: 0.00428 s, throughput:   3735.33 token/s
Decode.  latency: 0.00432 s, throughput:   3706.85 token/s
Decode.  latency: 0.00427 s, throughput:   3743.87 token/s
Decode.  latency: 0.00431 s, throughput:   3715.27 token/s
Decode.  median latency: 0.00427 s, median throughput:   3743.66 token/s
Total. latency:  1.137 s, throughput:   7202.44 token/s
Prefill. latency: 0.08355 s, throughput:  98047.98 token/s
Decode.  latency: 0.00440 s, throughput:   3632.81 token/s
Decode.  latency: 0.00433 s, throughput:   3691.36 token/s
Decode.  latency: 0.00429 s, throughput:   3725.79 token/s
Decode.  latency: 0.00431 s, throughput:   3715.27 token/s
Decode.  latency: 0.00434 s, throughput:   3689.53 token/s
Decode.  median latency: 0.00433 s, median throughput:   3694.40 token/s
Total. latency:  0.218 s, throughput:  39932.16 token/s
Prefill. latency: 0.07986 s, throughput: 102577.10 token/s
Decode.  latency: 0.00439 s, throughput:   3643.06 token/s
Decode.  latency: 0.00433 s, throughput:   3693.18 token/s
Decode.  latency: 0.00433 s, throughput:   3698.48 token/s
Decode.  latency: 0.00433 s, throughput:   3694.40 token/s
Decode.  latency: 0.00430 s, throughput:   3722.27 token/s
Decode.  median latency: 0.00433 s, median throughput:   3694.40 token/s
Total. latency:  1.184 s, throughput:  10375.67 token/s
Prefill. latency: 0.15013 s, throughput: 109133.45 token/s
Decode.  latency: 0.00810 s, throughput:   7900.74 token/s
Decode.  latency: 0.00531 s, throughput:  12042.32 token/s
Decode.  latency: 0.00530 s, throughput:  12077.00 token/s
Decode.  latency: 0.00529 s, throughput:  12098.23 token/s
Decode.  latency: 0.00530 s, throughput:  12086.79 token/s
Decode.  median latency: 0.00529 s, median throughput:  12092.23 token/s
Total. latency:  0.317 s, throughput:  58088.33 token/s
Prefill. latency: 0.14845 s, throughput: 110368.27 token/s
Decode.  latency: 0.00550 s, throughput:  11644.27 token/s
Decode.  latency: 0.00536 s, throughput:  11950.65 token/s
Decode.  latency: 0.00531 s, throughput:  12053.14 token/s
Decode.  latency: 0.00532 s, throughput:  12028.29 token/s
Decode.  latency: 0.00537 s, throughput:  11920.93 token/s
Decode.  median latency: 0.00529 s, median throughput:  12096.59 token/s
Total. latency:  1.506 s, throughput:  21762.03 token/s
Prefill. latency: 0.30551 s, throughput: 107256.37 token/s
Decode.  latency: 0.00574 s, throughput:  11151.36 token/s
Decode.  latency: 0.00551 s, throughput:  11607.02 token/s
Decode.  latency: 0.00550 s, throughput:  11646.30 token/s
Decode.  latency: 0.00551 s, throughput:  11623.10 token/s
Decode.  latency: 0.00548 s, throughput:  11676.69 token/s
Decode.  median latency: 0.00545 s, median throughput:  11738.99 token/s
Total. latency:  0.475 s, throughput:  73270.12 token/s
Prefill. latency: 0.29033 s, throughput: 112863.12 token/s
Decode.  latency: 0.00586 s, throughput:  10913.34 token/s
Decode.  latency: 0.00555 s, throughput:  11528.75 token/s
Decode.  latency: 0.00551 s, throughput:  11620.08 token/s
Decode.  latency: 0.00550 s, throughput:  11630.15 token/s
Decode.  latency: 0.00554 s, throughput:  11552.57 token/s
Decode.  median latency: 0.00542 s, median throughput:  11817.54 token/s
Total. latency:  1.680 s, throughput:  29261.57 token/s
Prefill. latency: 0.29037 s, throughput: 112847.37 token/s
Decode.  latency: 0.00719 s, throughput:  17812.57 token/s
Decode.  latency: 0.00602 s, throughput:  21251.27 token/s
Decode.  latency: 0.00601 s, throughput:  21290.04 token/s
Decode.  latency: 0.00601 s, throughput:  21313.70 token/s
Decode.  latency: 0.00599 s, throughput:  21356.10 token/s
Decode.  median latency: 0.00590 s, median throughput:  21711.93 token/s
Total. latency:  0.477 s, throughput:  77236.81 token/s
Prefill. latency: 0.29080 s, throughput: 112681.30 token/s
Decode.  latency: 0.00628 s, throughput:  20377.70 token/s
Decode.  latency: 0.00595 s, throughput:  21497.19 token/s
Decode.  latency: 0.00596 s, throughput:  21473.98 token/s
Decode.  latency: 0.00592 s, throughput:  21631.45 token/s
Decode.  latency: 0.00589 s, throughput:  21718.96 token/s
Decode.  median latency: 0.00592 s, median throughput:  21631.45 token/s
Total. latency:  1.809 s, throughput:  36235.08 token/s
Prefill. latency: 0.57779 s, throughput: 113425.43 token/s
Decode.  latency: 0.00695 s, throughput:  18404.27 token/s
Decode.  latency: 0.00626 s, throughput:  20452.23 token/s
Decode.  latency: 0.00626 s, throughput:  20445.22 token/s
Decode.  latency: 0.00621 s, throughput:  20610.04 token/s
Decode.  latency: 0.00615 s, throughput:  20805.72 token/s
Decode.  median latency: 0.00616 s, median throughput:  20792.83 token/s
Total. latency:  0.770 s, throughput:  90477.79 token/s
Prefill. latency: 0.57930 s, throughput: 113130.21 token/s
Decode.  latency: 0.00685 s, throughput:  18682.87 token/s
Decode.  latency: 0.00636 s, throughput:  20128.63 token/s
Decode.  latency: 0.00624 s, throughput:  20516.31 token/s
Decode.  latency: 0.00618 s, throughput:  20728.61 token/s
Decode.  latency: 0.00618 s, throughput:  20715.01 token/s
Decode.  median latency: 0.00630 s, median throughput:  20306.02 token/s
Total. latency:  2.190 s, throughput:  44886.46 token/s
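The log above is easier to compare across configurations once it is parsed. The sketch below is one way to do that, assuming each run ends at its "Total." line. The regular expression and the `parse_runs` helper are illustrative and not part of sglang. The sample is the first benchmark run copied from the log. Multiplying latency by throughput recovers the tokens processed per step, which is a quick sanity check on the numbers.

```python
import re

# First benchmark run, copied verbatim from the log above.
SAMPLE = """\
Prefill. latency: 0.02279 s, throughput:  11232.08 token/s
Decode.  latency: 0.00397 s, throughput:    252.00 token/s
Decode.  latency: 0.00385 s, throughput:    259.44 token/s
Decode.  latency: 0.00383 s, throughput:    261.34 token/s
Decode.  latency: 0.00381 s, throughput:    262.21 token/s
Decode.  latency: 0.00382 s, throughput:    261.78 token/s
Decode.  median latency: 0.00383 s, median throughput:    261.38 token/s
Total. latency:  0.141 s, throughput:   2035.53 token/s
"""

LINE_RE = re.compile(
    r"^(Prefill|Decode|Total)\.\s+(median )?latency:\s*([\d.]+) s, "
    r"(?:median )?throughput:\s*([\d.]+) token/s$"
)


def parse_runs(log_text):
    """Group log lines into runs; a run is closed by its 'Total.' line."""
    runs = []
    current = {"decode": []}
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip lines such as 'Warmup ...' or 'Benchmark ...'
        kind, is_median = match.group(1), match.group(2) is not None
        pair = (float(match.group(3)), float(match.group(4)))  # (latency s, token/s)
        if kind == "Prefill":
            current["prefill"] = pair
        elif kind == "Decode" and is_median:
            current["decode_median"] = pair
        elif kind == "Decode":
            current["decode"].append(pair)
        else:  # Total
            current["total"] = pair
            runs.append(current)
            current = {"decode": []}
    return runs


runs = parse_runs(SAMPLE)
prefill_latency, prefill_throughput = runs[0]["prefill"]
# latency * throughput approximates the number of prefill tokens (batch * input_len).
print("prefill tokens:", round(prefill_latency * prefill_throughput))
print("median decode throughput:", runs[0]["decode_median"][1], "token/s")
```

For the sample run, the prefill step works out to 256 tokens and each decode step to about 1 token, which matches batch size 1 with input length 256.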

note: unrelated files will be removed once tuning is finished and the maximum performance gain has been reached

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.
