improve performance by removing use_tensor_core dependency #2496

Open · wants to merge 7 commits into main

Conversation

@bjmsong (Collaborator) commented Dec 17, 2024

Motivation

According to the discussion in the linked PR.

Modifications

  • flashinfer <= 0.1.6 (see the sketch below)

| kv_cache_dtype | gqa_group_size | before | after |
| --- | --- | --- | --- |
| torch.float16, torch.half, torch.bfloat16 | 8 | use_tensor_cores=False | use_tensor_cores=True |
| torch.float8_e4m3fn, torch.float8_e5m2 | [1, 2, 4, 8] | use_tensor_cores=False | use_tensor_cores=True |
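
Below is a minimal sketch of the resulting selection logic, assuming a hypothetical helper function (the actual change lives in sglang's flashinfer attention backend and may be structured differently):

```python
import torch

# Hypothetical helper mirroring the table above: these are the cases where
# this PR switches flashinfer's decode kernel to the tensor-core path.
def choose_use_tensor_cores(kv_cache_dtype: torch.dtype, gqa_group_size: int) -> bool:
    # FP8 KV cache: use tensor cores for all of the listed GQA group sizes.
    if kv_cache_dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        return gqa_group_size in (1, 2, 4, 8)
    # 16-bit KV cache (float16/half/bfloat16): use tensor cores at group size 8.
    if kv_cache_dtype in (torch.float16, torch.bfloat16):
        return gqa_group_size >= 8
    # Otherwise keep the non-tensor-core decode kernel.
    return False
```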

E2E Test

  • A100 (40GB)
python -m sglang.launch_server --model-path {Yi-1.5-9B-Chat}

python -m sglang.bench_serving --backend sglang --model {Yi-1.5-9B-Chat} --dataset-name random --random-output-len 4096 --num-prompts 32
  • Throughput: 931 toks/s -> 1171 toks/s
  • Median ITL: 20.56 ms -> 18.97 ms
# before
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  87.96     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    54217     
Request throughput (req/s):              0.36      
Input token throughput (tok/s):          194.49    
Output token throughput (tok/s):         736.98    
Total token throughput (tok/s):          931.47    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   46408.83  
Median E2E Latency (ms):                 39137.18  
---------------Time to First Token----------------
Mean TTFT (ms):                          1150.39   
Median TTFT (ms):                        1517.29   
P99 TTFT (ms):                           1581.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.67     
Median TPOT (ms):                        20.82     
P99 TPOT (ms):                           24.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.70     
Median ITL (ms):                         20.56     
P99 ITL (ms):                            39.10     
==================================================

# after
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  69.94     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    57593     
Request throughput (req/s):              0.46      
Input token throughput (tok/s):          244.62    
Output token throughput (tok/s):         926.93    
Total token throughput (tok/s):          1171.56   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39461.97  
Median E2E Latency (ms):                 37171.11  
---------------Time to First Token----------------
Mean TTFT (ms):                          1157.81   
Median TTFT (ms):                        1529.05   
P99 TTFT (ms):                           1592.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.20     
Median TPOT (ms):                        19.14     
P99 TPOT (ms):                           21.97     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.21     
Median ITL (ms):                         18.97     
P99 ITL (ms):                            22.05     
==================================================
python -m sglang.launch_server --model-path {Meta-Llama-3.1-8B-Instruct-FP8} --kv-cache-dtype fp8_e5m2

python -m sglang.bench_serving --backend sglang --model {Meta-Llama-3.1-8B-Instruct-FP8} --dataset-name random --random-output-len 4096 --num-prompts 32
  • Throughput: 1410 toks/s -> 1735 toks/s
  • Median ITL: 14.79 ms -> 11.89 ms
# before
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  58.09     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    64747     
Request throughput (req/s):              0.55      
Input token throughput (tok/s):          294.51    
Output token throughput (tok/s):         1115.97   
Total token throughput (tok/s):          1410.49   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   31550.18  
Median E2E Latency (ms):                 30155.63  
---------------Time to First Token----------------
Mean TTFT (ms):                          1366.20   
Median TTFT (ms):                        1652.97   
P99 TTFT (ms):                           1671.34   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.98     
Median TPOT (ms):                        15.06     
P99 TPOT (ms):                           15.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.91     
Median ITL (ms):                         14.79     
P99 ITL (ms):                            17.01     
==================================================

# after
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  47.21     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    64799     
Request throughput (req/s):              0.68      
Input token throughput (tok/s):          362.41    
Output token throughput (tok/s):         1373.26   
Total token throughput (tok/s):          1735.67   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25458.49  
Median E2E Latency (ms):                 24172.10  
---------------Time to First Token----------------
Mean TTFT (ms):                          1374.96   
Median TTFT (ms):                        1651.89   
P99 TTFT (ms):                           1669.89   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.07     
Median TPOT (ms):                        12.01     
P99 TPOT (ms):                           13.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.89     
Median ITL (ms):                         11.89     
P99 ITL (ms):                            13.27     
==================================================

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@merrymercy (Contributor)

Previously, we found an accuracy regression if we remove this. Did you test the accuracy of all models?

@bjmsong (Collaborator, Author) commented Dec 18, 2024

Results from test_eval_accuracy_large.py:

| Model | kv-cache-dtype | eval_name | Before | After |
| --- | --- | --- | --- | --- |
| Yi-1.5-9B-Chat | auto | mmlu | 0.641 | 0.645 |
| | | humaneval | 0.616 | 0.615 |
| | | mgsm_en | 0.868 | 0.880 |
| Meta-Llama-3.1-8B-Instruct-FP8 | fp8_e5m2 | mmlu | 0.706 | 0.707 |
| | | humaneval | 0.666 | 0.660 |
| | | mgsm_en | 0.856 | 0.852 |

@merrymercy (Contributor)

What is your hardware?

@bjmsong (Collaborator, Author) commented Dec 27, 2024

A100 (40GB)

@merrymercy (Contributor)

Can you test the command in PR #1511?

If it passes, we can merge this.

@bjmsong (Collaborator, Author) commented Dec 27, 2024

python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tp 4 --enable-p2p-check

python python/sglang/test/run_eval.py --host 127.0.0.1 --port 30000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --eval-name humaneval

Ran the test on 4× L20 twice; here are the results:

| Round | Before | After |
| --- | --- | --- |
| First | 0.751 | 0.750 |
| Second | 0.748 | 0.750 |

@merrymercy (Contributor)

Thanks for testing this! The previous bug only happens on certain hardware. We reproduced it once on 8xA100 40GB. That is why I am asking which hardware you are using. Is it possible for you to run it again on 8xA100 40GB with the exact command?

Especially since @yzh119 also acknowledged it is a bug, we need to be very careful here.

@merrymercy self-assigned this Dec 29, 2024

@bjmsong (Collaborator, Author) commented Dec 29, 2024

@merrymercy Ran the test on 8× A100 (40GB) three times; here are the results:

| Round | Before | After |
| --- | --- | --- |
| 1st | 0.752 | 0.750 |
| 2nd | 0.750 | 0.749 |
| 3rd | 0.750 | 0.751 |

python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tp 8 --enable-p2p-check --mem-fraction-static 0.8

python python/sglang/test/run_eval.py --host 127.0.0.1 --port 30000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --eval-name humaneval

@bjmsong (Collaborator, Author) commented Jan 10, 2025

Hi @merrymercy, do you have time to review this PR? Thank you!

@merrymercy (Contributor) commented Jan 13, 2025

We probably won't merge this because:

  1. This is a known bug confirmed by Revert "kernel: use tensor cores for flashinfer gqa kernels" #1511.
  2. You can override the decision by setting the environment variable:
    env_override = os.environ.get("SGLANG_FLASHINFER_USE_TENSOR_CORE")
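
For reference, a sketch of how that override takes effect (the env lookup is quoted from the line above; the surrounding function and its name are illustrative):

```python
import os

# Illustrative wrapper: an explicit "true"/"false" in the environment
# variable overrides the automatic use_tensor_cores decision.
def resolve_use_tensor_cores(auto_decision: bool) -> bool:
    env_override = os.environ.get("SGLANG_FLASHINFER_USE_TENSOR_CORE")
    if env_override is not None:
        return env_override.lower() == "true"
    return auto_decision
```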

Can you try to run the exact command in #1511 again? It uses TP4 on A100 80GB.

# reproduce
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tp 4 --enable-p2p-check
python3 run_eval.py --host 127.0.0.1 --port 30000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --eval-name humaneval

@bjmsong (Collaborator, Author) commented Jan 17, 2025

Tested on 4× A100 (80GB); there is no accuracy degradation now.

| | humaneval |
| --- | --- |
| Before | 0.750 |
| After | 0.750 |
# reproduce
python3 -m sglang.launch_server --model ${Meta-Llama-3.1-70B-Instruct} --tp 4 --enable-p2p-check

python3 python/sglang/test/run_eval.py --host 127.0.0.1 --port 30000 --model ${Meta-Llama-3.1-70B-Instruct} --eval-name humaneval
