improve performance by removing use_tensor_core dependency #2496

Open · wants to merge 7 commits into main

Conversation

@bjmsong (Collaborator) commented Dec 17, 2024

Motivation

According to the discussion in the linked PR.

Modifications

  • flashinfer <= 0.1.6 (see the sketch below)

| kv_cache_dtype | gqa_group_size | before | after |
| --- | --- | --- | --- |
| torch.float16, torch.half, torch.bfloat16 | 8 | use_tensor_cores=False | use_tensor_cores=True |
| torch.float8_e4m3fn, torch.float8_e5m2 | [1, 2, 4, 8] | use_tensor_cores=False | use_tensor_cores=True |
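
Below is a minimal sketch of the resulting selection logic, assuming a hypothetical helper function (the actual change lives in sglang's flashinfer attention backend and may be structured differently):

```python
import torch

# Hypothetical helper mirroring the table above: these are the cases where
# this PR switches flashinfer's decode kernel to the tensor-core path.
def choose_use_tensor_cores(kv_cache_dtype: torch.dtype, gqa_group_size: int) -> bool:
    # FP8 KV cache: use tensor cores for all of the listed GQA group sizes.
    if kv_cache_dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        return gqa_group_size in (1, 2, 4, 8)
    # 16-bit KV cache (float16/half/bfloat16): use tensor cores at group size 8.
    if kv_cache_dtype in (torch.float16, torch.bfloat16):
        return gqa_group_size >= 8
    # Otherwise keep the non-tensor-core decode kernel.
    return False
```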

E2E Test

  • A100 (40GB)
python -m sglang.launch_server --model-path {Yi-1.5-9B-Chat}

python -m sglang.bench_serving --backend sglang --model {Yi-1.5-9B-Chat} --dataset-name random --random-output-len 4096 --num-prompts 32
  • Throughput: 931 toks/s -> 1171 toks/s
  • Median ITL: 20.56 ms -> 18.97 ms
# before
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  87.96     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    54217     
Request throughput (req/s):              0.36      
Input token throughput (tok/s):          194.49    
Output token throughput (tok/s):         736.98    
Total token throughput (tok/s):          931.47    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   46408.83  
Median E2E Latency (ms):                 39137.18  
---------------Time to First Token----------------
Mean TTFT (ms):                          1150.39   
Median TTFT (ms):                        1517.29   
P99 TTFT (ms):                           1581.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.67     
Median TPOT (ms):                        20.82     
P99 TPOT (ms):                           24.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.70     
Median ITL (ms):                         20.56     
P99 ITL (ms):                            39.10     
==================================================

# after
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  69.94     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    57593     
Request throughput (req/s):              0.46      
Input token throughput (tok/s):          244.62    
Output token throughput (tok/s):         926.93    
Total token throughput (tok/s):          1171.56   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39461.97  
Median E2E Latency (ms):                 37171.11  
---------------Time to First Token----------------
Mean TTFT (ms):                          1157.81   
Median TTFT (ms):                        1529.05   
P99 TTFT (ms):                           1592.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.20     
Median TPOT (ms):                        19.14     
P99 TPOT (ms):                           21.97     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.21     
Median ITL (ms):                         18.97     
P99 ITL (ms):                            22.05     
==================================================
python -m sglang.launch_server --model-path {Meta-Llama-3.1-8B-Instruct-FP8} --kv-cache-dtype fp8_e5m2

python -m sglang.bench_serving --backend sglang --model {Meta-Llama-3.1-8B-Instruct-FP8} --dataset-name random --random-output-len 4096 --num-prompts 32
  • Throughput: 1410 toks/s -> 1735 toks/s
  • Median ITL: 14.79 ms -> 11.89 ms
# before
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  58.09     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    64747     
Request throughput (req/s):              0.55      
Input token throughput (tok/s):          294.51    
Output token throughput (tok/s):         1115.97   
Total token throughput (tok/s):          1410.49   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   31550.18  
Median E2E Latency (ms):                 30155.63  
---------------Time to First Token----------------
Mean TTFT (ms):                          1366.20   
Median TTFT (ms):                        1652.97   
P99 TTFT (ms):                           1671.34   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.98     
Median TPOT (ms):                        15.06     
P99 TPOT (ms):                           15.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.91     
Median ITL (ms):                         14.79     
P99 ITL (ms):                            17.01     
==================================================

# after
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     32        
Benchmark duration (s):                  47.21     
Total input tokens:                      17108     
Total generated tokens:                  64826     
Total generated tokens (retokenized):    64799     
Request throughput (req/s):              0.68      
Input token throughput (tok/s):          362.41    
Output token throughput (tok/s):         1373.26   
Total token throughput (tok/s):          1735.67   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25458.49  
Median E2E Latency (ms):                 24172.10  
---------------Time to First Token----------------
Mean TTFT (ms):                          1374.96   
Median TTFT (ms):                        1651.89   
P99 TTFT (ms):                           1669.89   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.07     
Median TPOT (ms):                        12.01     
P99 TPOT (ms):                           13.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.89     
Median ITL (ms):                         11.89     
P99 ITL (ms):                            13.27     
==================================================

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@merrymercy (Contributor)

Previously, we found an accuracy regression if we remove this. Did you test the accuracy of all models?

@bjmsong (Collaborator, Author) commented Dec 18, 2024

Results from test_eval_accuracy_large.py:

| Model | kv-cache-dtype | eval_name | Before | After |
| --- | --- | --- | --- | --- |
| Yi-1.5-9B-Chat | auto | mmlu | 0.641 | 0.645 |
| | | humaneval | 0.616 | 0.615 |
| | | mgsm_en | 0.868 | 0.880 |
| Meta-Llama-3.1-8B-Instruct-FP8 | fp8_e5m2 | mmlu | 0.706 | 0.707 |
| | | humaneval | 0.666 | 0.660 |
| | | mgsm_en | 0.856 | 0.852 |

@merrymercy (Contributor)

What is your hardware?

@bjmsong (Collaborator, Author) commented Dec 27, 2024

A100 (40GB)

@merrymercy (Contributor)

Can you test the command in PR #1511?

If it passes, we can merge this.

@bjmsong (Collaborator, Author) commented Dec 27, 2024

python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tp 4 --enable-p2p-check

python python/sglang/test/run_eval.py --host 127.0.0.1 --port 30000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --eval-name humaneval

Ran the test on 4× L20 twice; here are the results:

| Round | Before | After |
| --- | --- | --- |
| First | 0.751 | 0.750 |
| Second | 0.748 | 0.750 |

@merrymercy (Contributor)

Thanks for testing this! The previous bug only happens on certain hardware. We reproduced it once on 8xA100 40GB. That is why I am asking which hardware you are using. Is it possible for you to run it again on 8xA100 40GB with the exact command?

Especially since @yzh119 also acknowledged it is a bug, we need to be very careful here.

@merrymercy self-assigned this Dec 29, 2024

@bjmsong (Collaborator, Author) commented Dec 29, 2024

@merrymercy Ran the test on 8× A100 (40GB) three times; here are the results:

| Round | Before | After |
| --- | --- | --- |
| 1st | 0.752 | 0.750 |
| 2nd | 0.750 | 0.749 |
| 3rd | 0.750 | 0.751 |

python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tp 8 --enable-p2p-check --mem-fraction-static 0.8

python python/sglang/test/run_eval.py --host 127.0.0.1 --port 30000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --eval-name humaneval

@bjmsong (Collaborator, Author) commented Jan 10, 2025

Hi @merrymercy, do you have time to review this PR? Thank you!

@merrymercy (Contributor) commented Jan 13, 2025

We probably won't merge this because:

  1. This is a known bug confirmed by Revert "kernel: use tensor cores for flashinfer gqa kernels" #1511.
  2. You can override the decision by setting the environment variable:
    env_override = os.environ.get("SGLANG_FLASHINFER_USE_TENSOR_CORE")
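
For reference, a sketch of how that override takes effect (the env lookup is quoted from the line above; the surrounding function and its name are illustrative):

```python
import os

# Illustrative wrapper: an explicit "true"/"false" in the environment
# variable overrides the automatic use_tensor_cores decision.
def resolve_use_tensor_cores(auto_decision: bool) -> bool:
    env_override = os.environ.get("SGLANG_FLASHINFER_USE_TENSOR_CORE")
    if env_override is not None:
        return env_override.lower() == "true"
    return auto_decision
```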

Can you try to run the exact command in #1511 again? It uses TP4 on A100 80GB.

# reproduce
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tp 4 --enable-p2p-check
python3 run_eval.py --host 127.0.0.1 --port 30000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --eval-name humaneval

@bjmsong (Collaborator, Author) commented Jan 17, 2025

Tested on 4× A100 (80GB); there is no accuracy degradation now.

| | humaneval |
| --- | --- |
| Before | 0.750 |
| After | 0.750 |
# reproduce
python3 -m sglang.launch_server --model ${Meta-Llama-3.1-70B-Instruct} --tp 4 --enable-p2p-check

python3 python/sglang/test/run_eval.py --host 127.0.0.1 --port 30000 --model ${Meta-Llama-3.1-70B-Instruct} --eval-name humaneval
