[Bug] The performance of v0.4.1 on AMD GPU is lower than v0.4.0 #2675

Open
wyy007 opened this issue Dec 31, 2024 · 1 comment

wyy007 commented Dec 31, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

We found that the performance test results on the latest sglang v0.4.1 are lower than on v0.4.0. The test results are shown below.
(screenshot: v0.4.0 results)
(screenshot: v0.4.1 results)

By comparing PyTorch profiler results, we found that the cost of the fwd_grouped_kernel_stage1 kernel has increased significantly.
(screenshot: PyTorch profiler comparison)
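For reference, here is a minimal sketch of how a kernel-level breakdown like the one above can be collected with torch.profiler. The use of the offline Engine API, the prompt, and the sampling params are assumptions for illustration, not the exact setup used for this report.

```python
# Minimal sketch (assumption: the sglang offline Engine API is available in this
# version). Prompt and sampling params are placeholders; on ROCm builds of
# PyTorch, the CUDA profiler activity records HIP kernels.
import sglang as sgl
from torch.profiler import profile, ProfilerActivity

engine = sgl.Engine(model_path="/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    engine.generate("Hello, world!", {"max_new_tokens": 128})

# Sort by accumulated GPU time to spot hot kernels such as fwd_grouped_kernel_stage1.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

engine.shutdown()
```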

Reproduction

The service startup commands are as follows.

[v0.4.1]
NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=0 NVTE_FUSED_ATTN_AOTRITON=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON OPTIMIZE_EPILOGUE=1 HIP_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path /mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype auto --attention-backend triton --sampling-backend pytorch --grammar-backend outlines --trust-remote-code --schedule-conservativeness 0.3 --enable-torch-compile --quantization gptq
WARNING 12-31 03:33:38 rocm.py:31] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
[2024-12-31 03:33:48] server_args=ServerArgs(model_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization='gptq', context_length=None, device='cuda', served_model_name='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', chat_template=None, is_embedding=False, revision=None, return_token_ids=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=0.3, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=371532406, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=True, torch_compile_max_bs=32, cuda_graph_max_bs=8, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)

[v0.4.0]
NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=0 NVTE_FUSED_ATTN_AOTRITON=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON OPTIMIZE_EPILOGUE=1 HIP_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path /mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype auto --attention-backend triton --sampling-backend pytorch --grammar-backend outlines --trust-remote-code --schedule-conservativeness 0.3 --enable-torch-compile --quantization gptq
WARNING 12-31 04:03:00 rocm.py:31] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
[2024-12-31 04:03:09] server_args=ServerArgs(model_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization='gptq', context_length=None, device='cuda', served_model_name='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=0.3, cpu_offload_gb=0, tp_size=1, stream_interval=1, random_seed=556555260, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=True, torch_compile_max_bs=32, cuda_graph_max_bs=8, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)

We used a modified bench_serving script to send longer input sequences.
(screenshot: benchmark results)
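For reference, a single long request can also be sent directly against the server's native /generate endpoint to approximate a longer-sequence workload (assuming the server launched above is listening on port 30000). The prompt length and sampling params below are illustrative placeholders, not the modified benchmark script itself.

```python
# Minimal sketch: send one long request to the running sglang server.
import requests

long_prompt = "word " * 4000  # crude stand-in for a long input sequence

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": long_prompt,
        "sampling_params": {"max_new_tokens": 512, "temperature": 0},
    },
)
print(resp.json())
```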

Environment

The environment information is as follows.
(screenshot: environment info)

@zhyncs
Member

zhyncs commented Dec 31, 2024

Hi @HaiShaw, could you help take a look?
