Checklist

1. I have searched related issues but could not find the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
We found that performance test results on the latest sglang v0.4.1 are lower than on v0.4.0. The test results are shown below.
By comparing PyTorch profiler results, we found that the cost of the `_fwd_grouped_kernel_stage1` kernel has increased significantly.
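For reference, per-kernel GPU time can be compared across versions with `torch.profiler`. The sketch below is illustrative only: `run_decode_step` is a hypothetical stand-in for the forward call being measured and is not part of sglang's public API.

```python
# Illustrative sketch: compare per-kernel GPU time across versions with
# torch.profiler. `run_decode_step` is a hypothetical stand-in for the
# forward call being measured; it is not part of sglang's public API.
import torch
from torch.profiler import ProfilerActivity, profile


def profile_kernels(run_decode_step, steps: int = 20) -> None:
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(steps):
            run_decode_step()
        torch.cuda.synchronize()
    # Sorting by total GPU time makes a regression such as the one in
    # _fwd_grouped_kernel_stage1 easy to spot side by side.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```

Running this once per version and diffing the two tables is how the kernel-level cost difference stands out.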
Reproduction
The service startup commands are as follows:
[v0.4.1]
NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=0 NVTE_FUSED_ATTN_AOTRITON=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON OPTIMIZE_EPILOGUE=1 HIP_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path /mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype auto --attention-backend triton --sampling-backend pytorch --grammar-backend outlines --trust-remote-code --schedule-conservativeness 0.3 --enable-torch-compile --quantization gptq
WARNING 12-31 03:33:38 rocm.py:31] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
[2024-12-31 03:33:48] server_args=ServerArgs(model_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization='gptq', context_length=None, device='cuda', served_model_name='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', chat_template=None, is_embedding=False, revision=None, return_token_ids=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=0.3, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=371532406, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=True, torch_compile_max_bs=32, cuda_graph_max_bs=8, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[v0.4.0]
NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=0 NVTE_FUSED_ATTN_AOTRITON=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON OPTIMIZE_EPILOGUE=1 HIP_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path /mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype auto --attention-backend triton --sampling-backend pytorch --grammar-backend outlines --trust-remote-code --schedule-conservativeness 0.3 --enable-torch-compile --quantization gptq
WARNING 12-31 04:03:00 rocm.py:31] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
[2024-12-31 04:03:09] server_args=ServerArgs(model_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_path='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization='gptq', context_length=None, device='cuda', served_model_name='/mnt/md0/pkg/Qwen2.5-7B-Instruct-GPTQ-Int8/', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=0.3, cpu_offload_gb=0, tp_size=1, stream_interval=1, random_seed=556555260, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=True, torch_compile_max_bs=32, cuda_graph_max_bs=8, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
We used a modified `bench_serving` script to send longer input sequences; a rough stand-in for one such request is sketched below.
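As a minimal stand-in for that script, a single long-input request against the running server's `/generate` endpoint can be timed like this. The prompt length and sampling parameters below are placeholders, not our exact benchmark configuration.

```python
# Illustrative long-input request against the server started above; the
# prompt length and sampling parameters are placeholders, not our exact
# benchmark configuration.
import time

import requests

prompt = "benchmark " * 1024  # long input sequence

start = time.time()
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": prompt,
        "sampling_params": {"max_new_tokens": 256, "temperature": 0},
    },
)
resp.raise_for_status()
print(f"elapsed: {time.time() - start:.2f}s")
```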
Environment
The environment information is as follows: