Proposal to improve performance
I ran a Llama 3 8B inference benchmark on the MI300X with both the V0 and V1 engines, and V1 was noticeably slower at decoding than V0. This is the opposite of what I see on NVIDIA GPUs, where V1 is normally much faster than V0.
One thing I did notice is that V1 does not print the Triton autotune output of the flash attention kernel, which could be related to the attention implementation V1 uses.
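For context, here is a minimal sketch of the kind of decode-heavy comparison I ran. The model ID, prompt, batch size, and sampling parameters below are illustrative, not my exact benchmark setup; the `VLLM_USE_V1` environment variable is what toggles between the two engines.

```python
import os
import time

# Select the engine before vLLM is imported/initialized:
# VLLM_USE_V1=1 -> V1 engine, VLLM_USE_V1=0 -> V0 engine.
os.environ.setdefault("VLLM_USE_V1", "1")

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Decode-heavy workload: short prompts, long greedy generation.
params = SamplingParams(temperature=0.0, max_tokens=512)
prompts = ["Summarize the history of GPUs."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

decoded = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"VLLM_USE_V1={os.environ['VLLM_USE_V1']}: "
      f"{decoded / elapsed:.1f} decode tokens/s")
```

To see which attention backend V1 actually picks on ROCm, the backend selection is logged at engine startup; running with `VLLM_LOGGING_LEVEL=DEBUG` surfaces more detail around that choice.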
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0.dev20250615+rocm6.4
Is debug build : False
CUDA used to build PyTorch : N/A
ROCM used to build PyTorch : 6.4.43482-0f2d60242
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version : Could not collect
cuDNN version : Could not collect
HIP runtime version : 6.4.43482
MIOpen runtime version : 3.4.0
Is XNNPACK available : True
==============================
vLLM Info
==============================
ROCM Version : 6.4.43483-a187df25c
Neuron SDK Version : N/A
vLLM Version : 0.9.2.dev95+g26bc46ef8.d20250616 (git sha: 26bc46ef8, date: 20250616)