Proposal to improve performance
I ran a Llama 3 8B inference benchmark on the MI300X with both the V0 and V1 engines, and V1 was noticeably slower at decoding than V0. This is the opposite of what I see on NVIDIA GPUs, where V1 is normally much faster than V0.
One thing I did notice is that V1 does not print the Triton autotune output of the flash attention kernel, which could be related to the attention implementation V1 uses.
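For context, here is a minimal sketch of the kind of decode-heavy comparison I ran. The model ID, prompt, batch size, and sampling parameters below are illustrative, not my exact benchmark setup; the `VLLM_USE_V1` environment variable is what toggles between the two engines.

```python
import os
import time

# Select the engine before vLLM is imported/initialized:
# VLLM_USE_V1=1 -> V1 engine, VLLM_USE_V1=0 -> V0 engine.
os.environ.setdefault("VLLM_USE_V1", "1")

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Decode-heavy workload: short prompts, long greedy generation.
params = SamplingParams(temperature=0.0, max_tokens=512)
prompts = ["Summarize the history of GPUs."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

decoded = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"VLLM_USE_V1={os.environ['VLLM_USE_V1']}: "
      f"{decoded / elapsed:.1f} decode tokens/s")
```

To see which attention backend V1 actually picks on ROCm, the backend selection is logged at engine startup; running with `VLLM_LOGGING_LEVEL=DEBUG` surfaces more detail around that choice.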
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0.dev20250615+rocm6.4
Is debug build : False
CUDA used to build PyTorch : N/A
ROCM used to build PyTorch : 6.4.43482-0f2d60242
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version : Could not collect
cuDNN version : Could not collect
HIP runtime version : 6.4.43482
MIOpen runtime version : 3.4.0
Is XNNPACK available : True
==============================
vLLM Info
==============================
ROCM Version : 6.4.43483-a187df25c
Neuron SDK Version : N/A
vLLM Version : 0.9.2.dev95+g26bc46ef8.d20250616 (git sha: 26bc46ef8, date: 20250616)