[Issue] [vLLM]: MoE unit tests failure with AITER on #2153

### Problem Description

The vLLM unit test tests/kernels/moe/test_routing.py::test_grouped_topk fails when VLLM_ROCM_USE_AITER=1 is set, due to mismatches between the actual and baseline values of topk_ids and topk_weights.

Issue 1: The AITER biased_grouped_topk() kernel in /aiter/csrc/kernels/topk_softmax_kernels_group.cu hardcodes isSoftmax=false for all biased calls, so sigmoid is always applied to the logits regardless of which scoring_func is requested. This produces wrong expert weights and causes a mismatch failure in every softmax case.
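To illustrate why Issue 1 produces mismatches, here is a minimal pure-Python sketch (not the AITER kernel). It assumes a simplified biased top-k in which the bias is added to the scores for selection and the returned weights are the unbiased scores. The function names, logits, and bias values are made up for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def biased_topk(scores, bias, k):
    # Assumption of this sketch: select on (score + bias),
    # but report the unbiased score as the expert weight.
    biased = [s + b for s, b in zip(scores, bias)]
    ids = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    return ids, [scores[i] for i in ids]

logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical router logits
bias = [0.0, 0.0, 0.1, 0.0]      # hypothetical correction bias

ids_softmax, w_softmax = biased_topk(softmax(logits), bias, k=2)
ids_sigmoid, w_sigmoid = biased_topk(sigmoid(logits), bias, k=2)

print("softmax:", ids_softmax, w_softmax)
print("sigmoid:", ids_sigmoid, w_sigmoid)
```

With these inputs the two scoring functions select different experts and return different weights. That is the kind of topk_ids and topk_weights mismatch the test reports when sigmoid is applied to a softmax case.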

Issue 2: The biased path computes group scores using a DPP cross-lane reduction. The number of lane_steps is determined by THREAD_PER_GRP = warp_size / num_expert_group, where warp_size = 64. The kernel only implements the reduction for THREAD_PER_GRP values of 2, 4, and 8. For any other value (e.g. num_expert_group=4, which gives THREAD_PER_GRP = 16), lane_steps falls back to 0 and no cross-lane reduction is performed. Each lane then sees only a fraction of the experts in its group, so the group scores are computed from an incomplete subset.
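The effect described in Issue 2 can be modelled on the host with a small Python sketch. This is not the kernel code. It assumes that lane_steps equals log2(THREAD_PER_GRP) for the supported values, that the reduction is a butterfly (XOR-partner) exchange, and that a plain max stands in for the real group score. All names below are hypothetical.

```python
WARP_SIZE = 64

# THREAD_PER_GRP -> lane_steps. Only 2, 4 and 8 are handled (per the issue);
# the log2 mapping itself is an assumption of this sketch.
SUPPORTED_LANE_STEPS = {2: 1, 4: 2, 8: 3}

def lane_steps_for(num_expert_group):
    thread_per_grp = WARP_SIZE // num_expert_group
    # Unsupported values fall through to 0, i.e. no reduction at all.
    return thread_per_grp, SUPPORTED_LANE_STEPS.get(thread_per_grp, 0)

def cross_lane_max(partials, lane_steps):
    # Butterfly reduction: at each step every lane combines its value
    # with the lane whose index differs in one bit.
    vals = list(partials)
    for step in range(lane_steps):
        stride = 1 << step
        vals = [max(vals[i], vals[i ^ stride]) for i in range(len(vals))]
    return vals

for num_expert_group in (8, 4):
    tpg, steps = lane_steps_for(num_expert_group)
    partials = list(range(tpg))  # per-lane partial scores of one group
    print(num_expert_group, tpg, steps, cross_lane_max(partials, steps))
```

For num_expert_group=8 every lane ends up with the group-wide maximum. For num_expert_group=4 (THREAD_PER_GRP = 16) lane_steps is 0, so each lane keeps only its own partial value, matching the incomplete group scores described above.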

### Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

### CPU

AMD EPYC 9655 96-Core Processor

### GPU

MI350

### ROCm Version

6.14.14

### ROCm Component

_No response_

### Steps to Reproduce

Install and build the latest AITER and the latest vLLM. Go to the vllm directory and run the following commands:

export VLLM_ROCM_USE_AITER=1
pytest tests/kernels/moe/test_routing.py::test_grouped_topk -v

The test fails occasionally due to the mismatches described above.

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_
