Description
Problem Description
Running the vLLM unit test tests/kernels/moe/test_routing.py::test_grouped_topk with VLLM_ROCM_USE_AITER=1 fails because the actual topk_ids and topk_weights do not match the baseline.
Issue 1: The AITER biased_grouped_topk() kernel in /aiter/csrc/kernels/topk_softmax_kernels_group.cu hardcodes isSoftmax=false for all biased calls, so sigmoid is always applied to the logits regardless of which scoring_func is requested. This produces wrong expert weights, and every softmax case fails with a mismatch.
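To illustrate why Issue 1 leads to weight mismatches, here is a minimal pure-Python sketch that scores the same logits with softmax and with sigmoid. It is not the AITER kernel or the vLLM baseline; the function names and the sample logits are made up for illustration only.

```python
import math

def softmax(logits):
    # Numerically stable softmax: scores are normalized to sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    # Element-wise sigmoid: each score is independent and lies in (0, 1).
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical router logits for four experts (illustrative values only).
logits = [2.0, 1.0, 0.0, -1.0]

softmax_scores = softmax(logits)
sigmoid_scores = sigmoid(logits)

# Largest per-expert gap between the two scoring functions.
max_abs_diff = max(abs(a - b) for a, b in zip(softmax_scores, sigmoid_scores))

print("softmax:", softmax_scores)
print("sigmoid:", sigmoid_scores)
print("max abs diff:", max_abs_diff)
```

The two functions give clearly different weights for the same logits, so a kernel that applies sigmoid when the caller asked for softmax cannot match a softmax baseline within a tight tolerance.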
Issue 2: The biased path computes each group score using a DPP cross-lane reduction. The number of lane_steps is determined by THREAD_PER_GRP = warp_size / num_expert_group, where warp_size = 64. The kernel only implements the reduction for THREAD_PER_GRP values of 2, 4, and 8. For any other value (for example num_expert_group=4, which gives THREAD_PER_GRP=16), lane_steps falls to 0 and no cross-lane reduction is performed. Each lane then sees only a fraction of the experts in its group, so the group scores are computed from an incomplete subset.
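The arithmetic behind Issue 2 can be checked with a short Python sketch. It only mirrors the formula and the supported values stated above; the constant and function names are hypothetical and do not come from the kernel source.

```python
WARP_SIZE = 64                         # warp size stated in this issue
SUPPORTED_THREAD_PER_GRP = (2, 4, 8)   # values with a reduction, per this issue

def thread_per_grp(num_expert_group):
    # THREAD_PER_GRP = warp_size / num_expert_group
    return WARP_SIZE // num_expert_group

def has_cross_lane_reduction(num_expert_group):
    # True only when THREAD_PER_GRP is one of the implemented cases.
    return thread_per_grp(num_expert_group) in SUPPORTED_THREAD_PER_GRP

for n in (1, 2, 4, 8, 16, 32):
    status = "reduced" if has_cross_lane_reduction(n) else "NO reduction"
    print(f"num_expert_group={n:2d} THREAD_PER_GRP={thread_per_grp(n):2d} {status}")
```

Under these assumptions, only num_expert_group values of 8, 16, and 32 map to a supported THREAD_PER_GRP, while 1, 2, and 4 fall through to the case with no reduction.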
Operating System
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
CPU
AMD EPYC 9655 96-Core Processor
GPU
MI350
ROCm Version
6.14.14
ROCm Component
No response
Steps to Reproduce
Install and build the latest AITER and the latest vLLM. From the vllm directory, run the following commands:
export VLLM_ROCM_USE_AITER=1
pytest tests/kernels/moe/test_routing.py::test_grouped_topk -v
The test fails intermittently with mismatches in topk_ids and topk_weights.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response