[Issue] [vLLM]: MoE unit tests failure with AITER on #2153

@danichan-mkm

Problem Description

Running the vLLM unit test tests/kernels/moe/test_routing.py::test_grouped_topk with VLLM_ROCM_USE_AITER=1 fails due to mismatches between the actual and baseline values of topk_ids and topk_weights.

Issue 1: The AITER biased_grouped_topk() kernel in /aiter/csrc/kernels/topk_softmax_kernels_group.cu hardcodes isSoftmax=false for all biased calls, so sigmoid is always applied to the logits regardless of which scoring_func is requested. This produces wrong expert weights and a mismatch failure in every softmax case.
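To illustrate why the scoring function matters, here is a minimal pure-Python sketch, not the AITER kernel. It assumes the bias is added to the scores after the scoring function and that the returned weight is the unbiased score. The helper top1_expert and the logits and bias values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def top1_expert(logits, bias, scoring):
    """Hypothetical helper: select by biased score, return (id, unbiased weight)."""
    scores = scoring(logits)
    biased = [s + b for s, b in zip(scores, bias)]
    best = max(range(len(biased)), key=biased.__getitem__)
    return best, scores[best]

logits = [2.0, 0.0]
bias = [0.0, 0.5]  # hypothetical correction bias, not taken from the test

# What a softmax caller expects versus what an always-sigmoid kernel returns.
expected_id, expected_weight = top1_expert(logits, bias, softmax)
actual_id, actual_weight = top1_expert(logits, bias, sigmoid)
print(expected_id, expected_weight)
print(actual_id, actual_weight)
```

Under these assumptions the two scoring functions select different experts and return different weights, which is the kind of topk_ids and topk_weights mismatch the test reports.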

Issue 2: The biased path computes group scores using a DPP cross-lane reduction. The number of lane_steps is determined by THREAD_PER_GRP = warp_size / num_expert_group, where warp_size = 64. The kernel only implements the reduction for THREAD_PER_GRP values of 2, 4, and 8. For any other value (for example num_expert_group=4, which gives THREAD_PER_GRP = 16), lane_steps falls to 0 and no cross-lane reduction is performed. Each lane then sees only a fraction of the experts in its group, so the group scores are computed from an incomplete subset.
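The arithmetic above can be checked with a short Python sketch. It only models the THREAD_PER_GRP calculation described in this issue, not the DPP reduction itself. The function names are hypothetical, and the set of supported values is taken from the description above.

```python
WARP_SIZE = 64
# THREAD_PER_GRP values for which the kernel is reported to implement the reduction.
SUPPORTED_THREAD_PER_GRP = (2, 4, 8)

def thread_per_grp(num_expert_group, warp_size=WARP_SIZE):
    """Lanes assigned to each expert group within one warp."""
    return warp_size // num_expert_group

def has_cross_lane_reduction(num_expert_group):
    """True if the reported kernel would perform a cross-lane reduction."""
    return thread_per_grp(num_expert_group) in SUPPORTED_THREAD_PER_GRP

for groups in (4, 8, 16, 32):
    print(groups, thread_per_grp(groups), has_cross_lane_reduction(groups))
```

With warp_size = 64, num_expert_group values of 8, 16, and 32 map to supported THREAD_PER_GRP values, while num_expert_group=4 maps to 16 and falls outside them.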

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9655 96-Core Processor

GPU

MI350

ROCm Version

6.14.14

ROCm Component

No response

Steps to Reproduce

Install and build the latest AITER and the latest vLLM. Then, from the vllm directory, run the following commands:

export VLLM_ROCM_USE_AITER=1
pytest tests/kernels/moe/test_routing.py::test_grouped_topk -v

The test fails intermittently due to mismatches between the actual and baseline values.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
