[Issue]: Build errors for gfx90a (MI250) architecture #179

Open
OscarSavNS opened this issue Mar 5, 2025 · 4 comments

Comments

@OscarSavNS

Problem Description

Building SGLang for the gfx90a (MI250) architecture fails in the Aiter step, even when gfx90a is set as the target architecture. The cause appears to be that fp8 kernels are included in the build. Is there a flag I should pass to disable all fp8 kernels, or some other set of arguments that lets the build go forward on MI250s?

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD EPYC 7713 64-Core Processor

GPU

AMD Instinct MI250X/MI250 - amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

ROCm Version

ROCm 6.3.3

ROCm Component

No response

Steps to Reproduce

git clone git@github.com:sgl-project/sglang.git
cd sglang/docker

In the Aiter build step at line 60 of Dockerfile.rocm, replace

GPU_ARCHS=gfx942

with

GPU_ARCHS=gfx90a

The Docker build then fails at this command:

PREBUILD_KERNELS=1 GPU_ARCHS=gfx90a python3 setup.py develop
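
The one-line change described above can also be scripted when the Dockerfile is patched repeatedly. This is a minimal sketch, not part of SGLang or Aiter: `retarget_gpu_archs` is a hypothetical helper, and it assumes the target is always written as a `GPU_ARCHS=<value>` assignment with no spaces around the `=`.

```python
import re


def retarget_gpu_archs(dockerfile_text: str, new_arch: str) -> str:
    """Return the text with every GPU_ARCHS=<value> assignment set to new_arch."""
    # \S+ matches the existing value (for example gfx942) up to the next whitespace.
    # A function is used as the replacement so new_arch is inserted literally.
    return re.sub(
        r"\bGPU_ARCHS=\S+",
        lambda _match: f"GPU_ARCHS={new_arch}",
        dockerfile_text,
    )


# The Aiter build line from Dockerfile.rocm, retargeted to MI250.
line = "    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop"
print(retarget_gpu_archs(line, "gfx90a"))
```

Applied to the whole file, the same function could read Dockerfile.rocm, rewrite it, and write the result back. It only changes the target; it does not address the fp8 failure itself.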

This seems to be because fp8 kernels, which are not supported on gfx90a, are included in the build. Is there a flag I should pass to disable all fp8 kernels, or a way to have the build work on gfx90a? The fp8 define appears to be hardcoded in some places (although admittedly the following example is from the DeepSeek CSRC kernels):

"'-DENABLE_FP8'"

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

I have multiple GPUs on this node, but here is one of them:

ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========

...

*******
Agent 16
*******
  Name:                    gfx90a
  Uuid:                    GPU-124d13ccd7b050e5
  Marketing Name:          AMD Instinct MI250X/MI250
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    15
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29708(0x740c)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   37632
  Internal Node ID:        15
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
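
For scripts that need to pick a value for GPU_ARCHS automatically, the target can be read from the ISA name lines in output like the above. This is a minimal sketch under stated assumptions: `gfx_targets` is a hypothetical helper, and it assumes the ISA names keep the `amdgcn-amd-amdhsa--gfx...` form shown in this log.

```python
import re


def gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the distinct gfx targets from ISA name lines, in order of first appearance."""
    found = re.findall(r"amdgcn-amd-amdhsa--(gfx[0-9a-z]+)", rocminfo_text)
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(found))


sample = "      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-"
print(gfx_targets(sample))  # prints ['gfx90a']
```

On a real node the text could come from running /opt/rocm/bin/rocminfo through `subprocess.run(..., capture_output=True, text=True)` and passing its `stdout` to the function.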

Additional Information

Full SGLang Dockerfile.rocm:

# Usage (to build SGLang ROCm docker image):
#   docker build --build-arg SGL_BRANCH=v0.4.3.post2 -t v0.4.3.post2-rocm630 -f Dockerfile.rocm .

# default base image
ARG BASE_IMAGE="rocm/vllm-dev:20250114"

FROM $BASE_IMAGE AS base
USER root

WORKDIR /sgl-workspace
ARG BUILD_TYPE=all
ARG SGL_REPO="https://github.com/sgl-project/sglang"
ENV SGL_DEFAULT="main"
ARG SGL_BRANCH=${SGL_DEFAULT}

ARG TRITON_REPO="https://github.com/ROCm/triton.git"
ARG TRITON_COMMIT="improve_fa_decode_3.0.0"


ARG AITER_REPO="https://github.com/ROCm/aiter.git"
ARG AITER_COMMIT="testx"

RUN git clone ${SGL_REPO} \
    && cd sglang \
    && if [ "${SGL_BRANCH}" = ${SGL_DEFAULT} ]; then \
         echo "Using ${SGL_DEFAULT}, default branch."; \
       else \
         echo "Using ${SGL_BRANCH} branch."; \
         git checkout ${SGL_BRANCH}; \
       fi \
    && cd sgl-kernel \
    && python setup_rocm.py install \
    && cd .. \
    && if [ "$BUILD_TYPE" = "srt" ]; then \
         python -m pip --no-cache-dir install -e "python[srt_hip]"; \
       else \
         python -m pip --no-cache-dir install -e "python[all_hip]"; \
       fi

RUN cp -r /sgl-workspace/sglang /sglang
RUN python -m pip cache purge

RUN pip install IPython \
    && pip install orjson \
    && pip install python-multipart \
    && pip install torchao \
    && pip install pybind11

RUN pip uninstall -y triton
RUN git clone ${TRITON_REPO} \
    && cd triton \
    && git checkout ${TRITON_COMMIT} \
    && cd python \
    && python3 setup.py install

RUN git clone ${AITER_REPO} \
    && cd aiter \
    && git checkout ${AITER_COMMIT} \
    && git submodule update --init --recursive \
    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop

# Copy config files to support MI300X in virtualized environments (MI300X_VF).  Symlinks will not be created in image build.
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
         /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
         -type f -name '*MI300X*' | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}

# Performance environment variable.

ENV HIP_FORCE_DEV_KERNARG=1
ENV HSA_NO_SCRATCH_RECLAIM=1
ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV NCCL_MIN_NCHANNELS=112

ENV MOE_PADDING=1
ENV VLLM_FP8_PADDING=1
ENV VLLM_FP8_ACT_PADDING=1
ENV VLLM_FP8_WEIGHT_PADDING=1
ENV VLLM_FP8_REDUCE_CONV=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1

CMD ["/bin/bash"]

@carlushuang
Collaborator

@OscarSavNS Thanks for this report. We are currently in a period of rapid development and are focusing more on the gfx94* series. We are considering a strategy for supporting gfx90a, such as, like you mentioned, simply leaving out fp8.

@OscarSavNS
Author

OscarSavNS commented Mar 6, 2025

@carlushuang Thanks for the quick answer! For now I've commented out the Aiter import in our branch of SGLang, as it's currently (commit 98c73d7) used in two places that don't seem to apply to MI250s: fp8 kernels, and an unquantized fused MoE kernel (which I think also has MI300-specific operations).
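
The workaround of skipping the Aiter import on MI250 could be written as a guard instead of a commented-out line. This is a minimal sketch and not SGLang code: `maybe_import_aiter` and `AITER_ARCH_PREFIXES` are hypothetical names, the package is assumed to import as `aiter`, and the gfx94* restriction is taken from the maintainer's comment above rather than from Aiter documentation.

```python
import importlib

# Assumption: only the gfx94* series is targeted by Aiter at the moment.
AITER_ARCH_PREFIXES = ("gfx94",)


def maybe_import_aiter(arch: str):
    """Return the aiter module on a supported architecture, otherwise None."""
    if not arch.startswith(AITER_ARCH_PREFIXES):
        # For example gfx90a (MI250): do not attempt the import at all.
        return None
    try:
        return importlib.import_module("aiter")
    except ImportError:
        # Aiter is not installed in this environment.
        return None


print(maybe_import_aiter("gfx90a"))  # prints None
```

Call sites would then check the result for `None` and fall back to a non-Aiter code path, which keeps one branch usable on both MI250 and MI300.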

@WissamAntoun

@OscarSavNS Thank you for the guide on how to build on MI250. Did you notice any speedups compared to using lmsysorg/sglang:v0.4.3.post2-rocm630 directly, which was built for MI300?

@OscarSavNS
Author

@WissamAntoun I haven't tried it! I've been pulled away to other stuff for a bit, but am currently focusing on having an SGLang version that works for both MI250 and MI300. I'll try that out for MI300 though and report back. Thanks for the recommendation!
