-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature, Hardware] Enable DeepseekV3 on AMD GPUs #2601
Conversation
@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first? |
That would be nice. I plan to release v0.4.1.post1 soon to enable users to use AMD MI300X initially. |
Hi @BruceXcluding May you help fix the failed CIs ref https://github.com/sgl-project/sglang/blob/main/docs/references/contributor_guide.md#format-your-code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceXcluding
Some to address, thanks!
docker/Dockerfile.rocm
Outdated
@@ -37,7 +37,7 @@ ENV SGLANG_SET_CPU_AFFINITY=1 | |||
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 | |||
ENV NCCL_MIN_NCHANNELS=112 | |||
|
|||
ENV MOE_PADDING=1 | |||
ENV MOE_PADDING=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to keep MOE_PADDING on for performance, to error it incurs we need to fix it.
@@ -402,7 +402,7 @@ def _decode_grouped_att_m_fwd( | |||
sm_scale, | |||
logit_cap, | |||
): | |||
BLOCK = 32 | |||
BLOCK = 16 if is_hip() else 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not cut by half for HIP globally here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't work well in latest vllm with BLOCK 32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part we can not take as it is - it will cost performance of all other models in large margin.
@@ -217,7 +217,7 @@ def create_weights( | |||
|
|||
# WEIGHT | |||
weight_dtype = ( | |||
torch.float8_e4m3fn | |||
torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not have this, serialized weight is always OCP (torch.float8_e4m3fn)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would encounter the error "python/sglang/srt/layers/quantization/fp8_kernel.py:176:33: error: Unsupported conversion from 'f8E4M3FN' to 'f16'
accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]" with torch.float8_e4m3fn
at w8a8_block_fp8_matmul
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check how normalize_e4m3fn_to_e4m3fnuz
is used.
Basically - we do not expected non-OCP/e4m3fn dtype in the quantized model.
@@ -430,7 +432,7 @@ def get_default_config( | |||
dtype: Optional[str], | |||
is_marlin: bool, | |||
) -> Dict[str, int]: | |||
if dtype == "fp8_w8a8": | |||
if dtype == "fp8_w8a8" and not is_hip(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
following block isn't a breaker to HIP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceXcluding
Also see this error below with your version of pyproject.toml
:
File "/dockerx/1226/HS/sglang/python/sglang/srt/constrained/outlines_backend.py", line 23, in <module>
from outlines.fsm.json_schema import build_regex_from_schema
ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema' (/usr/local/lib/python3.12/dist-packages/outlines/fsm/json_schema.py)
the CI failure: PR Test / unit-test-backend-2-gpu, used a lite model 'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', which doesn't has fp8 block-level quant feature |
@@ -217,7 +217,7 @@ def create_weights( | |||
|
|||
# WEIGHT | |||
weight_dtype = ( | |||
torch.float8_e4m3fn | |||
torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check how normalize_e4m3fn_to_e4m3fnuz
is used.
Basically - we do not expected non-OCP/e4m3fn dtype in the quantized model.
@@ -432,7 +432,7 @@ def create_weights( | |||
from sglang.srt.layers.moe.fused_moe_triton import FusedMoeWeightScaleSupported | |||
|
|||
if self.quant_config.is_checkpoint_fp8_serialized: | |||
params_dtype = torch.float8_e4m3fn | |||
params_dtype = torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same problem here - check out the previous usage from normalize_e4m3fn_to_e4m3fnuz
sgl-kernel/setup.py
Outdated
|
||
def is_hip() -> bool: | ||
"""Return whether it is HIP on the AMD ROCm platform.""" | ||
return torch.version.hip is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return torch.version.hip is not None | |
return torch.cuda.is_available() and torch.version.hip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python/pyproject.toml
Outdated
@@ -27,7 +27,7 @@ srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cu | |||
|
|||
# HIP (Heterogeneous-computing Interface for Portability) for AMD | |||
# => base docker rocm/vllm-dev:20241022, not from public vllm whl | |||
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"] | |||
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.post2.dev1+g1ef171e0.rocm624"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What issues could occur if the image isn't updated? Minimize updating the base image whenever possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhyncs we (AMD) will have to decide on this, so ignore it for now.
sgl-kernel/setup.py
Outdated
"sgl_kernel.ops.moe_align_block_size", | ||
[ | ||
"src/sgl-kernel/csrc/moe_align_kernel.cu", | ||
"src/sgl-kernel/csrc/sgl_kernel_ops.cu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need to use AMD for compilation, I recommend not compiling sgl_kernel_ops.cu
directly. Instead, use a separate file to avoid mixing NVIDIA and AMD's cu files, it's better to keep them separate. cc @HaiShaw @ispobock @merrymercy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any suggestions? @yzh119
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems need to compile reduce kernel here, otherwise some archs will not be imported due to No module named 'sgl_kernel.ops._kernels'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May we use is_hip there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhyncs In case CUDA/HIP compatible kernel files, we don't use separate files (that is point of HIP), I believe that is one of the cases. We do for sure separate files for AMD specific kernels or kernel implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zyeric the else:
case seemingly have no impact to NV side, can you be more specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's better to separate amd/nv kernels as 2 different backends? at this moment, moe_align_kernel is only required for amd backend, while in near future, there are ck kernels added to amd backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HaiShaw I think the root cause is that the import path is still sgl_kernel.ops._kernels
at https://github.com/BruceXcluding/sglang/blob/main/sgl-kernel/src/sgl-kernel/ops/__init__.py#L1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current version works for me, many thanks :D
Accuracy: 0.951
Invalid: 0.000
Latency: 160.916 s
Output throughput: 869.145 token/s
@AdjectiveAllison Can you try with the latest instruction |
python/sglang/srt/server.py
Outdated
@@ -578,8 +578,9 @@ def _set_envs_and_config(server_args: ServerArgs): | |||
os.environ["NCCL_NVLS_ENABLE"] = "0" | |||
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1" | |||
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4" | |||
if "GLOO_SOCKET_IFNAME" not in os.environ: | |||
os.environ["GLOO_SOCKET_IFNAME"] = "eth0" | |||
# TODO(fix socket error with gpu backend) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used for cpu backend or specific workstation? get RuntimeError: [enforce fail at pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used for multi-node tensor parallelism. Instead of using comments, we suggest adding an is_hip flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the value set for the GLOO_SOCKET_IFNAME environment variable should depend on the name of the network interface card in each user's system and should not be hard-coded as eth0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wufann If the user's value is not eth0, they should specify it explicitly, this applies only when no setting is provided, with eth0 as the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhyncs Different network interface ( "ens" ) may be used. Also they may test in a single node envrionment where IB is not configured. In that case IB should be disabled
sgl-kernel/amd/CMakeLists.txt
Outdated
@@ -0,0 +1,51 @@ | |||
cmake_minimum_required(VERSION 3.18) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this, we only use CMakeLists.txt for clangd indexing, so it's not necessary.
sgl-kernel/amd/pyproject.toml
Outdated
build-backend = "setuptools.build_meta" | ||
|
||
[project] | ||
name = "sgl-kernel" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we refer to the setup of flash-attention or vllm compatible with NVIDIA and AMD?
https://github.com/Dao-AILab/flash-attention/blob/main/setup.py
https://github.com/vllm-project/vllm/blob/main/setup.py
Hi @BruceXcluding @HaiShaw |
Tested and works well. We could build sgl-kernel-amd after we add ck kernels |
@BruceXcluding, How was the performance comparing to |
Hi @BruceXcluding @HaiShaw |
@zhyncs I am expecting @BruceXcluding to do the final update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceXcluding thanks!
Thanks @zhyncs @HaiShaw. we will keep the TODO list on track for performance improvement. |
Co-authored-by: root <[email protected]> Co-authored-by: HAI <[email protected]> Co-authored-by: Bruce Xue <[email protected]> Co-authored-by: Yineng Zhang <[email protected]>
Yes theoretical throughput is
There are spaces to improve. |
Motivation
Modifications
e4m3fnuz
to support DeepseekV3 FP8 modelFlashInfer backend bmm_fp8
to cast FP8 to BF16 in MLAtriton stages
configTODO
How to run
build env
offline:
server:
Issues
raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
, same with [Bug] Deepseek-v2-lite AMD MI300 run failed #2384Solved with
python/sglang/srt/layers/attention/triton_ops/decode_attention.py +410
ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema'
, same with [Bug] SGLang v0.4.0 with AMD MI300X #2530Solved with
downgrade vllm
Solved with ifconfig check your eth number and
export GLOO_SOCKET_IFNAME=your eth
Checklist