
[Bug] Deterministic inference is not supported for Qwen3.5 next 400B FP8 #20509

@Mikechen-0105

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

I launched a Qwen3.5 next 400B model with FP8 quantization and TP=8, then ran the
python3 -m sglang.test.test_deterministic --n-trials 50 --test-mode single test.
The outputs are not identical between batch sizes 32 and 33: the generated content diverges partway through the completion.

[Screenshot: diverging outputs between batch sizes]

Reproduction

The server is launched with:
python -m sglang.launch_server --model-path <model_path> --tensor-parallel-size 8 --attention-backend flashinfer --quantization fp8 --enable-deterministic-inference --disable-cuda-graph
The test is then run directly:
python3 -m sglang.test.test_deterministic --n-trials 50 --test-mode single
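To pin down where two runs stop agreeing (the "differs in the middle" symptom above), a small helper can locate the first diverging character between the completions returned at the two batch sizes. This is a minimal sketch for illustration only; `first_divergence` is a hypothetical helper, not part of sglang's test_deterministic module.

```python
def first_divergence(out_a: str, out_b: str) -> int:
    """Return the index of the first differing character between two
    completions, or -1 if they are identical. A length mismatch with a
    common prefix diverges at the end of the shorter string."""
    for i, (ca, cb) in enumerate(zip(out_a, out_b)):
        if ca != cb:
            return i
    if len(out_a) != len(out_b):
        return min(len(out_a), len(out_b))
    return -1

# Example: two completions that agree on a prefix, then diverge mid-output
a = "The capital of France is Paris, a city known for"
b = "The capital of France is Paris, a town known for"
print(first_divergence(a, b))  # -> 34
```

Comparing the divergence index against the prompt length distinguishes a prefill-stage mismatch from a decode-stage one, which can help narrow down which kernel loses batch invariance.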

Environment

Python: 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20-3e
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.133.20
PyTorch: 2.9.1+cu128
sglang: 0.5.8.post1
sgl_kernel: 0.3.21
flashinfer_python: 0.6.1
flashinfer_cubin: Module Not Found
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.4.2
aiohttp: 3.13.3
fastapi: 0.128.6
hf_transfer: 0.1.9
huggingface_hub: 0.36.2
interegular: 0.3.3
modelscope: 1.34.0
orjson: 3.11.7
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.40.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.79.0
litellm: Module Not Found
decord2: 3.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PHB SYS SYS 0-79 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PXB PHB SYS SYS 0-79 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 PHB PIX SYS SYS 0-79 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 PHB PXB SYS SYS 0-79 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS PIX PHB 80-159 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS PXB PHB 80-159 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS PHB PIX 80-159 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS PHB PXB 80-159 1 N/A
NIC0 PIX PXB PHB PHB SYS SYS SYS SYS X PHB SYS SYS
NIC1 PHB PHB PIX PXB SYS SYS SYS SYS PHB X SYS SYS
NIC2 SYS SYS SYS SYS PIX PXB PHB PHB SYS SYS X PHB
NIC3 SYS SYS SYS SYS PHB PHB PIX PXB SYS SYS PHB X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3

Hypervisor vendor: KVM
ulimit soft: 102400
