
TensorRT-LLM does not seem to support Qwen3-32B #5510

Open
@cunfate

Description


System Info

tensorrt 10.9.0.34
tensorrt-cu12 10.9.0.34
tensorrt_cu12_bindings 10.9.0.34
tensorrt_cu12_libs 10.9.0.34
tensorrt-llm 0.19.0

trtllm-bench --model Qwen/Qwen3-32B --model_path /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/ throughput --backend pytorch --max_batch_size 128 --max_num_tokens 16384 --dataset /root/dataset.txt --kv_cache_free_gpu_mem_fraction 0.9 --extra_llm_api_options /root/extra-llm-api-config.yml --concurrency 128 --num_requests 32768 --streaming
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.19.0
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[06/26/2025-14:53:30] [TRT-LLM] [I] Preparing to run throughput benchmark...
[06/26/2025-14:53:34] [TRT-LLM] [I]

= DATASET DETAILS

Dataset Path: /root/dataset.txt
Number of Sequences: 32768

-- Percentiles statistics ---------------------------------

        Input          Output      Seq. Length
MIN:  1024.0000      1024.0000      2048.0000
MAX:  1024.0000      1024.0000      2048.0000
AVG:  1024.0000      1024.0000      2048.0000
P50:  1024.0000      1024.0000      2048.0000
P90:  1024.0000      1024.0000      2048.0000
P95:  1024.0000      1024.0000      2048.0000
P99:  1024.0000      1024.0000      2048.0000

[06/26/2025-14:53:34] [TRT-LLM] [I] Use user-provided max batch size and max num tokens.
[06/26/2025-14:53:34] [TRT-LLM] [I] Setting PyTorch max sequence length to 2048
[06/26/2025-14:53:34] [TRT-LLM] [I] Setting up throughput benchmark.
[06/26/2025-14:53:34] [TRT-LLM] [W] Overriding pytorch_backend_config because it's specified in /root/extra-llm-api-config.yml
[06/26/2025-14:53:34] [TRT-LLM] [W] Using default gpus_per_node: 8
[06/26/2025-14:53:34] [TRT-LLM] [I] Compute capability: (8, 9)
[06/26/2025-14:53:34] [TRT-LLM] [I] SM count: 128
[06/26/2025-14:53:34] [TRT-LLM] [I] SM clock: 3105 MHz
[06/26/2025-14:53:34] [TRT-LLM] [I] int4 TFLOPS: 813
[06/26/2025-14:53:34] [TRT-LLM] [I] int8 TFLOPS: 406
[06/26/2025-14:53:34] [TRT-LLM] [I] fp8 TFLOPS: 406
[06/26/2025-14:53:34] [TRT-LLM] [I] float16 TFLOPS: 203
[06/26/2025-14:53:34] [TRT-LLM] [I] bfloat16 TFLOPS: 203
[06/26/2025-14:53:34] [TRT-LLM] [I] float32 TFLOPS: 101
[06/26/2025-14:53:34] [TRT-LLM] [I] Total Memory: 23 GiB
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory clock: 10501 MHz
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory bus width: 384
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe link width: 16
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[06/26/2025-14:53:34] [TRT-LLM] [I] Set nccl_plugin to None.
[06/26/2025-14:53:34] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 384], cuda_graph_max_batch_size=0, cuda_graph_padding_enabled=True, enable_overlap_scheduler=True, moe_max_num_tokens=None, attn_backend='TRTLLM', mixed_decoder=False, enable_trtllm_decoder=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, print_iter_log=True, torch_compile_enabled=False, torch_compile_fullgraph=False, torch_compile_inductor_enabled=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
[TensorRT-LLM] TensorRT-LLM version: 0.19.0
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[06/26/2025-14:53:44] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/26/2025-14:53:44] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/26/2025-14:53:44] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 768, in _load_model
    model = AutoModelForCausalLM.from_config(config)
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

[06/26/2025-14:53:44] [TRT-LLM] [E] Failed to initialize executor on rank 0: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM
[06/26/2025-14:53:44] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 768, in _load_model
    model = AutoModelForCausalLM.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 24, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 623, in worker_main
    worker: ExecutorBindingsWorker = worker_cls(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 119, in __init__
    self.engine = _create_engine()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 115, in _create_engine
    return create_executor(executor_config=executor_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 69, in create_py_executor
    model_engine = PyTorchModelEngine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 258, in __init__
    self.model = self._load_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 785, in _load_model
    model = AutoModelForCausalLM.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 24, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-bench", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bench/benchmark/throughput.py", line 289, in throughput_command
    llm = PyTorchLLM(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/llm.py", line 27, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 173, in __init__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 168, in __init__
    self._build_model()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 575, in _build_model
    self._executor = self._executor_cls.create(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/executor.py", line 387, in create
    return ExecutorBindingsProxy(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/proxy.py", line 100, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/proxy.py", line 319, in _start_executor_workers
    raise ready_signal
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. modelscope download --model Qwen/Qwen3-32B (cached at /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/, the model_path used below)
  2. cat > /root/extra-llm-api-config.yml <<EOF
     pytorch_backend_config:
       use_cuda_graph: true
       cuda_graph_padding_enabled: true
       cuda_graph_batch_sizes:
         - 1
         - 2
         - 4
         - 8
         - 16
         - 32
         - 64
         - 128
         - 256
         - 384
       print_iter_log: true
       enable_overlap_scheduler: true
     EOF
  3. python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
       --tokenizer=/path/to/Qwen3-4B \
       --stdout token-norm-dist --num-requests=32768 \
       --input-mean=1024 --output-mean=1024 --input-stdev=0 --output-stdev=0 > /root/dataset.txt
  4. trtllm-bench --model Qwen/Qwen3-32B --model_path /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/ throughput --backend pytorch --max_batch_size 128 --max_num_tokens 16384 --dataset /root/dataset.txt --kv_cache_free_gpu_mem_fraction 0.9 --extra_llm_api_options /root/extra-llm-api-config.yml --concurrency 128 --num_requests 32768 --streaming
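
For reference, the failure is not specific to trtllm-bench. A minimal sketch along the lines below (assuming the 0.19 PyTorch-backend LLM class is importable as tensorrt_llm._torch.LLM, as in the examples/pytorch quickstart; the model path is the same ModelScope cache used above) fails during construction with the same error:

    from tensorrt_llm import SamplingParams
    from tensorrt_llm._torch import LLM  # PyTorch-backend LLM; import path is an assumption for 0.19

    # Constructing the LLM loads the checkpoint; this is where
    # "Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM" is raised.
    llm = LLM(model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/")
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)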

Expected behavior

Qwen3-32B runs successfully.

actual behavior

Raises an exception: "Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM".
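
As a rough check (not an official API, just listing the modeling_* modules shipped with the PyTorch backend), the installed wheel can be inspected for a Qwen3 implementation:

    import pkgutil
    import tensorrt_llm._torch.models as models

    # List the modeling_* files bundled with the PyTorch backend. On this
    # 0.19.0 install no Qwen3 entry shows up, which is consistent with the
    # "Unknown architecture" error above.
    print(sorted(m.name for m in pkgutil.iter_modules(models.__path__)
                 if m.name.startswith("modeling_")))

Qwen3 (Qwen3ForCausalLM) support appears to have landed in TensorRT-LLM after the 0.19 release, so a newer wheel or a build from main is likely needed.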

additional notes

pass
