
TensorRT-LLM does not seem to support Qwen3-32B #5510

Open
@cunfate

Description


System Info

tensorrt 10.9.0.34
tensorrt-cu12 10.9.0.34
tensorrt_cu12_bindings 10.9.0.34
tensorrt_cu12_libs 10.9.0.34
tensorrt-llm 0.19.0

trtllm-bench --model Qwen/Qwen3-32B --model_path /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/ throughput --backend pytorch --max_batch_size 128 --max_num_tokens 16384 --dataset /root/dataset.txt --kv_cache_free_gpu_mem_fraction 0.9 --extra_llm_api_options /root/extra-llm-api-config.yml --concurrency 128 --num_requests 32768 --streaming
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.19.0
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[06/26/2025-14:53:30] [TRT-LLM] [I] Preparing to run throughput benchmark...
[06/26/2025-14:53:34] [TRT-LLM] [I]

= DATASET DETAILS

Dataset Path: /root/dataset.txt
Number of Sequences: 32768

-- Percentiles statistics ---------------------------------

        Input          Output      Seq. Length
MIN:  1024.0000      1024.0000      2048.0000
MAX:  1024.0000      1024.0000      2048.0000
AVG:  1024.0000      1024.0000      2048.0000
P50:  1024.0000      1024.0000      2048.0000
P90:  1024.0000      1024.0000      2048.0000
P95:  1024.0000      1024.0000      2048.0000
P99:  1024.0000      1024.0000      2048.0000

[06/26/2025-14:53:34] [TRT-LLM] [I] Use user-provided max batch size and max num tokens.
[06/26/2025-14:53:34] [TRT-LLM] [I] Setting PyTorch max sequence length to 2048
[06/26/2025-14:53:34] [TRT-LLM] [I] Setting up throughput benchmark.
[06/26/2025-14:53:34] [TRT-LLM] [W] Overriding pytorch_backend_config because it's specified in /root/extra-llm-api-config.yml
[06/26/2025-14:53:34] [TRT-LLM] [W] Using default gpus_per_node: 8
[06/26/2025-14:53:34] [TRT-LLM] [I] Compute capability: (8, 9)
[06/26/2025-14:53:34] [TRT-LLM] [I] SM count: 128
[06/26/2025-14:53:34] [TRT-LLM] [I] SM clock: 3105 MHz
[06/26/2025-14:53:34] [TRT-LLM] [I] int4 TFLOPS: 813
[06/26/2025-14:53:34] [TRT-LLM] [I] int8 TFLOPS: 406
[06/26/2025-14:53:34] [TRT-LLM] [I] fp8 TFLOPS: 406
[06/26/2025-14:53:34] [TRT-LLM] [I] float16 TFLOPS: 203
[06/26/2025-14:53:34] [TRT-LLM] [I] bfloat16 TFLOPS: 203
[06/26/2025-14:53:34] [TRT-LLM] [I] float32 TFLOPS: 101
[06/26/2025-14:53:34] [TRT-LLM] [I] Total Memory: 23 GiB
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory clock: 10501 MHz
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory bus width: 384
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe link width: 16
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[06/26/2025-14:53:34] [TRT-LLM] [I] Set nccl_plugin to None.
[06/26/2025-14:53:34] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 384], cuda_graph_max_batch_size=0, cuda_graph_padding_enabled=True, enable_overlap_scheduler=True, moe_max_num_tokens=None, attn_backend='TRTLLM', mixed_decoder=False, enable_trtllm_decoder=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, print_iter_log=True, torch_compile_enabled=False, torch_compile_fullgraph=False, torch_compile_inductor_enabled=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
[TensorRT-LLM] TensorRT-LLM version: 0.19.0
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[06/26/2025-14:53:44] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/26/2025-14:53:44] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/26/2025-14:53:44] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 768, in _load_model
    model = AutoModelForCausalLM.from_config(config)
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

[06/26/2025-14:53:44] [TRT-LLM] [E] Failed to initialize executor on rank 0: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM
[06/26/2025-14:53:44] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 768, in _load_model
    model = AutoModelForCausalLM.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 24, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 623, in worker_main
    worker: ExecutorBindingsWorker = worker_cls(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 119, in __init__
    self.engine = _create_engine()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 115, in _create_engine
    return create_executor(executor_config=executor_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 69, in create_py_executor
    model_engine = PyTorchModelEngine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 258, in __init__
    self.model = self._load_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 785, in _load_model
    model = AutoModelForCausalLM.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 24, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-bench", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bench/benchmark/throughput.py", line 289, in throughput_command
    llm = PyTorchLLM(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/llm.py", line 27, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 173, in __init__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 168, in __init__
    self._build_model()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 575, in _build_model
    self._executor = self._executor_cls.create(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/executor.py", line 387, in create
    return ExecutorBindingsProxy(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/proxy.py", line 100, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/proxy.py", line 319, in _start_executor_workers
    raise ready_signal
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. modelscope download --model Qwen/Qwen3-32B (cached at /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/, the model_path used below)
  2. cat > /root/extra-llm-api-config.yml <<EOF
     pytorch_backend_config:
       use_cuda_graph: true
       cuda_graph_padding_enabled: true
       cuda_graph_batch_sizes:
         - 1
         - 2
         - 4
         - 8
         - 16
         - 32
         - 64
         - 128
         - 256
         - 384
       print_iter_log: true
       enable_overlap_scheduler: true
     EOF
  3. python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
       --tokenizer=/path/to/Qwen3-4B \
       --stdout token-norm-dist --num-requests=32768 \
       --input-mean=1024 --output-mean=1024 --input-stdev=0 --output-stdev=0 > /root/dataset.txt
  4. trtllm-bench --model Qwen/Qwen3-32B --model_path /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/ throughput --backend pytorch --max_batch_size 128 --max_num_tokens 16384 --dataset /root/dataset.txt --kv_cache_free_gpu_mem_fraction 0.9 --extra_llm_api_options /root/extra-llm-api-config.yml --concurrency 128 --num_requests 32768 --streaming
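
For reference, the failure is not specific to trtllm-bench. A minimal sketch along the lines below (assuming the 0.19 PyTorch-backend LLM class is importable as tensorrt_llm._torch.LLM, as in the examples/pytorch quickstart; the model path is the same ModelScope cache used above) fails during construction with the same error:

    from tensorrt_llm import SamplingParams
    from tensorrt_llm._torch import LLM  # PyTorch-backend LLM; import path is an assumption for 0.19

    # Constructing the LLM loads the checkpoint; this is where
    # "Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM" is raised.
    llm = LLM(model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/")
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)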

Expected behavior

Qwen3-32B runs successfully.

actual behavior

Raises an exception: "Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM".
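
As a rough check (not an official API, just listing the modeling_* modules shipped with the PyTorch backend), the installed wheel can be inspected for a Qwen3 implementation:

    import pkgutil
    import tensorrt_llm._torch.models as models

    # List the modeling_* files bundled with the PyTorch backend. On this
    # 0.19.0 install no Qwen3 entry shows up, which is consistent with the
    # "Unknown architecture" error above.
    print(sorted(m.name for m in pkgutil.iter_modules(models.__path__)
                 if m.name.startswith("modeling_")))

Qwen3 (Qwen3ForCausalLM) support appears to have landed in TensorRT-LLM after the 0.19 release, so a newer wheel or a build from main is likely needed.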

additional notes

pass
