Description
System Info
```
tensorrt                 10.9.0.34
tensorrt-cu12            10.9.0.34
tensorrt_cu12_bindings   10.9.0.34
tensorrt_cu12_libs       10.9.0.34
tensorrt-llm             0.19.0
```
```
trtllm-bench --model Qwen/Qwen3-32B --model_path /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/ throughput --backend pytorch --max_batch_size 128 --max_num_tokens 16384 --dataset /root/dataset.txt --kv_cache_free_gpu_mem_fraction 0.9 --extra_llm_api_options /root/extra-llm-api-config.yml --concurrency 128 --num_requests 32768 --streaming
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.19.0
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[06/26/2025-14:53:30] [TRT-LLM] [I] Preparing to run throughput benchmark...
[06/26/2025-14:53:34] [TRT-LLM] [I]
= DATASET DETAILS
Dataset Path: /root/dataset.txt
Number of Sequences: 32768
-- Percentiles statistics ---------------------------------
           Input      Output  Seq. Length
MIN:   1024.0000   1024.0000    2048.0000
MAX:   1024.0000   1024.0000    2048.0000
AVG:   1024.0000   1024.0000    2048.0000
P50:   1024.0000   1024.0000    2048.0000
P90:   1024.0000   1024.0000    2048.0000
P95:   1024.0000   1024.0000    2048.0000
P99:   1024.0000   1024.0000    2048.0000
[06/26/2025-14:53:34] [TRT-LLM] [I] Use user-provided max batch size and max num tokens.
[06/26/2025-14:53:34] [TRT-LLM] [I] Setting PyTorch max sequence length to 2048
[06/26/2025-14:53:34] [TRT-LLM] [I] Setting up throughput benchmark.
[06/26/2025-14:53:34] [TRT-LLM] [W] Overriding pytorch_backend_config because it's specified in /root/extra-llm-api-config.yml
[06/26/2025-14:53:34] [TRT-LLM] [W] Using default gpus_per_node: 8
[06/26/2025-14:53:34] [TRT-LLM] [I] Compute capability: (8, 9)
[06/26/2025-14:53:34] [TRT-LLM] [I] SM count: 128
[06/26/2025-14:53:34] [TRT-LLM] [I] SM clock: 3105 MHz
[06/26/2025-14:53:34] [TRT-LLM] [I] int4 TFLOPS: 813
[06/26/2025-14:53:34] [TRT-LLM] [I] int8 TFLOPS: 406
[06/26/2025-14:53:34] [TRT-LLM] [I] fp8 TFLOPS: 406
[06/26/2025-14:53:34] [TRT-LLM] [I] float16 TFLOPS: 203
[06/26/2025-14:53:34] [TRT-LLM] [I] bfloat16 TFLOPS: 203
[06/26/2025-14:53:34] [TRT-LLM] [I] float32 TFLOPS: 101
[06/26/2025-14:53:34] [TRT-LLM] [I] Total Memory: 23 GiB
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory clock: 10501 MHz
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory bus width: 384
[06/26/2025-14:53:34] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe link width: 16
[06/26/2025-14:53:34] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[06/26/2025-14:53:34] [TRT-LLM] [I] Set nccl_plugin to None.
[06/26/2025-14:53:34] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 384], cuda_graph_max_batch_size=0, cuda_graph_padding_enabled=True, enable_overlap_scheduler=True, moe_max_num_tokens=None, attn_backend='TRTLLM', mixed_decoder=False, enable_trtllm_decoder=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, print_iter_log=True, torch_compile_enabled=False, torch_compile_fullgraph=False, torch_compile_inductor_enabled=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/26/2025-14:53:34] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
[TensorRT-LLM] TensorRT-LLM version: 0.19.0
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[06/26/2025-14:53:44] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/26/2025-14:53:44] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/26/2025-14:53:44] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 768, in _load_model
    model = AutoModelForCausalLM.from_config(config)
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

[06/26/2025-14:53:44] [TRT-LLM] [E] Failed to initialize executor on rank 0: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM
[06/26/2025-14:53:44] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 768, in _load_model
    model = AutoModelForCausalLM.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 24, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 623, in worker_main
    worker: ExecutorBindingsWorker = worker_cls(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 119, in __init__
    self.engine = _create_engine()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py", line 115, in _create_engine
    return create_executor(executor_config=executor_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 69, in create_py_executor
    model_engine = PyTorchModelEngine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 258, in __init__
    self.model = self._load_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 785, in _load_model
    model = AutoModelForCausalLM.from_config(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 24, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-bench", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bench/benchmark/throughput.py", line 289, in throughput_command
    llm = PyTorchLLM(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/llm.py", line 27, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 173, in __init__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 168, in __init__
    self._build_model()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 575, in _build_model
    self._executor = self._executor_cls.create(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/executor.py", line 387, in create
    return ExecutorBindingsProxy(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/proxy.py", line 100, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/proxy.py", line 319, in _start_executor_workers
    raise ready_signal
ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM
```
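For context: the traceback shows `AutoModelForCausalLM.from_config()` in `tensorrt_llm/_torch/models/modeling_auto.py` raising because the checkpoint's `Qwen3ForCausalLM` architecture is not registered with the PyTorch backend in this build. Below is a minimal diagnostic sketch to list what the installed wheel does register; the registry name `MODEL_CLASS_MAPPING` and its module are assumptions (neither appears in the log), so both lookups are guarded:

```python
# Diagnostic sketch, not a documented TensorRT-LLM API. The traceback only
# proves that modeling_auto.AutoModelForCausalLM.from_config() raises for
# unregistered architectures; the registry name below is an assumption.
import importlib

registry = {}
for name in ("tensorrt_llm._torch.models.modeling_utils",
             "tensorrt_llm._torch.models.modeling_auto"):
    try:
        module = importlib.import_module(name)
    except ImportError:
        continue  # module layout may differ across builds
    registry = getattr(module, "MODEL_CLASS_MAPPING", registry) or registry
print(sorted(registry))  # Qwen3ForCausalLM absent here matches the ValueError
```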
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Download the model (the benchmark below points at the default ModelScope cache path):
```shell
modelscope download --model Qwen/Qwen3-32B
```
- Create the extra LLM API options file (the sketch after these steps shows how these keys surface as `PyTorchConfig`):
```shell
cat >/root/extra-llm-api-config.yml <<EOF
pytorch_backend_config:
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
    - 384
  print_iter_log: true
  enable_overlap_scheduler: true
EOF
```
- Generate the dataset:
```shell
python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
  --tokenizer=/path/to/Qwen3-4B \
  --stdout token-norm-dist --num-requests=32768 \
  --input-mean=1024 --output-mean=1024 \
  --input-stdev=0 --output-stdev=0 > /root/dataset.txt
```
- Run the benchmark:
```shell
trtllm-bench --model Qwen/Qwen3-32B --model_path /root/.cache/modelscope/hub/models/Qwen/Qwen3-32B/ throughput --backend pytorch --max_batch_size 128 --max_num_tokens 16384 --dataset /root/dataset.txt --kv_cache_free_gpu_mem_fraction 0.9 --extra_llm_api_options /root/extra-llm-api-config.yml --concurrency 128 --num_requests 32768 --streaming
```
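The `pytorch_backend_config` keys from step 2 correspond one-to-one to the `PyTorchConfig` repr that trtllm-bench prints at startup (see the log above). As a cross-check, here is a minimal sketch of the equivalent object built directly; the field names are copied from that repr, while the import path is an assumption:

```python
# Sketch only: field names come from the PyTorchConfig repr in the benchmark
# log; the module path below is assumed, not confirmed by the log.
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig

pytorch_config = PyTorchConfig(
    use_cuda_graph=True,
    cuda_graph_padding_enabled=True,
    cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 384],
    print_iter_log=True,
    enable_overlap_scheduler=True,
)
```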
Expected behavior
trtllm-bench runs Qwen3-32B successfully.
actual behavior
The executor worker fails with `ValueError: Unknown architecture for AutoModelForCausalLM: Qwen3ForCausalLM` and the benchmark aborts.
additional notes
pass