Description
System Info
- CPU: Intel Xeon Platinum 8352V (144) @ 3.500GHz, x86_64
- Memory: 1031689 MiB
- GPU: RTX 4090 × 8
- Libraries:
  - tensorrt 10.7.0
  - tensorrt_cu12 10.7.0
  - tensorrt-cu12-bindings 10.7.0
  - tensorrt-cu12-libs 10.7.0
  - tensorrt-llm 0.16.0
- NVIDIA driver version: 550.135 (CUDA Version: 12.4)
- OS: Ubuntu 22.04.5 LTS x86_64
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I ran the trtllm-serve command like this:
```bash
trtllm-serve /home/lz/tensorrt/build/Qwen2.5-7B-Instructtrt_engines/weight_only/1-gpu \
    --tokenizer /home/lz/tensorrt/models/Qwen2.5-7B-Instruct \
    --max_batch_size 128 --max_num_tokens 4096 --max_seq_len 4096 \
    --kv_cache_free_gpu_memory_fraction 0.95
```
But there is no output except:

```
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
```

No errors, no warnings, and no port is ever opened.
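To confirm that nothing is listening, I probe the port with a minimal sketch like the one below. It assumes trtllm-serve's default bind of localhost:8000; adjust the host/port if your server is configured differently.

```python
# Minimal port probe; assumes the default bind of localhost:8000
# (an assumption -- change host/port if the server was configured otherwise).
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    is_open = s.connect_ex(("127.0.0.1", 8000)) == 0

print("port 8000 is listening" if is_open else "port 8000 is not listening")
```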
But the engine itself runs fine with the test script:
```bash
python3 /home/lz/TensorRT-LLM/examples/run.py --input_text "你好,请问你叫什么?" \
    --max_output_len=50 \
    --tokenizer_dir /home/lz/tensorrt/models/Qwen2.5-7B-Instruct \
    --engine_dir=/home/lz/tensorrt/build/Qwen2.5-7B-Instructtrt_engines/weight_only/1-gpu
```

(The input text means "Hello, may I ask what your name is?")
What can I do to get an OpenAI-API-compatible server running?
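For context, this is the kind of client call I expect to work once the server is up. It is a sketch, assuming the default localhost:8000 bind, the openai Python package, and a placeholder model name that may need to match whatever name trtllm-serve registers:

```python
# Hypothetical sanity check against the OpenAI-compatible endpoint.
# Assumptions: the server listens on localhost:8000 and exposes the /v1 routes;
# the model name below is a placeholder, not confirmed against this setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",  # placeholder; must match the served model name
    messages=[{"role": "user", "content": "Hello, what is your name?"}],
    max_tokens=50,
)
print(resp.choices[0].message.content)
```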
Expected behavior
Shouldn't it output more info and then start serving requests?
Actual behavior
Nothing is printed except the version line.
Additional notes
Is this a problem with Qwen2.5-7B? I'd appreciate any help you could give me.