Description
System Info
CPU: Intel Core i7-14700K
GPU: NVIDIA GeForce RTX 4090 (24 GB)
TensorRT-LLM: 0.13
Docker image: tritonserver:24.09-trtllm-python-py3
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Reference: the python/openai example (openai_frontend/main.py).
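For reference, the launch command (copied from the log below) plus an illustrative request against the OpenAI-compatible endpoint on port 9000; the model name "ensemble" and the prompt are assumptions for reproduction, not part of the original report:

# Start the OpenAI-compatible frontend (command taken from the log below).
python3 /llm/openai/openai_frontend/main.py --backend tensorrtllm \
    --model-repository /llm/tensorrt_llm/model_repo \
    --tokenizer /llm/tensorrt_llm/tokenizer_dir

# Illustrative inference request; the model name "ensemble" is an assumption.
curl http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ensemble", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'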
Expected behavior
When I launch openai/openai_frontend/main.py with the 8-bit quantized ChatGLM4 engine (9.95 GB on disk), I expect GPU memory usage of roughly 12 GB, i.e. the engine weights plus some runtime overhead. Instead, the entire 24 GB of GPU memory fills up during inference.
Actual behavior
root@docker-desktop:/llm/openai# python3 /llm/openai/openai_frontend/main.py --backend tensorrtllm --model-repository /llm/tensorrt_llm/model_repo --tokenizer /llm/tensorrt_llm/tokenizer_dir
I1019 19:28:29.074797 1272 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204c00000' with size 268435456"
I1019 19:28:29.074857 1272 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1019 19:28:29.173311 1272 model_lifecycle.cc:472] "loading: preprocessing:1"
I1019 19:28:29.177154 1272 model_lifecycle.cc:472] "loading: postprocessing:1"
I1019 19:28:29.180251 1272 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I1019 19:28:29.183750 1272 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
I1019 19:28:29.413431 1272 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1019 19:28:29.413459 1272 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1019 19:28:29.413473 1272 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1019 19:28:29.413485 1272 libtensorrtllm.cc:86] "backend configuration:\n{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I1019 19:28:29.416281 1272 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to true
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 40
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 2048
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1019 19:28:31.814801 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I1019 19:28:31.890846 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1019 19:28:31.891695 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1019 19:28:32.878905 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I1019 19:28:33.534115 1272 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
I1019 19:28:33.534477 1272 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 10194 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 196.76 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10184 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 648.06 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 12.22 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 4506
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.00 GiB for max tokens in paged KV cache (288384).
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I1019 19:28:53.616560 1272 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I1019 19:28:53.616756 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I1019 19:28:53.619599 1272 model_lifecycle.cc:472] "loading: ensemble:1"
I1019 19:28:53.619778 1272 model_lifecycle.cc:839] "successfully loaded 'ensemble'"
I1019 19:28:53.619824 1272 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1019 19:28:53.619851 1272 server.cc:631]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","mi |
| | | n-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","mi |
| | | n-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
I1019 19:28:53.619893 1272 server.cc:674]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I1019 19:28:53.644979 1272 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090"
I1019 19:28:53.648626 1272 metrics.cc:770] "Collecting CPU metrics"
I1019 19:28:53.648919 1272 tritonserver.cc:2598]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.50.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_s |
| | hared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /llm/tensorrt_llm/model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Found model: name='ensemble', backend='ensemble'
Found model: name='postprocessing', backend='python'
Found model: name='preprocessing', backend='python'
Found model: name='tensorrt_llm', backend='tensorrtllm'
Found model: name='tensorrt_llm_bls', backend='python'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO: Started server process [1272]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
Additional notes
I'm not sure whether this is controlled by kv_cache_free_gpu_mem_fraction in the tensorrt_llm model configuration. How can the GPU memory usage be limited?
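From the log above, this looks like KV cache pre-allocation rather than a leak: after the 10194 MiB engine is loaded, 12.22 GiB of the 24 GiB card remains free, and the backend reserves kv_cache_free_gpu_mem_fraction (default 0.9) of that free memory for the paged KV cache, i.e. 0.9 x 12.22 GiB ≈ 11 GiB. Engine plus KV cache therefore accounts for roughly 21 GiB. A possible mitigation, assuming the standard tensorrtllm backend layout where this parameter lives in the tensorrt_llm model's config.pbtxt (e.g. /llm/tensorrt_llm/model_repo/tensorrt_llm/config.pbtxt), is to set the fraction explicitly; the 0.3 below is only an illustrative value:

parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    # Illustrative value: reserve 30% of the remaining free GPU memory for the paged KV cache.
    string_value: "0.3"
  }
}

Alternatively, max_tokens_in_paged_kv_cache in the same config caps the cache by token count instead of by fraction.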