Description
System Info
CPU: Intel Core i7-14700K
GPU: NVIDIA GeForce RTX 4090 (24 GB)
TensorRT-LLM: 0.13
Docker image: tritonserver:24.09-trtllm-python-py3
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Reference: the python/openai example (openai_frontend/main.py).
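For reference, the launch command (copied from the log below) plus an illustrative request against the OpenAI-compatible endpoint on port 9000; the model name "ensemble" and the prompt are assumptions for reproduction, not part of the original report:

# Start the OpenAI-compatible frontend (command taken from the log below).
python3 /llm/openai/openai_frontend/main.py --backend tensorrtllm \
    --model-repository /llm/tensorrt_llm/model_repo \
    --tokenizer /llm/tensorrt_llm/tokenizer_dir

# Illustrative inference request; the model name "ensemble" is an assumption.
curl http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ensemble", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'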
Expected behavior
When I launch openai/openai_frontend/main.py with the 8-bit quantized ChatGLM4 engine (9.95 GB on disk), I expect GPU memory usage of roughly 12 GB, i.e. the engine weights plus some runtime overhead. Instead, the entire 24 GB of GPU memory fills up during inference.
Actual behavior
root@docker-desktop:/llm/openai# python3 /llm/openai/openai_frontend/main.py --backend tensorrtllm --model-repository /llm/tensorrt_llm/model_repo --tokenizer /llm/tensorrt_llm/tokenizer_dir
I1019 19:28:29.074797 1272 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204c00000' with size 268435456"
I1019 19:28:29.074857 1272 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1019 19:28:29.173311 1272 model_lifecycle.cc:472] "loading: preprocessing:1"
I1019 19:28:29.177154 1272 model_lifecycle.cc:472] "loading: postprocessing:1"
I1019 19:28:29.180251 1272 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I1019 19:28:29.183750 1272 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
I1019 19:28:29.413431 1272 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1019 19:28:29.413459 1272 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1019 19:28:29.413473 1272 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1019 19:28:29.413485 1272 libtensorrtllm.cc:86] "backend configuration:\n{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I1019 19:28:29.416281 1272 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to true
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 40
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 2048
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1019 19:28:31.814801 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I1019 19:28:31.890846 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1019 19:28:31.891695 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1019 19:28:32.878905 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I1019 19:28:33.534115 1272 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
I1019 19:28:33.534477 1272 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 10194 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 196.76 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10184 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 648.06 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 12.22 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 4506
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.00 GiB for max tokens in paged KV cache (288384).
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I1019 19:28:53.616560 1272 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I1019 19:28:53.616756 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I1019 19:28:53.619599 1272 model_lifecycle.cc:472] "loading: ensemble:1"
I1019 19:28:53.619778 1272 model_lifecycle.cc:839] "successfully loaded 'ensemble'"
I1019 19:28:53.619824 1272 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1019 19:28:53.619851 1272 server.cc:631]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","mi |
| | | n-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","mi |
| | | n-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
I1019 19:28:53.619893 1272 server.cc:674]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I1019 19:28:53.644979 1272 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090"
I1019 19:28:53.648626 1272 metrics.cc:770] "Collecting CPU metrics"
I1019 19:28:53.648919 1272 tritonserver.cc:2598]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.50.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_s |
| | hared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /llm/tensorrt_llm/model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Found model: name='ensemble', backend='ensemble'
Found model: name='postprocessing', backend='python'
Found model: name='preprocessing', backend='python'
Found model: name='tensorrt_llm', backend='tensorrtllm'
Found model: name='tensorrt_llm_bls', backend='python'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO: Started server process [1272]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
Additional notes
I'm not sure whether this is controlled by kv_cache_free_gpu_mem_fraction in the tensorrt_llm model configuration. How can the GPU memory usage be limited?
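From the log above, this looks like KV cache pre-allocation rather than a leak: after the 10194 MiB engine is loaded, 12.22 GiB of the 24 GiB card remains free, and the backend reserves kv_cache_free_gpu_mem_fraction (default 0.9) of that free memory for the paged KV cache, i.e. 0.9 x 12.22 GiB ≈ 11 GiB. Engine plus KV cache therefore accounts for roughly 21 GiB. A possible mitigation, assuming the standard tensorrtllm backend layout where this parameter lives in the tensorrt_llm model's config.pbtxt (e.g. /llm/tensorrt_llm/model_repo/tensorrt_llm/config.pbtxt), is to set the fraction explicitly; the 0.3 below is only an illustrative value:

parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    # Illustrative value: reserve 30% of the remaining free GPU memory for the paged KV cache.
    string_value: "0.3"
  }
}

Alternatively, max_tokens_in_paged_kv_cache in the same config caps the cache by token count instead of by fraction.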