System Info
CPU: Intel Core i7-14700K
GPU: NVIDIA GeForce RTX 4090
TensorRT-LLM: 0.13
Docker image: tritonserver:24.09-trtllm-python-py3
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
reference: python/openai
Expected behavior
When I run openai/openai_frontend/main.py with the 8-bit quantized ChatGLM4 model (engine size 9.95 GB), I expect GPU memory usage to be around 12 GB. However, during inference the entire 24 GB of GPU memory is filled up.
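For context, here is my own back-of-the-envelope reading of the log below (just my sketch, not anything the backend prints): with kv_cache_free_gpu_mem_fraction left at its default of 0.9, the backend reserves 90% of whatever GPU memory is still free after the engine is loaded for the paged KV cache, so most of the card being occupied may simply be the default behavior rather than a leak:

# My own estimate; the numbers are copied from the log under "Actual behavior".
total_gib = 23.99              # "total: 23.99 GiB"
available_gib = 12.22          # "available: 12.22 GiB" once the engine is loaded
kv_fraction = 0.9              # default when kv_cache_free_gpu_mem_fraction is unset
used_before_kv = total_gib - available_gib   # engine (~10 GiB) + CUDA context, buffers
kv_cache_gib = kv_fraction * available_gib   # ~11.0 GiB, matches "Allocated 11.00 GiB"
print(f"expected usage ~= {used_before_kv + kv_cache_gib:.1f} GiB of {total_gib} GiB")
# -> roughly 22.8 GiB, i.e. the card looks full even though the weights are only ~10 GiB.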
Actual behavior
root@docker-desktop:/llm/openai# python3 /llm/openai/openai_frontend/main.py --backend tensorrtllm --model-repository /llm/tensorrt_llm/model_repo --tokenizer /llm/tensorrt_llm/tokenizer_dir
I1019 19:28:29.074797 1272 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204c00000' with size 268435456"
I1019 19:28:29.074857 1272 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1019 19:28:29.173311 1272 model_lifecycle.cc:472] "loading: preprocessing:1"
I1019 19:28:29.177154 1272 model_lifecycle.cc:472] "loading: postprocessing:1"
I1019 19:28:29.180251 1272 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I1019 19:28:29.183750 1272 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
I1019 19:28:29.413431 1272 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1019 19:28:29.413459 1272 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1019 19:28:29.413473 1272 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1019 19:28:29.413485 1272 libtensorrtllm.cc:86] "backend configuration:\n{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I1019 19:28:29.416281 1272 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to true
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 40
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 2048
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1019 19:28:31.814801 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I1019 19:28:31.890846 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1019 19:28:31.891695 1272 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1019 19:28:32.878905 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I1019 19:28:33.534115 1272 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
I1019 19:28:33.534477 1272 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 10194 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 196.76 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10184 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 648.06 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 12.22 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 4506
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.00 GiB for max tokens in paged KV cache (288384).
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I1019 19:28:53.616560 1272 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I1019 19:28:53.616756 1272 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I1019 19:28:53.619599 1272 model_lifecycle.cc:472] "loading: ensemble:1"
I1019 19:28:53.619778 1272 model_lifecycle.cc:839] "successfully loaded 'ensemble'"
I1019 19:28:53.619824 1272 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1019 19:28:53.619851 1272 server.cc:631]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","mi |
| | | n-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","mi |
| | | n-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
I1019 19:28:53.619893 1272 server.cc:674]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I1019 19:28:53.644979 1272 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090"
I1019 19:28:53.648626 1272 metrics.cc:770] "Collecting CPU metrics"
I1019 19:28:53.648919 1272 tritonserver.cc:2598]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.50.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_s |
| | hared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /llm/tensorrt_llm/model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Found model: name='ensemble', backend='ensemble'
Found model: name='postprocessing', backend='python'
Found model: name='preprocessing', backend='python'
Found model: name='tensorrt_llm', backend='tensorrtllm'
Found model: name='tensorrt_llm_bls', backend='python'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO: Started server process [1272]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
Additional notes
I'm not sure whether this is controlled by the kv_cache_free_gpu_mem_fraction parameter of the tensorrt_llm model (the log shows it defaulting to 0.9 when unset). How can this be solved?
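If it is the KV cache pre-allocation, I would expect lowering kv_cache_free_gpu_mem_fraction in the tensorrt_llm model's config.pbtxt to cap it. A sketch of what I believe the relevant stanza looks like (the 0.3 value is only an example, not taken from my current config):

parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.3"
  }
}

Alternatively, max_tokens_in_paged_kv_cache should allow setting a fixed token budget instead of a fraction; both are the parameters the warnings at the top of the log refer to.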