Upon starting the Triton server, the following error occurs:
I0114 09:00:18.017712 880 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x755bdc000000' with size 268435456"
I0114 09:00:18.029910 880 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0114 09:00:18.029919 880 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0114 09:00:18.200807 880 model_lifecycle.cc:473] "loading: postprocessing:1"
I0114 09:00:18.200845 880 model_lifecycle.cc:473] "loading: preprocessing:1"
I0114 09:00:18.200881 880 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I0114 09:00:18.200904 880 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
I0114 09:00:18.378243 880 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0114 09:00:18.378292 880 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0114 09:00:18.378299 880 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0114 09:00:18.378306 880 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I0114 09:00:18.402251 880 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.17.0.dev2024121700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][WARNING] Chunked context is not supported for this configuration and will be disabled. Related configs: RNNBased: 0, KVCacheEnabled: 1, PagedContextFMHA: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][WARNING] Fix optionalParams : KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (256) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0114 09:00:21.592649 880 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I0114 09:00:21.818542 880 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0114 09:00:22.615258 880 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0114 09:00:23.110505 880 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][INFO] Loaded engine size: 5510 MiB
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I0114 09:00:24.115872 880 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(T) <= remaining_buffer_size (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/serializationUtils.h:32)
1 0x755b190d47df tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x755b192eb1c2 tensorrt_llm::kernels::jit::CubinObj::CubinObj(void const*, unsigned long) + 274
3 0x755b193017d4 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 292
4 0x755b19301132 tensorrt_llm::kernels::DecoderXQARunner::Resource::Resource(void const*, unsigned long) + 50
5 0x755b0cc5e149 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 1193
6 0x755b0cc95232 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 18
7 0x755b0cc952b2 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
8 0x755ad6b53b5b /usr/local/tensorrt/lib/libnvinfer.so.10(+0x11deb5b) [0x755ad6b53b5b]
9 0x755ad6b5045e /usr/local/tensorrt/lib/libnvinfer.so.10(+0x11db45e) [0x755ad6b5045e]
10 0x755ad6a832b7 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x110e2b7) [0x755ad6a832b7]
11 0x755ad6a81e6a /usr/local/tensorrt/lib/libnvinfer.so.10(+0x110ce6a) [0x755ad6a81e6a]
12 0x755ad6a99a77 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1124a77) [0x755ad6a99a77]
13 0x755ad6a9d5b6 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x11285b6) [0x755ad6a9d5b6]
14 0x755ad6a9db06 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1128b06) [0x755ad6a9db06]
15 0x755ad6ad4fc7 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x115ffc7) [0x755ad6ad4fc7]
16 0x755ad6ad5bd8 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1160bd8) [0x755ad6ad5bd8]
17 0x755ad6ad5cdb /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1160cdb) [0x755ad6ad5cdb]
18 0x755b1b12f275 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger*, float, bool) + 1413
19 0x755b1b58d428 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1304
20 0x755b1b51151e tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 526
21 0x755b1b628029 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185
22 0x755b1b6286bd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229
23 0x755b1b62990a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474
24 0x755b1b60f757 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87
25 0x755c3a2af38e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3238e) [0x755c3a2af38e]
26 0x755c3a2abc39 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185
27 0x755c3a2ac182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
28 0x755c3a299319 TRITONBACKEND_ModelInstanceInitialize + 153
29 0x755c43dd8619 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1619) [0x755c43dd8619]
30 0x755c43dd90a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20a2) [0x755c43dd90a2]
31 0x755c43dbecc3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cc3) [0x755c43dbecc3]
32 0x755c43dbf074 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188074) [0x755c43dbf074]
33 0x755c43dc865d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19165d) [0x755c43dc865d]
34 0x755c45578ec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x755c45578ec3]
35 0x755c43db5ee2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17eee2) [0x755c43db5ee2]
36 0x755c43dc3dac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cdac) [0x755c43dc3dac]
37 0x755c43dc7de2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190de2) [0x755c43dc7de2]
38 0x755c43ec7ca1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x290ca1) [0x755c43ec7ca1]
39 0x755c43ecaffc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293ffc) [0x755c43ecaffc]
40 0x755c440276f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f06f5) [0x755c440276f5]
41 0x755c45af5db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x755c45af5db4]
42 0x755c45573a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x755c45573a94]
43 0x755c45600a34 __clone + 68
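The failing assertion (`sizeof(T) <= remaining_buffer_size` in decoderXQAImplJIT/serializationUtils.h) is raised while deserializing the JIT-compiled XQA kernel cubins embedded in the GPTAttentionPlugin, which typically points to the engine file not matching the TensorRT-LLM runtime that is loading it (for example, an engine built with a different TensorRT-LLM version than the one shipped in the Triton container, or a truncated or corrupted engine file). Below is a minimal diagnostic sketch, not a confirmed fix: it assumes the engine directory from the report (/engines/llama3.1-8B) and that the engine's config.json records the builder version under a top-level "version" key, as suggested by the "Engine version 0.17.0.dev2024121700 found in the config file" log line.

```python
# Hedged diagnostic sketch: compare the TensorRT-LLM version recorded in the
# engine's config file with the version installed in the Triton container.
import json

import tensorrt_llm

# Engine path taken from the issue's model information (assumed layout).
ENGINE_DIR = "/engines/llama3.1-8B"

with open(f"{ENGINE_DIR}/config.json") as f:
    engine_config = json.load(f)

# Assumption: the builder version is stored under a top-level "version" key.
built_with = engine_config.get("version", "<not recorded>")
running_with = tensorrt_llm.__version__

print(f"engine built with TensorRT-LLM: {built_with}")
print(f"runtime TensorRT-LLM version  : {running_with}")
if built_with != running_with:
    print("Version mismatch: rebuild the engine with the TensorRT-LLM version "
          "bundled in this Triton container, or use the matching container.")
```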
System Info
System Specifications:
Container image: nvcr.io/nvidia/tritonserver
Configuration Details:
Model Information:
/engines/llama3.1-8B
/models/Meta-Llama-3.1-8B-Instruct
/repo/llama3/tensorrt_llm/1
Engine Configuration:
How can I fix the above error?
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The config of the engine and the config.pbtxt files:
Expected behavior
Start the Triton server using TensorRT-LLM.
Actual behavior
Upon starting the Triton server, the error shown in the log above occurs.
Additional notes
N/A