Description
The error happens under load after some time.
Executor API, version 548b5b7310.
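For context, requests are driven through the C++ Executor API. The sketch below is only a rough approximation of the load pattern, not the actual client: the engine path, prompt tokens, request count, and maxTokens are placeholders.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <cstddef>
#include <vector>

namespace tle = tensorrt_llm::executor;

int main()
{
    // Placeholder config; the options that matter here are sketched below the config.json dump.
    tle::ExecutorConfig config;

    // Placeholder engine directory (the real engine is the DeepseekForCausalLM build from config.json).
    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, config);

    constexpr std::size_t kNumRequests = 1000; // placeholder load level

    // Enqueue many requests so the scheduler keeps the paged KV cache busy.
    for (std::size_t i = 0; i < kNumRequests; ++i)
    {
        tle::VecTokens inputTokens(512, 1); // placeholder prompt tokens
        executor.enqueueRequest(tle::Request(inputTokens, /*maxTokens=*/128));
    }

    // Drain responses until every request has produced its final result.
    std::size_t finished = 0;
    while (finished < kNumRequests)
    {
        for (auto const& response : executor.awaitResponses())
        {
            if (response.hasError() || response.getResult().isFinal)
            {
                ++finished;
            }
        }
    }
    return 0;
}
```

After running for a while under this kind of load, forwardAsync aborts with: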
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:256)
1 0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2 0x7f07f479dd5e /home/askhoroshev/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x827d5e) [0x7f07f479dd5e]
3 0x7f07f6a8346d tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 141
4 0x7f07f6a835cf tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 223
5 0x7f07f6a84af7 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 951
6 0x7f07f6a86558 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7 0x7f07f6a881bd tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8 0x7f07f6a3f34c tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 300
9 0x7f07f6adaf07 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1479
10 0x7f07f6b704a1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11 0x7f07f6b775bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 956
12 0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
13 0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
14 0x7f0792e0b8d3 clone + 67
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:256)
1 0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2 0x7f07f479dd5e /home/askhoroshev/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x827d5e) [0x7f07f479dd5e]
3 0x7f07f6a8346d tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 141
4 0x7f07f6a835cf tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 223
5 0x7f07f6a84af7 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 951
6 0x7f07f6a86558 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7 0x7f07f6a881bd tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8 0x7f07f6a3f34c tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 300
9 0x7f07f6adaf07 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1479
10 0x7f07f6b704a1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11 0x7f07f6b775bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 956
12 0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
13 0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
14 0x7f0792e0b8d3 clone + 67 (/home/askhoroshev/TensorRT-LLM/modules/executor_server/src/serverImpl.cpp:412)
1 0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2 0x41f1fa /home/askhoroshev/TensorRT-LLM/cpp/build/modules/executor_server/executor_server() [0x41f1fa]
3 0x4acb90 /home/askhoroshev/TensorRT-LLM/cpp/build/modules/executor_server/executor_server() [0x4acb90]
4 0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
5 0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
6 0x7f0792e0b8d3 clone + 67
config.json
{
"version": "0.16.0.dev2024120300",
"pretrained_config": {
"architecture": "DeepseekForCausalLM",
"dtype": "bfloat16",
"vocab_size": 42064,
"hidden_size": 2048,
"num_hidden_layers": 28,
"num_attention_heads": 16,
"hidden_act": "swiglu",
"logits_dtype": "float32",
"norm_epsilon": 1e-05,
"runtime_defaults": null,
"position_embedding_type": "rope_gpt_neox",
"num_key_value_heads": 8,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mapping": {
"world_size": 1,
"gpus_per_node": 8,
"cp_size": 1,
"tp_size": 1,
"pp_size": 1,
"moe_tp_size": 1,
"moe_ep_size": 1
},
"quantization": {
"quant_algo": null,
"kv_cache_quant_algo": null,
"group_size": 128,
"smoothquant_val": 0.5,
"clamp_val": null,
"use_meta_recipe": false,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": null
},
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"share_embedding_table": false,
"head_size": 128,
"qk_layernorm": false,
"rotary_embedding_dim": 128,
"return_context_hidden": false,
"logits_type": "float32",
"moe_intermediate_size": 1792,
"rotary_base": 300000,
"rotary_scaling": null,
"moe": {
"num_experts": 64,
"shared_expert_intermediate_size": 3584,
"top_k": 6,
"normalization_mode": 0
}
},
"build_config": {
"max_input_len": 130048,
"max_seq_len": 131072,
"opt_batch_size": 8,
"max_batch_size": 256,
"max_beam_width": 1,
"max_num_tokens": 4096,
"opt_num_tokens": 256,
"max_prompt_embedding_table_size": 0,
"kv_cache_type": "PAGED",
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": true,
"force_num_profiles": null,
"profiling_verbosity": "layer_names_only",
"enable_debug_output": false,
"max_draft_len": 0,
"speculative_decoding_mode": 1,
"use_refit": false,
"input_timing_cache": null,
"output_timing_cache": "model.cache",
"lora_config": {
"lora_dir": [],
"lora_ckpt_source": "hf",
"max_lora_rank": 64,
"lora_target_modules": [],
"trtllm_modules_to_hf_modules": {}
},
"auto_parallel_config": {
"world_size": 1,
"gpus_per_node": 8,
"cluster_key": "H100-PCIe",
"cluster_info": null,
"sharding_cost_model": "alpha_beta",
"comm_cost_model": "alpha_beta",
"enable_pipeline_parallelism": false,
"enable_shard_unbalanced_shape": false,
"enable_shard_dynamic_shape": false,
"enable_reduce_scatter": true,
"builder_flags": null,
"debug_mode": false,
"infer_shape": true,
"validation_mode": false,
"same_buffer_io": {
"past_key_value_(\\d+)": "present_key_value_\\1"
},
"same_spec_io": {},
"sharded_io_allowlist": [
"past_key_value_\\d+",
"present_key_value_\\d*"
],
"fill_weights": false,
"parallel_config_cache": null,
"profile_cache": null,
"dump_path": null,
"debug_outputs": []
},
"weight_sparsity": false,
"weight_streaming": false,
"plugin_config": {
"dtype": "bfloat16",
"bert_attention_plugin": "auto",
"gpt_attention_plugin": "auto",
"gemm_plugin": "bfloat16",
"gemm_swiglu_plugin": null,
"fp8_rowwise_gemm_plugin": null,
"smooth_quant_gemm_plugin": null,
"qserve_gemm_plugin": null,
"identity_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"nccl_plugin": null,
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": null,
"smooth_quant_plugins": true,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"moe_plugin": "auto",
"mamba_conv1d_plugin": "auto",
"low_latency_gemm_plugin": null,
"low_latency_gemm_swiglu_plugin": null,
"context_fmha": true,
"bert_context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"reduce_fusion": false,
"user_buffer": false,
"tokens_per_block": 64,
"use_paged_context_fmha": true,
"use_fp8_context_fmha": false,
"multiple_profiles": false,
"paged_state": false,
"streamingllm": false,
"manage_weights": false,
"use_fused_mlp": true,
"pp_reduce_scatter": false
},
"use_strip_plan": false,
"max_encoder_input_len": 1024,
"use_fused_mlp": true,
"monitor_memory": false,
"use_mrope": false
}
}
Chunked context and context reuse are enabled.
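Both are turned on through the Executor API. A minimal sketch of the relevant configuration, assuming the standard ExecutorConfig/KvCacheConfig knobs (the free-GPU-memory fraction is a placeholder, not necessarily what the server passes):

```cpp
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

tle::ExecutorConfig makeExecutorConfig()
{
    // Context reuse: let the KV cache manager reuse blocks across requests.
    // Paged context FMHA is already built into the engine
    // (use_paged_context_fmha: true in config.json above).
    tle::KvCacheConfig kvCacheConfig(/*enableBlockReuse=*/true);
    kvCacheConfig.setFreeGpuMemoryFraction(0.9F); // placeholder value

    tle::ExecutorConfig config;
    config.setKvCacheConfig(kvCacheConfig);
    config.setEnableChunkedContext(true); // chunked context on
    return config;
}
```

For what it's worth, the trace above ends in BlockManager::getFreeBlock / claimLeafBlock while KVCacheManager::addSequence is allocating blocks for a new sequence, so the assertion fires on the block-allocation path while these two options are active.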
Startup logs
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Set logger level to INFO
[TensorRT-LLM][INFO] ExecutorServer on rank 0 starting...
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 131072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 38675 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 453.61 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 38668 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 809.11 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 692.36 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.21 GiB, available: 38.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5112
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2048
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 34.95 GiB for max tokens in paged KV cache (327168).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.