Description
The error happens under load after some time.
Executor API, version 548b5b7310.
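For context, requests are driven through the C++ Executor API. The sketch below is only a rough approximation of the load pattern, not the actual client: the engine path, prompt tokens, request count, and maxTokens are placeholders.

```cpp
#include "tensorrt_llm/executor/executor.h"

#include <cstddef>
#include <vector>

namespace tle = tensorrt_llm::executor;

int main()
{
    // Placeholder config; the options that matter here are sketched below the config.json dump.
    tle::ExecutorConfig config;

    // Placeholder engine directory (the real engine is the DeepseekForCausalLM build from config.json).
    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, config);

    constexpr std::size_t kNumRequests = 1000; // placeholder load level

    // Enqueue many requests so the scheduler keeps the paged KV cache busy.
    for (std::size_t i = 0; i < kNumRequests; ++i)
    {
        tle::VecTokens inputTokens(512, 1); // placeholder prompt tokens
        executor.enqueueRequest(tle::Request(inputTokens, /*maxTokens=*/128));
    }

    // Drain responses until every request has produced its final result.
    std::size_t finished = 0;
    while (finished < kNumRequests)
    {
        for (auto const& response : executor.awaitResponses())
        {
            if (response.hasError() || response.getResult().isFinal)
            {
                ++finished;
            }
        }
    }
    return 0;
}
```

After running for a while under this kind of load, forwardAsync aborts with: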
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:256)
1 0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2 0x7f07f479dd5e /home/askhoroshev/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x827d5e) [0x7f07f479dd5e]
3 0x7f07f6a8346d tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 141
4 0x7f07f6a835cf tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 223
5 0x7f07f6a84af7 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 951
6 0x7f07f6a86558 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7 0x7f07f6a881bd tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8 0x7f07f6a3f34c tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 300
9 0x7f07f6adaf07 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1479
10 0x7f07f6b704a1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11 0x7f07f6b775bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 956
12 0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
13 0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
14 0x7f0792e0b8d3 clone + 67
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:256)
1 0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2 0x7f07f479dd5e /home/askhoroshev/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x827d5e) [0x7f07f479dd5e]
3 0x7f07f6a8346d tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 141
4 0x7f07f6a835cf tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 223
5 0x7f07f6a84af7 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 951
6 0x7f07f6a86558 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7 0x7f07f6a881bd tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8 0x7f07f6a3f34c tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 300
9 0x7f07f6adaf07 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1479
10 0x7f07f6b704a1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11 0x7f07f6b775bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 956
12 0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
13 0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
14 0x7f0792e0b8d3 clone + 67 (/home/askhoroshev/TensorRT-LLM/modules/executor_server/src/serverImpl.cpp:412)
1 0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2 0x41f1fa /home/askhoroshev/TensorRT-LLM/cpp/build/modules/executor_server/executor_server() [0x41f1fa]
3 0x4acb90 /home/askhoroshev/TensorRT-LLM/cpp/build/modules/executor_server/executor_server() [0x4acb90]
4 0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
5 0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
6 0x7f0792e0b8d3 clone + 67
config.json
{
"version": "0.16.0.dev2024120300",
"pretrained_config": {
"architecture": "DeepseekForCausalLM",
"dtype": "bfloat16",
"vocab_size": 42064,
"hidden_size": 2048,
"num_hidden_layers": 28,
"num_attention_heads": 16,
"hidden_act": "swiglu",
"logits_dtype": "float32",
"norm_epsilon": 1e-05,
"runtime_defaults": null,
"position_embedding_type": "rope_gpt_neox",
"num_key_value_heads": 8,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mapping": {
"world_size": 1,
"gpus_per_node": 8,
"cp_size": 1,
"tp_size": 1,
"pp_size": 1,
"moe_tp_size": 1,
"moe_ep_size": 1
},
"quantization": {
"quant_algo": null,
"kv_cache_quant_algo": null,
"group_size": 128,
"smoothquant_val": 0.5,
"clamp_val": null,
"use_meta_recipe": false,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": null
},
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"share_embedding_table": false,
"head_size": 128,
"qk_layernorm": false,
"rotary_embedding_dim": 128,
"return_context_hidden": false,
"logits_type": "float32",
"moe_intermediate_size": 1792,
"rotary_base": 300000,
"rotary_scaling": null,
"moe": {
"num_experts": 64,
"shared_expert_intermediate_size": 3584,
"top_k": 6,
"normalization_mode": 0
}
},
"build_config": {
"max_input_len": 130048,
"max_seq_len": 131072,
"opt_batch_size": 8,
"max_batch_size": 256,
"max_beam_width": 1,
"max_num_tokens": 4096,
"opt_num_tokens": 256,
"max_prompt_embedding_table_size": 0,
"kv_cache_type": "PAGED",
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": true,
"force_num_profiles": null,
"profiling_verbosity": "layer_names_only",
"enable_debug_output": false,
"max_draft_len": 0,
"speculative_decoding_mode": 1,
"use_refit": false,
"input_timing_cache": null,
"output_timing_cache": "model.cache",
"lora_config": {
"lora_dir": [],
"lora_ckpt_source": "hf",
"max_lora_rank": 64,
"lora_target_modules": [],
"trtllm_modules_to_hf_modules": {}
},
"auto_parallel_config": {
"world_size": 1,
"gpus_per_node": 8,
"cluster_key": "H100-PCIe",
"cluster_info": null,
"sharding_cost_model": "alpha_beta",
"comm_cost_model": "alpha_beta",
"enable_pipeline_parallelism": false,
"enable_shard_unbalanced_shape": false,
"enable_shard_dynamic_shape": false,
"enable_reduce_scatter": true,
"builder_flags": null,
"debug_mode": false,
"infer_shape": true,
"validation_mode": false,
"same_buffer_io": {
"past_key_value_(\\d+)": "present_key_value_\\1"
},
"same_spec_io": {},
"sharded_io_allowlist": [
"past_key_value_\\d+",
"present_key_value_\\d*"
],
"fill_weights": false,
"parallel_config_cache": null,
"profile_cache": null,
"dump_path": null,
"debug_outputs": []
},
"weight_sparsity": false,
"weight_streaming": false,
"plugin_config": {
"dtype": "bfloat16",
"bert_attention_plugin": "auto",
"gpt_attention_plugin": "auto",
"gemm_plugin": "bfloat16",
"gemm_swiglu_plugin": null,
"fp8_rowwise_gemm_plugin": null,
"smooth_quant_gemm_plugin": null,
"qserve_gemm_plugin": null,
"identity_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"nccl_plugin": null,
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": null,
"smooth_quant_plugins": true,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"moe_plugin": "auto",
"mamba_conv1d_plugin": "auto",
"low_latency_gemm_plugin": null,
"low_latency_gemm_swiglu_plugin": null,
"context_fmha": true,
"bert_context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"reduce_fusion": false,
"user_buffer": false,
"tokens_per_block": 64,
"use_paged_context_fmha": true,
"use_fp8_context_fmha": false,
"multiple_profiles": false,
"paged_state": false,
"streamingllm": false,
"manage_weights": false,
"use_fused_mlp": true,
"pp_reduce_scatter": false
},
"use_strip_plan": false,
"max_encoder_input_len": 1024,
"use_fused_mlp": true,
"monitor_memory": false,
"use_mrope": false
}
}
Chunked context and context reuse are enabled.
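Both are turned on through the Executor API. A minimal sketch of the relevant configuration, assuming the standard ExecutorConfig/KvCacheConfig knobs (the free-GPU-memory fraction is a placeholder, not necessarily what the server passes):

```cpp
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

tle::ExecutorConfig makeExecutorConfig()
{
    // Context reuse: let the KV cache manager reuse blocks across requests.
    // Paged context FMHA is already built into the engine
    // (use_paged_context_fmha: true in config.json above).
    tle::KvCacheConfig kvCacheConfig(/*enableBlockReuse=*/true);
    kvCacheConfig.setFreeGpuMemoryFraction(0.9F); // placeholder value

    tle::ExecutorConfig config;
    config.setKvCacheConfig(kvCacheConfig);
    config.setEnableChunkedContext(true); // chunked context on
    return config;
}
```

For what it's worth, the trace above ends in BlockManager::getFreeBlock / claimLeafBlock while KVCacheManager::addSequence is allocating blocks for a new sequence, so the assertion fires on the block-allocation path while these two options are active.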
Startup logs
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Set logger level to INFO
[TensorRT-LLM][INFO] ExecutorServer on rank 0 starting...
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 131072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 38675 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 453.61 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 38668 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 809.11 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 692.36 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.21 GiB, available: 38.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5112
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2048
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 34.95 GiB for max tokens in paged KV cache (327168).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.