Description
I am using TensorRT-LLM 0.8.0 (I added MoE support following Llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).
Qwen2-MoE-57B-A14B runs well on a single NVIDIA A800. But if we run it with tp=2 (two A800s or two L40s), here is what we get:
I0709 12:38:32.372394 271 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8101
I0709 12:38:32.372581 271 http_server.cc:4685] Started HTTPService at 0.0.0.0:8100
I0709 12:38:32.424085 271 http_server.cc:320] Started Metrics Service at 0.0.0.0:8102
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_template.h:122)
1 0x7fa66c2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fa66c5f4273 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0xb0b273) [0x7fa66c5f4273]
3 0x7fa66c61561f void tensorrt_llm::MoeGemmRunner<__half, unsigned char>::runGemm<tensorrt_llm::cutlass_extensions::EpilogueOpDefault>(__half const*, unsigned char const*, __half const*, __half const*, __half*, long*, long, long, long, int, CUstream_st*) + 591
4 0x7fa66dce0f67 tensorrt_llm::kernels::CutlassMoeFCRunner<__half, unsigned char, void>::runMoe(void const*, float const*, void const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, void const*, int, int, int, int, int, char*, void*, void*, bool const*, int, void*, int*, int*, tensorrt_llm::kernels::MOEParallelismConfig, tensorrt_llm::kernels::MOEExpertScaleNormalizationMode, CUstream_st*) + 1751
5 0x7fa750f1ad9a tensorrt_llm::plugins::MixtureOfExpertsPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 954
6 0x7fa627706ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fa627706ba9]
7 0x7fa6276dc6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fa6276dc6af]
8 0x7fa6276de320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fa6276de320]
9 0x7fa66e145a7b tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 59
10 0x7fa66e147714 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1188
11 0x7fa66e14d724 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3716
12 0x7fa66e11da68 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 56
13 0x7fa66e1227c7 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 247
14 0x7fa76e4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa76e4b0253]
15 0x7fa76e158ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa76e158ac3]
16 0x7fa76e1ea850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa76e1ea850]
This error occurs when sending a large number of requests to tritonserver.
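Roughly, the load that triggers it looks like the sketch below: many concurrent requests against the HTTP endpoint on port 8100. This is only a minimal sketch of my client; the model name "ensemble" and the tensor names text_input / max_tokens / text_output are the tensorrtllm_backend ensemble defaults and are assumptions here, so adjust them to your model repository.

# Minimal load sketch (assumed: "ensemble" model with the default
# tensorrtllm_backend tensor names; adjust to your model repository).
import numpy as np
from concurrent.futures import ThreadPoolExecutor
import tritonclient.http as httpclient

URL = "localhost:8100"   # HTTPService port from the log above
MODEL = "ensemble"       # assumed model name
PROMPT = "Explain mixture-of-experts routing in one paragraph. " * 50

def one_request(i):
    client = httpclient.InferenceServerClient(url=URL)
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([[PROMPT]], dtype=object))
    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[256]], dtype=np.int32))
    result = client.infer(MODEL, inputs=[text, max_tokens])
    return result.as_numpy("text_output")[0]

# Fire many requests concurrently; the crash appears under this kind of load.
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(one_request, range(512)))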
Here is my build config for two A800s:
python build.py --hf_model_dir /data/cbs/models/Qwen2-57B-A14B-Instruct \
--dtype float16 \
--use_inflight_batching \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--max_batch_size 128 \
--max_input_len 6144 \
--max_output_len 1024 \
--max_num_tokens 204800 \
--use_weight_only \
--weight_only_precision int8 \
--tp_size 2 \
--world_size 2 \
--tp_mode 2 \
--output_dir /data/cbs/engines/Qwen2-57B-A14B-Instruct
Looking for help; this may be a bug in the MoE kernel.
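For reference, the assertion is the shared-memory capacity check in moe_gemm_kernels_template.h (line 122 in the trace). Below is a minimal sketch to print the device shared-memory limits it compares against; the cudaDeviceAttr enum values and the libcudart soname are assumptions for the CUDA 12 runtime in the 24.02 image, so verify them against your headers.

# Sketch to print the shared-memory limits relevant to the GroupedGEMM check.
# Assumptions: standard cudaDeviceAttr enum values and libcudart.so.12 from
# the CUDA 12 runtime in the 24.02 container; adjust for your installation.
import ctypes

libcudart = ctypes.CDLL("libcudart.so.12")

MAX_SMEM_PER_BLOCK = 8         # cudaDevAttrMaxSharedMemoryPerBlock (assumed value)
MAX_SMEM_PER_SM = 81           # cudaDevAttrMaxSharedMemoryPerMultiprocessor (assumed value)
MAX_SMEM_PER_BLOCK_OPTIN = 97  # cudaDevAttrMaxSharedMemoryPerBlockOptin (assumed value)

def device_attr(attr, device):
    value = ctypes.c_int(0)
    err = libcudart.cudaDeviceGetAttribute(ctypes.byref(value), attr, device)
    if err != 0:
        raise RuntimeError(f"cudaDeviceGetAttribute returned error {err}")
    return value.value

count = ctypes.c_int(0)
libcudart.cudaGetDeviceCount(ctypes.byref(count))
for dev in range(count.value):
    print(f"device {dev}: "
          f"smem/block={device_attr(MAX_SMEM_PER_BLOCK, dev)} "
          f"smem/block(opt-in)={device_attr(MAX_SMEM_PER_BLOCK_OPTIN, dev)} "
          f"smem/SM={device_attr(MAX_SMEM_PER_SM, dev)}")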