moe kernel Assertion failed when running qwen2-moe-57B-A14B with TP enabled #1925

Open
@handoku

Description

I am using trtllm 0.8.0 (I added MoE support following the Llama implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).

qwen2-moe-57B-A14B runs well on a single NVIDIA A800. But if we run it with tp=2 (two A800s or two L40s), here is what we get:

I0709 12:38:32.372394 271 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8101
I0709 12:38:32.372581 271 http_server.cc:4685] Started HTTPService at 0.0.0.0:8100
I0709 12:38:32.424085 271 http_server.cc:320] Started Metrics Service at 0.0.0.0:8102
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_template.h:122)
1       0x7fa66c2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fa66c5f4273 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0xb0b273) [0x7fa66c5f4273]
3       0x7fa66c61561f void tensorrt_llm::MoeGemmRunner<__half, unsigned char>::runGemm<tensorrt_llm::cutlass_extensions::EpilogueOpDefault>(__half const*, unsigned char const*, __half const*, __half const*, __half*, long*, long, long, long, int, CUstream_st*) + 591
4       0x7fa66dce0f67 tensorrt_llm::kernels::CutlassMoeFCRunner<__half, unsigned char, void>::runMoe(void const*, float const*, void const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, void const*, int, int, int, int, int, char*, void*, void*, bool const*, int, void*, int*, int*, tensorrt_llm::kernels::MOEParallelismConfig, tensorrt_llm::kernels::MOEExpertScaleNormalizationMode, CUstream_st*) + 1751
5       0x7fa750f1ad9a tensorrt_llm::plugins::MixtureOfExpertsPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 954
6       0x7fa627706ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fa627706ba9]
7       0x7fa6276dc6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fa6276dc6af]
8       0x7fa6276de320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fa6276de320]
9       0x7fa66e145a7b tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 59
10      0x7fa66e147714 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1188
11      0x7fa66e14d724 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3716
12      0x7fa66e11da68 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 56
13      0x7fa66e1227c7 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 247
14      0x7fa76e4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa76e4b0253]
15      0x7fa76e158ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa76e158ac3]
16      0x7fa76e1ea850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa76e1ea850]

This error occurs when sending a large number of requests to tritonserver.
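For what it's worth, the assertion appears to be about the GPU's per-block shared-memory budget: the A800 (Ampere) allows roughly 163 KB of opt-in shared memory per block, while the L40 (Ada) allows roughly 99 KB, so a grouped-GEMM tile shape that fits on one card may not fit on the other. A quick standalone check of those limits (plain CUDA runtime API, nothing TRT-LLM-specific):

// smem_limits.cu -- print the per-block shared-memory limits of device 0,
// to compare what the A800 and the L40 actually offer.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int smem_default = 0, smem_optin = 0;
    // Default per-block limit (48 KB on most recent GPUs).
    cudaDeviceGetAttribute(&smem_default, cudaDevAttrMaxSharedMemoryPerBlock, 0);
    // Larger limit a kernel may opt in to via
    // cudaFuncAttributeMaxDynamicSharedMemorySize (what CUTLASS kernels use).
    cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    printf("shared memory per block: %d KB default, %d KB opt-in\n",
           smem_default / 1024, smem_optin / 1024);
    return 0;
}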

Here is my build config for the two A800s:

python build.py --hf_model_dir /data/cbs/models/Qwen2-57B-A14B-Instruct \
                --dtype float16 \
                --use_inflight_batching \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --max_batch_size 128 \
                --max_input_len 6144 \
                --max_output_len 1024 \
                --max_num_tokens 204800 \
                --use_weight_only \
                --weight_only_precision int8 \
                --tp_size 2 \
                --world_size 2 \
                --tp_mode 2 \
                --output_dir /data/cbs/engines/Qwen2-57B-A14B-Instruct

Looking for help; this may be a bug in the MoE kernel.
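In case it helps triage: the message suggests the MoE GEMM launcher computes the kernel's achievable occupancy for the chosen CUTLASS tile shape and asserts that it is non-zero. The small repro below (illustrative kernel name and a hypothetical 128 KB tile footprint, not TRT-LLM code) shows how requesting more dynamic shared memory than the device's opt-in limit drives occupancy to zero, which is the condition the assertion guards against:

// occupancy_repro.cu -- requesting more dynamic shared memory than the
// device's opt-in limit makes the achievable occupancy 0. The 128 KB figure
// is hypothetical; error handling is omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tileKernel() {
    extern __shared__ char smem[];
    smem[threadIdx.x] = 0;  // touch shared memory so the allocation is kept
}

int main() {
    int requested_smem = 128 * 1024;  // fits on an A800 (~163 KB), not an L40 (~99 KB)

    // Opt in to the large shared-memory carve-out, as CUTLASS kernels do.
    cudaFuncSetAttribute(tileKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, requested_smem);

    int occupancy = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&occupancy, tileKernel, 128, requested_smem);
    printf("blocks per SM at %d KB shared memory: %d\n", requested_smem / 1024, occupancy);
    // occupancy == 0 corresponds to "GPU lacks the shared memory resources
    // to run GroupedGEMM kernel".
    return 0;
}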

Metadata

Labels
Low Precision, functionality issue, triaged
