Description
I am using TensorRT-LLM 0.8.0 (I added MoE support following Llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).
Qwen2-MoE-57B-A14B runs well on a single NVIDIA A800. But if we run it with tp=2 (two A800s or two L40s), here is what we get:
I0709 12:38:32.372394 271 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8101
I0709 12:38:32.372581 271 http_server.cc:4685] Started HTTPService at 0.0.0.0:8100
I0709 12:38:32.424085 271 http_server.cc:320] Started Metrics Service at 0.0.0.0:8102
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_template.h:122)
1 0x7fa66c2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fa66c5f4273 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0xb0b273) [0x7fa66c5f4273]
3 0x7fa66c61561f void tensorrt_llm::MoeGemmRunner<__half, unsigned char>::runGemm<tensorrt_llm::cutlass_extensions::EpilogueOpDefault>(__half const*, unsigned char const*, __half const*, __half const*, __half*, long*, long, long, long, int, CUstream_st*) + 591
4 0x7fa66dce0f67 tensorrt_llm::kernels::CutlassMoeFCRunner<__half, unsigned char, void>::runMoe(void const*, float const*, void const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, void const*, int, int, int, int, int, char*, void*, void*, bool const*, int, void*, int*, int*, tensorrt_llm::kernels::MOEParallelismConfig, tensorrt_llm::kernels::MOEExpertScaleNormalizationMode, CUstream_st*) + 1751
5 0x7fa750f1ad9a tensorrt_llm::plugins::MixtureOfExpertsPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 954
6 0x7fa627706ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fa627706ba9]
7 0x7fa6276dc6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fa6276dc6af]
8 0x7fa6276de320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fa6276de320]
9 0x7fa66e145a7b tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 59
10 0x7fa66e147714 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1188
11 0x7fa66e14d724 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3716
12 0x7fa66e11da68 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 56
13 0x7fa66e1227c7 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 247
14 0x7fa76e4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa76e4b0253]
15 0x7fa76e158ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa76e158ac3]
16 0x7fa76e1ea850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa76e1ea850]
This error occurs when sending a large number of requests to tritonserver.
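Roughly, the load that triggers it looks like the sketch below: many concurrent requests against the HTTP endpoint on port 8100. This is only a minimal sketch of my client; the model name "ensemble" and the tensor names text_input / max_tokens / text_output are the tensorrtllm_backend ensemble defaults and are assumptions here, so adjust them to your model repository.

# Minimal load sketch (assumed: "ensemble" model with the default
# tensorrtllm_backend tensor names; adjust to your model repository).
import numpy as np
from concurrent.futures import ThreadPoolExecutor
import tritonclient.http as httpclient

URL = "localhost:8100"   # HTTPService port from the log above
MODEL = "ensemble"       # assumed model name
PROMPT = "Explain mixture-of-experts routing in one paragraph. " * 50

def one_request(i):
    client = httpclient.InferenceServerClient(url=URL)
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([[PROMPT]], dtype=object))
    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[256]], dtype=np.int32))
    result = client.infer(MODEL, inputs=[text, max_tokens])
    return result.as_numpy("text_output")[0]

# Fire many requests concurrently; the crash appears under this kind of load.
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(one_request, range(512)))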
Here is my build config for two A800s:
python build.py --hf_model_dir /data/cbs/models/Qwen2-57B-A14B-Instruct \
--dtype float16 \
--use_inflight_batching \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--max_batch_size 128 \
--max_input_len 6144 \
--max_output_len 1024 \
--max_num_tokens 204800 \
--use_weight_only \
--weight_only_precision int8 \
--tp_size 2 \
--world_size 2 \
--tp_mode 2 \
--output_dir /data/cbs/engines/Qwen2-57B-A14B-Instruct
Looking for help; this may be a bug in the MoE kernel.
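For reference, the assertion is the shared-memory capacity check in moe_gemm_kernels_template.h (line 122 in the trace). Below is a minimal sketch to print the device shared-memory limits it compares against; the cudaDeviceAttr enum values and the libcudart soname are assumptions for the CUDA 12 runtime in the 24.02 image, so verify them against your headers.

# Sketch to print the shared-memory limits relevant to the GroupedGEMM check.
# Assumptions: standard cudaDeviceAttr enum values and libcudart.so.12 from
# the CUDA 12 runtime in the 24.02 container; adjust for your installation.
import ctypes

libcudart = ctypes.CDLL("libcudart.so.12")

MAX_SMEM_PER_BLOCK = 8         # cudaDevAttrMaxSharedMemoryPerBlock (assumed value)
MAX_SMEM_PER_SM = 81           # cudaDevAttrMaxSharedMemoryPerMultiprocessor (assumed value)
MAX_SMEM_PER_BLOCK_OPTIN = 97  # cudaDevAttrMaxSharedMemoryPerBlockOptin (assumed value)

def device_attr(attr, device):
    value = ctypes.c_int(0)
    err = libcudart.cudaDeviceGetAttribute(ctypes.byref(value), attr, device)
    if err != 0:
        raise RuntimeError(f"cudaDeviceGetAttribute returned error {err}")
    return value.value

count = ctypes.c_int(0)
libcudart.cudaGetDeviceCount(ctypes.byref(count))
for dev in range(count.value):
    print(f"device {dev}: "
          f"smem/block={device_attr(MAX_SMEM_PER_BLOCK, dev)} "
          f"smem/block(opt-in)={device_attr(MAX_SMEM_PER_BLOCK_OPTIN, dev)} "
          f"smem/SM={device_attr(MAX_SMEM_PER_SM, dev)}")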