Dear TRT-LLM team,
Let's consider sm80 and f16s8. The CUTLASS example of an f16s8 TN mixed GEMM shown here is different from the TensorRT-LLM implementation: specifically, to my knowledge, the TRT-LLM one adds a dequantization scale, while the native CUTLASS one does not.
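For concreteness, this is roughly the math I mean by "adding the dequantization scale" — a naive, unfused sketch under my own assumptions about the layout (A row-major, B column-major, a hypothetical per-output-channel fp16 scale vector), not the actual TensorRT-LLM kernel:

```cuda
// Conceptual sketch only (not the TensorRT-LLM kernel): a naive f16 x s8 GEMM
// where the int8 weight is dequantized with a per-output-channel scale.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void f16s8_gemm_dequant_naive(const half* A, const int8_t* B,
                                         const half* scale, half* C,
                                         int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output row (M)
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // output column (N)
    if (row >= M || col >= N) return;

    float acc = 0.f;
    float s = __half2float(scale[col]);  // per-column dequantization scale (assumed)
    for (int k = 0; k < K; ++k) {
        float a = __half2float(A[row * K + k]);        // fp16 activation
        // The int8 weight is converted to floating point and scaled before the FMA.
        // In a fused kernel this conversion + scale would happen in registers right
        // before the tensor-core MMA, not per scalar element as shown here.
        float b = static_cast<float>(B[col * K + k]) * s;
        acc += a * b;
    }
    C[row * N + col] = __float2half(acc);
}
```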
My questions are:
- Is the TRT-LLM approach of fusing the dequantization scale better than the native CUTLASS one in terms of performance or accuracy for LLM linear layers?
- From here, I see that the TRT-LLM implementation seems to load operand B (s8) using LDS rather than LDSM, but I can't find an f16s8 LDS specialization in MmaTensorOpMultiplicandTileIterator; I only find an LDS specialization for TF32, which confuses me about the "LDS". Am I missing something? (A minimal sketch of what I mean by LDS vs. LDSM is included after this list.)
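To make the LDS / LDSM question concrete, here is a minimal sketch (my own assumed names, sizes, and addressing, not the CUTLASS iterator code) of the two shared-memory load paths I am referring to, for sm_75+/sm80:

```cuda
// Conceptual illustration only: LDSM (ldmatrix) for the fp16 operand vs. a plain
// shared-memory load (LDS) for the int8 operand. Names and tile sizes are assumed.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void ldsm_vs_lds_demo(half* out) {
    __shared__ __align__(16) half   smem_a[16 * 16];  // fp16 tile: fed to ldmatrix
    __shared__ __align__(16) int8_t smem_b[16 * 16];  // int8 tile: plain LDS

    int lane = threadIdx.x & 31;

    // --- LDSM path (fp16 operand): ldmatrix loads four 8x8 b16 matrices and
    // scatters them across the warp in the register layout mma.sync expects.
    uint32_t a_frag[4];
    uint32_t a_addr = static_cast<uint32_t>(
        __cvta_generic_to_shared(&smem_a[lane * 8]));  // simplified addressing
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(a_frag[0]), "=r"(a_frag[1]), "=r"(a_frag[2]), "=r"(a_frag[3])
        : "r"(a_addr));

    // --- LDS path (int8 operand): an ordinary shared-memory load (compiles to LDS).
    // The int8 data still has to be converted to fp16 (and scaled) in registers
    // before the fp16 tensor-core MMA can consume it, so the 16-bit-element
    // fragment layout produced by ldmatrix would not apply directly.
    uint32_t packed_b =
        *reinterpret_cast<const uint32_t*>(&smem_b[lane * 4]);  // 4 x int8 per lane

    // Naive int8 -> fp16 conversion in registers (the real kernels use a fast
    // packed conversion and fold in the dequantization scale).
    half b_frag[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        int8_t v = static_cast<int8_t>((packed_b >> (8 * i)) & 0xFF);
        b_frag[i] = __int2half_rn(v);
    }

    // Dummy write so the compiler keeps both paths.
    out[lane] = __hadd(b_frag[0],
                       __ushort_as_half(static_cast<unsigned short>(a_frag[0] & 0xFFFF)));
}
```

My guess (please correct me if I'm wrong) is that ldmatrix's fragment layout is tied to 16-bit elements, which is why the int8 B tile would go through plain LDS and an in-register convert-and-scale before the MMA, but I would like to confirm where that LDS path actually lives in the code.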
Thanks for your time!