Dear TRT-LLM team,
Let's consider sm80 and f16s8. The CUTLASS example of an f16s8 TN mixed GEMM shown here is different from the TensorRT-LLM implementation: specifically, to my knowledge, the TRT-LLM one adds a dequantization scale, while the native CUTLASS one does not.
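For concreteness, this is roughly the math I mean by "adding the dequantization scale" — a naive, unfused sketch under my own assumptions about the layout (A row-major, B column-major, a hypothetical per-output-channel fp16 scale vector), not the actual TensorRT-LLM kernel:

```cuda
// Conceptual sketch only (not the TensorRT-LLM kernel): a naive f16 x s8 GEMM
// where the int8 weight is dequantized with a per-output-channel scale.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void f16s8_gemm_dequant_naive(const half* A, const int8_t* B,
                                         const half* scale, half* C,
                                         int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output row (M)
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // output column (N)
    if (row >= M || col >= N) return;

    float acc = 0.f;
    float s = __half2float(scale[col]);  // per-column dequantization scale (assumed)
    for (int k = 0; k < K; ++k) {
        float a = __half2float(A[row * K + k]);        // fp16 activation
        // The int8 weight is converted to floating point and scaled before the FMA.
        // In a fused kernel this conversion + scale would happen in registers right
        // before the tensor-core MMA, not per scalar element as shown here.
        float b = static_cast<float>(B[col * K + k]) * s;
        acc += a * b;
    }
    C[row * N + col] = __float2half(acc);
}
```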
My questions are:
- Is the TRT-LLM approach of fusing the dequantization scale better than the native CUTLASS one in terms of performance or accuracy for LLM linear layers?
- From here, I see that the TRT-LLM implementation seems to load operand B (s8) using LDS rather than LDSM, but I can't find an f16s8 LDS specialization in MmaTensorOpMultiplicandTileIterator; I only find an LDS specialization for TF32, which confuses me about the "LDS". Am I missing something? (A minimal sketch of what I mean by LDS vs. LDSM is included after this list.)
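To make the LDS / LDSM question concrete, here is a minimal sketch (my own assumed names, sizes, and addressing, not the CUTLASS iterator code) of the two shared-memory load paths I am referring to, for sm_75+/sm80:

```cuda
// Conceptual illustration only: LDSM (ldmatrix) for the fp16 operand vs. a plain
// shared-memory load (LDS) for the int8 operand. Names and tile sizes are assumed.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void ldsm_vs_lds_demo(half* out) {
    __shared__ __align__(16) half   smem_a[16 * 16];  // fp16 tile: fed to ldmatrix
    __shared__ __align__(16) int8_t smem_b[16 * 16];  // int8 tile: plain LDS

    int lane = threadIdx.x & 31;

    // --- LDSM path (fp16 operand): ldmatrix loads four 8x8 b16 matrices and
    // scatters them across the warp in the register layout mma.sync expects.
    uint32_t a_frag[4];
    uint32_t a_addr = static_cast<uint32_t>(
        __cvta_generic_to_shared(&smem_a[lane * 8]));  // simplified addressing
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(a_frag[0]), "=r"(a_frag[1]), "=r"(a_frag[2]), "=r"(a_frag[3])
        : "r"(a_addr));

    // --- LDS path (int8 operand): an ordinary shared-memory load (compiles to LDS).
    // The int8 data still has to be converted to fp16 (and scaled) in registers
    // before the fp16 tensor-core MMA can consume it, so the 16-bit-element
    // fragment layout produced by ldmatrix would not apply directly.
    uint32_t packed_b =
        *reinterpret_cast<const uint32_t*>(&smem_b[lane * 4]);  // 4 x int8 per lane

    // Naive int8 -> fp16 conversion in registers (the real kernels use a fast
    // packed conversion and fold in the dequantization scale).
    half b_frag[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        int8_t v = static_cast<int8_t>((packed_b >> (8 * i)) & 0xFF);
        b_frag[i] = __int2half_rn(v);
    }

    // Dummy write so the compiler keeps both paths.
    out[lane] = __hadd(b_frag[0],
                       __ushort_as_half(static_cast<unsigned short>(a_frag[0] & 0xFFFF)));
}
```

My guess (please correct me if I'm wrong) is that ldmatrix's fragment layout is tied to 16-bit elements, which is why the int8 B tile would go through plain LDS and an in-register convert-and-scale before the MMA, but I would like to confirm where that LDS path actually lives in the code.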
Thanks for your time!