NVIDIA/TensorRT-LLM

#5403

· zoheth opened

on Jun 23, 2025

Poor performance after FP8 Quantization for Llama 3.1 on PyTorch backend

#5370

· geaned opened

on Jun 19, 2025

Abnormal Performance Scaling of W4AFP8 vs FP8 on H20-141G with Deepseek-R1 Models

Investigating

Performance

triaged

#5127

· Nekofish-L opened

on Jun 11, 2025

[WIP] Introduce Flux MoE operator

Community want to contribute

Performance

triaged

#4948

· lancelly opened

on Jun 5, 2025

How is the performance of the model with pytorch as the backend

Investigating

Performance

triaged

#4745

· oppolll opened

on May 29, 2025

Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B.

Community want to contribute

Performance

triaged

#4005

· shaonvidia opened

on May 1, 2025

Disaggregated Prefill & Decode serving optimizations

#3963

· mk-nvidia opened

on Apr 29, 2025

MoE optimizations

#3962

· mk-nvidia opened

on Apr 29, 2025

feat/add latency support for trtllm bench

Community want to contribute

Performance

triaged

#3730

· danielafrimi opened

on Apr 21, 2025

chore: support getting the latest iteration status

Community Engagement

Community want to contribute

Performance

triaged

#3414

· pansicheng opened

on Apr 9, 2025

Executor API: How to get throughput

Investigating

Performance

triaged

#3142

· khayamgondal opened

on Mar 28, 2025

[QST] why the implementation of f16xs8 mixed gemm is different between TRT-LLM and native cutlass mixed gemm example?

Investigating

Performance

triaged

#2659

· danielhua23 opened

on Jan 5, 2025

[Perf] Conditionally enable SWAP AB for speculative decoding
