forked from alibaba/rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.


zerozw/rtp-llm

 
 



| Documentation | Contact Us |

News

  • [2025/09] 🔥 RTP-LLM 0.2.0 release with enhanced performance and new features
  • [2025/01] 🚀 RTP-LLM now supports Prefill/Decode separation with detailed technical report
  • [2025/01] 🌟 Qwen series models and BERT embedding models are now supported on Yitian ARM CPU
  • [2024/06] 🔄 Major refactor: Scheduling and batching framework rewritten in C++, complete GPU memory management, and new Device backend
  • [2024/06] 🏗️ Multi-hardware support in development: AMD ROCm, Intel CPU and ARM CPU support coming soon

About

RTP-LLM is a Large Language Model (LLM) inference acceleration engine developed by Alibaba's Foundation Model Inference Team. It is widely used within Alibaba Group, supporting LLM services across multiple business units including Taobao, Tmall, Idlefish, Cainiao, Amap, Ele.me, AE, and Lazada.

RTP-LLM is a sub-project of the havenask project.

Key Features

🏢 Production Proven

Trusted and deployed across numerous LLM scenarios:

⚡ High Performance

  • Utilizes high-performance CUDA kernels, including PagedAttention, FlashAttention, FlashDecoding, etc.
  • Implements WeightOnly INT8 Quantization with automatic quantization at load time
  • Supports WeightOnly INT4 Quantization with GPTQ and AWQ
  • Adaptive KVCache Quantization
  • Detailed optimization of dynamic batching overhead at the framework level
  • Specially optimized for the V100 GPU
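
To make the WeightOnly INT8 idea above concrete, here is a minimal pure-Python sketch of symmetric per-channel weight-only quantization. It is an illustration under stated assumptions, not RTP-LLM's implementation or API: the function names `quantize_weight_int8` and `matvec_weight_only` are hypothetical, and the real engine does this work in CUDA kernels. The point is only that weights are stored as 8-bit integers with one scale per output channel, while activations stay in floating point.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_weight_int8(weight: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per row (output channel).

    Hypothetical helper for illustration; not part of RTP-LLM.
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid division by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def matvec_weight_only(
    q_weight: List[List[int]], scales: List[float], x: List[float]
) -> List[float]:
    """Compute y = W x from INT8 weights; the activations x remain floats."""
    return [
        scale * sum(q * xi for q, xi in zip(q_row, x))
        for q_row, scale in zip(q_weight, scales)
    ]


weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
q_weight, scales = quantize_weight_int8(weight)
y = matvec_weight_only(q_weight, scales, [1.0, 2.0, 3.0])
```

Quantizing at load time, as the feature list describes, would amount to running a step like `quantize_weight_int8` once per weight matrix when the checkpoint is read, so that no pre-quantized checkpoint is needed.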

🔧 Flexibility and Ease of Use

  • Seamless integration with HuggingFace models, supporting multiple weight formats such as SafeTensors, PyTorch, and Megatron
  • Deploys multiple LoRA services with a single model instance
  • Handles multimodal inputs (combining images and text)
  • Enables multi-machine/multi-GPU tensor parallelism
  • Supports P-tuning models
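
As a rough illustration of the tensor parallelism mentioned above, the sketch below splits a linear layer's weight by output columns across several simulated ranks, lets each rank compute its slice, and concatenates the results. This is a conceptual sketch only: the names `split_columns` and `column_parallel_forward` are hypothetical, everything runs in a single process, and the final concatenation stands in for the all-gather that a real multi-GPU setup would perform. It does not reflect how RTP-LLM partitions or communicates.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: List[float], w: Matrix) -> List[float]:
    """y[j] = sum_i x[i] * w[i][j], with w stored as [in_features][out_features]."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def split_columns(w: Matrix, world_size: int) -> List[Matrix]:
    """Give each simulated rank a contiguous slice of the output columns."""
    out_features = len(w[0])
    if out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    shard = out_features // world_size
    return [
        [row[rank * shard:(rank + 1) * shard] for row in w]
        for rank in range(world_size)
    ]


def column_parallel_forward(x: List[float], w: Matrix, world_size: int) -> List[float]:
    """Each rank multiplies by its own shard; concatenation mimics an all-gather."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, world_size)]
    return [value for partial in partial_outputs for value in partial]


w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 2.0]
sharded = column_parallel_forward(x, w, world_size=2)
```

Because each rank holds only its slice of the weight, per-device memory for this layer shrinks by roughly the number of ranks, which is what makes models too large for one GPU servable.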

🚀 Advanced Acceleration Techniques

  • Loads pruned irregular models
  • Contextual Prefix Cache for multi-turn dialogues
  • System Prompt Cache
  • Speculative Decoding
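
The prefix-cache items above can be pictured with a small block-level lookup: a new request reuses cached KV blocks for the longest token prefix it shares with earlier requests, which is why multi-turn dialogues and shared system prompts benefit. The sketch below is a simplified illustration, not RTP-LLM's data structure. The class name `PrefixCache`, the block size, and the integer block ids are all assumptions, and the real KV tensors are replaced by ids.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy block-level prefix cache (hypothetical; for illustration only)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # Key: the full token prefix ending at a block boundary.
        # Value: id of the cached block that holds that boundary's KV entries.
        self._blocks: Dict[Tuple[int, ...], int] = {}
        self._next_id = 0

    def _boundaries(self, tokens: List[int]) -> range:
        """End offsets of every complete block in the token sequence."""
        full = len(tokens) - len(tokens) % self.block_size
        return range(self.block_size, full + 1, self.block_size)

    def insert(self, tokens: List[int]) -> None:
        """Register every complete block of a finished request."""
        for end in self._boundaries(tokens):
            key = tuple(tokens[:end])
            if key not in self._blocks:
                self._blocks[key] = self._next_id
                self._next_id += 1

    def match(self, tokens: List[int]) -> List[int]:
        """Return block ids covering the longest cached prefix of tokens."""
        matched: List[int] = []
        for end in self._boundaries(tokens):
            block_id = self._blocks.get(tuple(tokens[:end]))
            if block_id is None:
                break
            matched.append(block_id)
        return matched


cache = PrefixCache(block_size=2)
cache.insert([1, 2, 3, 4, 5])            # first turn of a dialogue
reused = cache.match([1, 2, 3, 4, 9, 9])  # second turn shares the first 4 tokens
```

In this sketch the second request only needs prefill for the tokens past `len(reused) * cache.block_size`. A production cache would also need eviction and reference counting, which are omitted here.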

Getting Started

Benchmark and Performance

Learn more about RTP-LLM's performance in our benchmark reports:

Acknowledgments

Our project is based mainly on FasterTransformer, into which we have integrated some kernel implementations from TensorRT-LLM. We also draw inspiration from vllm, transformers, llava, and qwen-vl. We thank these projects for their inspiration and help.

Contact Us

DingTalk Group

WeChat Group

