- [2025/09] 🔥 RTP-LLM 0.2.0 release with enhanced performance and new features
- [2025/01] 🚀 RTP-LLM now supports Prefill/Decode separation with detailed technical report
- [2025/01] 🌟 Qwen series models and the BERT embedding model are now supported on Yitian ARM CPU
- [2024/06] 🔄 Major refactor: Scheduling and batching framework rewritten in C++, complete GPU memory management, and new Device backend
- [2024/06] 🏗️ Multi-hardware support in development: AMD ROCm, Intel CPU and ARM CPU support coming soon
RTP-LLM is a Large Language Model (LLM) inference acceleration engine developed by Alibaba's Foundation Model Inference Team. It is widely used within Alibaba Group, supporting LLM services across multiple business units including Taobao, Tmall, Idlefish, Cainiao, Amap, Ele.me, AE, and Lazada.
RTP-LLM is a sub-project of the havenask project.
Trusted and deployed across numerous LLM scenarios:
- Taobao Wenwen
- Alibaba's international AI platform, Aidge
- OpenSearch LLM Smart Q&A Edition
- Large Language Model based Long-tail Query Rewriting in Taobao Search
- Utilizes high-performance CUDA kernels, including PagedAttention, FlashAttention, FlashDecoding, etc.
- Implements WeightOnly INT8 Quantization with automatic quantization at load time (see the first sketch after this list)
- Supports WeightOnly INT4 Quantization with GPTQ and AWQ
- Adaptive KVCache Quantization
- Fine-grained framework-level optimization of dynamic batching overhead
- Specially optimized for the V100 GPU
- Seamless integration with HuggingFace models, supporting multiple weight formats such as SafeTensors, PyTorch, and Megatron
- Deploys multiple LoRA services with a single model instance (see the second sketch after this list)
- Handles multimodal inputs (combining images and text)
- Enables multi-machine/multi-GPU tensor parallelism
- Supports P-tuning models
- Loads pruned irregular models
- Contextual Prefix Cache for multi-turn dialogues
- System Prompt Cache
- Speculative Decoding
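As a rough illustration of the HuggingFace integration and load-time WeightOnly INT8 quantization listed above, here is a minimal sketch in Python. It assumes the `ModelFactory`/`Pipeline` style API from earlier RTP-LLM releases and a `WEIGHT_TYPE` environment variable to select the quantization mode; the exact module paths and option names are assumptions and may differ between versions.

```python
import os

# Assumption: weight-only INT8 quantization is selected at load time
# through an environment variable; the variable name may differ per release.
os.environ["WEIGHT_TYPE"] = "int8"

from maga_transformer.pipeline import Pipeline            # assumed module path
from maga_transformer.model_factory import ModelFactory   # assumed module path

if __name__ == "__main__":
    # Fetch (or reuse a cached copy of) a HuggingFace checkpoint; the
    # weights are quantized to INT8 as they are loaded onto the GPU.
    model = ModelFactory.from_huggingface("Qwen/Qwen-1_8B-Chat")
    pipeline = Pipeline(model, model.tokenizer)

    # Streamed generation: each iteration yields the response decoded so far.
    for res in pipeline("What is RTP-LLM?", max_new_tokens=128):
        print(res.batch_response)

    pipeline.stop()
```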
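The multi-LoRA deployment above can be exercised through the HTTP service. The sketch below is illustrative only: the port, the request schema, and the `adapter_name` field used to pick a LoRA adapter are all assumptions, so consult the documentation for the actual service API.

```python
import requests

# Assumptions for illustration: the server listens on localhost:8088 and
# accepts a JSON body with "prompt" and "generate_config" fields; selecting
# a LoRA adapter by name via "adapter_name" is likewise an assumption.
resp = requests.post(
    "http://127.0.0.1:8088",
    json={
        "prompt": "What is RTP-LLM?",
        "generate_config": {
            "max_new_tokens": 128,
            "adapter_name": "my_lora_adapter",  # hypothetical adapter name
        },
    },
    timeout=60,
)
print(resp.json())
```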
Learn more about RTP-LLM's performance in our benchmark reports.
Our project is mainly based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. We also drew inspiration from vLLM, HuggingFace Transformers, LLaVA, and Qwen-VL. We thank these projects for their inspiration and help.