- [2025/09] 🔥 RTP-LLM 0.2.0 release with enhanced performance and new features
- [2025/01] 🚀 RTP-LLM now supports Prefill/Decode separation with detailed technical report
- [2025/01] 🌟 Qwen series models and the BERT embedding model are now supported on Yitian ARM CPU
- [2024/06] 🔄 Major refactor: Scheduling and batching framework rewritten in C++, complete GPU memory management, and new Device backend
- [2024/06] 🏗️ Multi-hardware support in development: AMD ROCm, Intel CPU and ARM CPU support coming soon
RTP-LLM is a Large Language Model (LLM) inference acceleration engine developed by Alibaba's Foundation Model Inference Team. It is widely used within Alibaba Group, supporting LLM services across multiple business units including Taobao, Tmall, Idlefish, Cainiao, Amap, Ele.me, AE, and Lazada.
RTP-LLM is a sub-project of the havenask project.
Trusted and deployed across numerous LLM scenarios:
- Taobao Wenwen
- Alibaba's international AI platform, Aidge
- OpenSearch LLM Smart Q&A Edition
- Large Language Model based Long-tail Query Rewriting in Taobao Search
- Utilizes high-performance CUDA kernels, including PagedAttention, FlashAttention, FlashDecoding, etc.
- Implements WeightOnly INT8 Quantization with automatic quantization at load time (see the first sketch after this list)
- Supports WeightOnly INT4 Quantization with GPTQ and AWQ
- Adaptive KVCache Quantization
- Fine-grained framework-level optimization of dynamic batching overhead
- Specially optimized for the V100 GPU
- Seamless integration with HuggingFace models, supporting multiple weight formats such as SafeTensors, PyTorch, and Megatron
- Deploys multiple LoRA services with a single model instance (see the second sketch after this list)
- Handles multimodal inputs (combining images and text)
- Enables multi-machine/multi-GPU tensor parallelism
- Supports P-tuning models
- Loads pruned irregular models
- Contextual Prefix Cache for multi-turn dialogues
- System Prompt Cache
- Speculative Decoding
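As a rough illustration of the HuggingFace integration and load-time WeightOnly INT8 quantization listed above, here is a minimal sketch in Python. It assumes the `ModelFactory`/`Pipeline` style API from earlier RTP-LLM releases and a `WEIGHT_TYPE` environment variable to select the quantization mode; the exact module paths and option names are assumptions and may differ between versions.

```python
import os

# Assumption: weight-only INT8 quantization is selected at load time
# through an environment variable; the variable name may differ per release.
os.environ["WEIGHT_TYPE"] = "int8"

from maga_transformer.pipeline import Pipeline            # assumed module path
from maga_transformer.model_factory import ModelFactory   # assumed module path

if __name__ == "__main__":
    # Fetch (or reuse a cached copy of) a HuggingFace checkpoint; the
    # weights are quantized to INT8 as they are loaded onto the GPU.
    model = ModelFactory.from_huggingface("Qwen/Qwen-1_8B-Chat")
    pipeline = Pipeline(model, model.tokenizer)

    # Streamed generation: each iteration yields the response decoded so far.
    for res in pipeline("What is RTP-LLM?", max_new_tokens=128):
        print(res.batch_response)

    pipeline.stop()
```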
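The multi-LoRA deployment above can be exercised through the HTTP service. The sketch below is illustrative only: the port, the request schema, and the `adapter_name` field used to pick a LoRA adapter are all assumptions, so consult the documentation for the actual service API.

```python
import requests

# Assumptions for illustration: the server listens on localhost:8088 and
# accepts a JSON body with "prompt" and "generate_config" fields; selecting
# a LoRA adapter by name via "adapter_name" is likewise an assumption.
resp = requests.post(
    "http://127.0.0.1:8088",
    json={
        "prompt": "What is RTP-LLM?",
        "generate_config": {
            "max_new_tokens": 128,
            "adapter_name": "my_lora_adapter",  # hypothetical adapter name
        },
    },
    timeout=60,
)
print(resp.json())
```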
Learn more about RTP-LLM's performance in our benchmark reports.
Our project is mainly based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. We also drew inspiration from vLLM, HuggingFace Transformers, LLaVA, and Qwen-VL. We thank these projects for their inspiration and help.