
lyraDiff: An Out-of-the-box Acceleration Engine for Diffusion and DiT Models

lyraDiff is a recompilation-free inference engine for Diffusion and DiT models, achieving state-of-the-art speed, extensive model support, and pixel-level image consistency.

Highlights

  • State-of-the-art Inference Speed: lyraDiff combines multiple techniques, including quantization, fused GEMM kernels, Flash Attention, and NHWC layout with fused GroupNorm, to deliver up to a 6.1x inference speedup.
  • Memory Efficiency: lyraDiff uses a buffer-based DRAM reuse strategy and multiple quantization formats (FP8/INT8/INT4) to cut DRAM usage by 10-40%.
  • Extensive Model Support: lyraDiff supports a wide range of popular generative/SR models such as SD1.5, SDXL, FLUX, and S3Diff, as well as the most commonly used plugins such as LoRA, ControlNet, and IP-Adapter.
  • Zero Compilation Deployment: Unlike TensorRT or AITemplate, which take minutes to compile, lyraDiff incurs no runtime compilation overhead, even with dynamic input shapes.
  • Image Gen Consistency: lyraDiff's outputs match those of HF diffusers at the pixel level, even when switching LoRAs in quantization mode.
  • Fast Plugin Hot-swap: lyraDiff provides super-fast model hot-swapping for ControlNet and LoRA, which greatly benefits real-time image generation services.

Overview

NOTE:

  1. We only support GPUs with SM version >= 80, e.g., the NVIDIA Ampere (A2, A10, A16, A30, A40, A100), Ada Lovelace (L20, RTX 4090), and Hopper (H20, H100, H800) architectures; see the snippet below to check yours.
  2. INT4 quantization is currently at the draft stage and leaves considerable room for performance improvement.
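
If you are unsure of your GPU's SM version, you can query the compute capability through PyTorch; this small helper is ours, not part of lyraDiff:

import torch

# Print the SM (compute capability) of each visible GPU; lyraDiff requires >= 80
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: SM {major}{minor}")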

Performance

Test environment

  • NVIDIA driver version: 535.161.07
  • CUDA version: 12.4

SDXL performance (Device: A100-40G, Dtype: int8)

We benchmark our SDXL acceleration on these commonly used image generation shapes. The baseline is HF diffusers in fp16; lyraDiff runs with INT8 SmoothQuant quantization, which shows only an acceptable difference from the original images on A100 while delivering better performance. A minimal baseline timing sketch follows the table.

| Image Size | Steps | Time cost (s, diffusers fp16) | Time cost (s, lyraDiff) | Speedup |
|------------|-------|-------------------------------|-------------------------|---------|
| 512x512    | 20    | 1.41                          | 0.49                    | 2.9x    |
| 768x768    | 20    | 1.53                          | 0.89                    | 1.7x    |
| 1024x1024  | 20    | 2.30                          | 1.44                    | 1.6x    |
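
For reference, baseline numbers like those above can be reproduced with a plain diffusers timing loop. This is our own sketch (the model id, prompt, and warm-up policy are assumptions), not the repo's benchmark script:

import time
import torch
from diffusers import StableDiffusionXLPipeline

# fp16 diffusers baseline, as used in the comparison above
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe("warm-up", num_inference_steps=2)  # exclude one-time CUDA/cuDNN init from timing
torch.cuda.synchronize()
t0 = time.time()
pipe("a photo of a cat", height=1024, width=1024, num_inference_steps=20)
torch.cuda.synchronize()
print(f"time cost: {time.time() - t0:.2f}s")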

FLUX performance

We benchmark our FLUX.1-dev acceleration on these commonly used image generation shapes. The baseline is HF diffusers with FP8 weight-only (FP8-WO) quantization; lyraDiff runs with FP8 quantization, which shows almost no difference from the original images on L20 and 4090 while delivering better performance.

Device: 4090-24G, Dtype: fp8

| Image Size | Steps | Time cost (s, diffusers FP8-WO) | Time cost (s, lyraDiff) | Speedup |
|------------|-------|---------------------------------|-------------------------|---------|
| 512x512    | 20    | 3.81                            | 2.16                    | 1.8x    |
| 768x768    | 20    | 7.55                            | 4.13                    | 1.8x    |
| 1024x1024  | 20    | 13.48                           | 7.22                    | 1.9x    |
| 1280x1280  | 20    | 22.77                           | 13.14                   | 1.7x    |

Device: L20-48G, Dtype: fp8

| Image Size | Steps | Time cost (s, diffusers FP8-WO) | Time cost (s, lyraDiff) | Speedup |
|------------|-------|---------------------------------|-------------------------|---------|
| 512x512    | 20    | 5.54                            | 2.91                    | 1.9x    |
| 768x768    | 20    | 10.22                           | 5.64                    | 1.8x    |
| 1024x1024  | 20    | 18.34                           | 10.28                   | 1.8x    |
| 1280x1280  | 20    | 31.47                           | 17.64                   | 1.8x    |

S3Diff performance (Device: A10-24G, Dtype: fp16)

We benchmark our S3Diff acceleration on these commonly used image super-resolution shapes, using the original fp16 pipeline on A10 as the baseline.

| Image Size | SR Ratio | Time cost (s, original) | Time cost (s, lyraDiff) | Speedup |
|------------|----------|-------------------------|-------------------------|---------|
| 128x128    | 4        | 0.68                    | 0.17                    | 4.0x    |
| 512x512    | 2        | 2.10                    | 0.86                    | 2.4x    |
| 720x1280   | 2        | 16.64                   | 3.36                    | 4.9x    |
| 1024x1024  | 2        | 19.90                   | 3.75                    | 5.3x    |
| 1920x1080  | 2        | 42.24                   | 6.91                    | 6.1x    |

Model swap performance (Device: A100-40G, Dtype: fp16)

| Model | Plugin     | Time cost (s, baseline) | Time cost (s, lyraDiff) | Speedup |
|-------|------------|-------------------------|-------------------------|---------|
| SD1.5 | LoRA       | 1.46                    | 0.14                    | 10.4x   |
| SD1.5 | ControlNet | 0.72                    | 0.64                    | 1.1x    |
| SDXL  | LoRA       | 4.27                    | 0.63                    | 6.8x    |
| SDXL  | ControlNet | 2.08                    | 0.99                    | 2.1x    |
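
For context, a baseline LoRA swap can be timed in stock diffusers as below; this sketch is ours, and the model id and LoRA path are placeholders:

import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

torch.cuda.synchronize()
t0 = time.time()
pipe.load_lora_weights("/path/to/lora")  # placeholder path to a LoRA checkpoint
torch.cuda.synchronize()
print(f"LoRA swap: {time.time() - t0:.2f}s")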

Getting Started

Method 1: Using Docker (Recommended)

You can use the Docker image we provide: ccr.ccs.tencentyun.com/tme-lyralab/lyradiff_workspace:1.0.0.

docker run --gpus all -it --rm --name lyraDiff_workspace ccr.ccs.tencentyun.com/tme-lyralab/lyradiff_workspace:1.0.0
# The lyradiff python lib is pre-installed, and libth_lyradiff.so is located at
# /workspace/lyradiff_libs/libth_lyradiff.so, which lyradiff loads by default.
# Export LYRA_LIB_PATH="/path/to/libth_lyradiff.so" to override the default.
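
Once inside the container, a quick import check (our own snippet, not part of lyraDiff) confirms the setup:

import os
import lyradiff  # should import cleanly if the image is set up correctly

# The default library location inside the provided image; override via LYRA_LIB_PATH
print(os.environ.get("LYRA_LIB_PATH", "/workspace/lyradiff_libs/libth_lyradiff.so"))
print("lyradiff OK")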

Method 2: Build from source

NVIDIA's standard NGC Docker images are highly recommended, e.g. nvcr.io/nvidia/pytorch:24.08-py3.

docker run --gpus all -it --rm --name lyraDiff_workspace nvcr.io/nvidia/pytorch:24.08-py3

cd </path/to/lyraDiff/workspace/>

# Step 1: Install the lyradiff python lib
pip install -e .

# Step 2: Build libth_lyradiff.so
git submodule init && git submodule update

mkdir build && cd build

# Set -DSM to match your GPU architecture (e.g. 80 for A100, 89 for RTX 4090, 90 for H100)
cmake -DSM=80 -DCXX_STD=17 -DBUILD_PYT=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..

make -j16

# You will find libth_lyradiff.so in build/lib

# Set LYRA_LIB_PATH so lyradiff loads the freshly built library
export LYRA_LIB_PATH="/path/to/libth_lyradiff.so"
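
As a quick sanity check (our own snippet, not part of the build), you can load the freshly built library through PyTorch; torch.ops.load_library raises an error if the build is broken or was compiled for the wrong SM:

import os
import torch

# Load the custom-op library that the build produced
lib = os.environ.get("LYRA_LIB_PATH", "build/lib/libth_lyradiff.so")
torch.ops.load_library(lib)
print("loaded", lib)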

Usage Example

In examples, we provide minimal scripts for running diffusion models with lyraDiff. For example, the script for FLUX.1-dev is as follows:

import torch
from diffusers import FluxPipeline
import os
from lyradiff.lyradiff_model.lyradiff_flux_transformer_model_v2 import LyraDiffFluxTransformer2DModelV2
from lyradiff.lyradiff_model.lyradiff_utils import LyraQuantLevel

model_path = "/path/to/lyraDiff-FLUX.1-dev/"
quant_level = LyraQuantLevel.NONE  # no quantization; lyraDiff also supports FP8/INT8/INT4 levels

# Create the lyraDiff transformer and load weights from the diffusers-format checkpoint
transformer_model = LyraDiffFluxTransformer2DModelV2(quant_level=quant_level)
transformer_model.load_from_diffusers_model(os.path.join(model_path, "transformer"))

# Build the diffusers pipeline without a transformer, then plug in the lyraDiff one
model = FluxPipeline.from_pretrained(model_path, transformer=None, torch_dtype=torch.bfloat16).to("cuda")
model.transformer = transformer_model

# Image Gen
image = model("Female furry Pixie with text hello world",
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=20,
    max_sequence_length=512,
    generator=torch.Generator("cuda").manual_seed(123)
).images[0]
image.save("flux.1-dev.png")
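
To verify the pixel-level consistency claim yourself, you can diff the lyraDiff output against an image from a stock diffusers run; the baseline filename below is a placeholder of ours:

import numpy as np
from PIL import Image

# Compare the lyraDiff output against a baseline generated by stock diffusers
a = np.asarray(Image.open("flux.1-dev.png"), dtype=np.int16)
b = np.asarray(Image.open("flux.1-dev_baseline.png"), dtype=np.int16)  # placeholder
print("max abs pixel diff:", np.abs(a - b).max())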

For more supported models and further details, please refer to the README files under examples.

RoadMap

This roadmap outlines our key development goals for March-April 2025. Contributions and feedback are always welcome!

Documentation (expected by April 14th)

  • Code Structure Documentation
  • Detailed Model API Documentation
  • Contribution Guide

UI

  • Add ComfyUI node for FLUX.1

Models

  • Support more plugins for FLUX.1
  • Better Performance for SVDQuant (Int4) Inference

Citation

@Misc{lyraDiff_2025,
  author =       {Yibo Lu and Sa Xiao and Kangjian Wu and Bin Wu and Mian Peng and Haoxiong Su and Qiwen Mao and Wenjiang Zhou},
  title =        {lyraDiff: An Out-of-the-box Acceleration Engine for Diffusion and DiT Models},
  howpublished = {\url{https://github.com/TMElyralab/lyraDiff}},
  year =         {2025}
}

Acknowledgments

lyraDiff is inspired by many open-source libraries, including (but not limited to) FasterTransformer, diffusers, S3Diff, TensorRT-LLM, cutlass, stable-fast, oneflow, deepcompressor, TensorRT-Model-Optimizer, flash-attention, and AITemplate.
