@gameofdimension (Contributor) commented Jan 4, 2026

Explicitly moving weights back to the CPU after computation is unnecessary; we can skip it, just as in the use_stream=True case. Because device-to-host copies are expensive, this change significantly improves inference speed when use_stream=False.
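The idea can be illustrated with a minimal sketch. This is not the diffusers implementation: `CachedCpuOffload` is a hypothetical name, it handles parameters only (no buffers), and it assumes the weights are never modified while on the device, which holds for inference. Under that assumption the CPU tensors stay valid, so offloading only needs to point each parameter back at its cached CPU tensor.

```python
import torch
from torch import nn


class CachedCpuOffload:
    """Hypothetical helper: onload a module, then offload it without a device-to-host copy."""

    def __init__(self, module: nn.Module, device: torch.device):
        self.device = device
        # Keep the original CPU tensors; inference does not change the weights,
        # so these stay valid for the whole run.
        self.cpu_tensors = [(p, p.data) for p in module.parameters()]

    def onload(self) -> None:
        for param, cpu_tensor in self.cpu_tensors:
            # Host-to-device copy: still required before every use.
            param.data = cpu_tensor.to(self.device)

    def offload(self) -> None:
        for param, cpu_tensor in self.cpu_tensors:
            # No copy: re-point the parameter at the cached CPU tensor.
            # The device copy is released once nothing references it.
            param.data = cpu_tensor


# Falls back to the CPU when CUDA is unavailable, so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
block = nn.Linear(8, 8)
reference = block.weight.detach().clone()
hook = CachedCpuOffload(block, device)

hook.onload()
with torch.no_grad():
    out = block(torch.randn(2, 8, device=device))
hook.offload()
```

The slow alternative would be `param.data = param.data.to("cpu")` in `offload`, which copies every weight back over PCIe after each block even though an identical CPU copy already exists.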


improvement

device: A100 40G
model: Qwen/Qwen-Image

            step latency
baseline    54s
this PR     4.8s
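To put the two figures in perspective, here is a quick back-of-the-envelope calculation. It assumes a run uses the 50 inference steps set in the test code and that step latency dominates total run time; text encoding and VAE decoding are ignored.

```python
baseline_step_s = 54.0  # step latency before this PR (figure reported above)
patched_step_s = 4.8    # step latency with this PR (figure reported above)
num_steps = 50          # assumption: num_inference_steps from the test code

speedup = baseline_step_s / patched_step_s
baseline_total_min = baseline_step_s * num_steps / 60
patched_total_min = patched_step_s * num_steps / 60

print(f"speedup: {speedup:.2f}x")
print(f"baseline denoising time: {baseline_total_min:.1f} min")
print(f"patched denoising time:  {patched_total_min:.1f} min")
```

Under these assumptions the denoising loop drops from roughly 45 minutes to roughly 4 minutes per image.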

test code

import time
from diffusers import QwenImagePipeline, QwenImageTransformer2DModel
import torch


def main():
    device = "cuda"
    model_name = "Qwen/Qwen-Image"
    torch_dtype = torch.bfloat16
    pipe: QwenImagePipeline = QwenImagePipeline.from_pretrained(model_name, torch_dtype=torch_dtype)
    # pipe.enable_model_cpu_offload(device=device)

    offload_type = "block_level"
    num_blocks_per_group = 1
    use_stream = False
    assert isinstance(pipe.transformer, QwenImageTransformer2DModel)
    pipe.transformer.enable_group_offload(
        onload_device=device,
        offload_device="cpu",
        offload_type=offload_type,
        num_blocks_per_group=num_blocks_per_group,
        use_stream=use_stream,
    )
    pipe.to(device=device)

    positive_magic = {
        "en": ", Ultra HD, 4K, cinematic composition.",  # for english prompt
        "zh": ", 超清,4K,电影级构图.",  # for chinese prompt
    }

    # Generate image
    prompt = """A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197" perfect Ultra HD"""

    negative_prompt = (
        "very bad quality"  # use an empty string if there is no specific concept to remove
    )

    # Generate with different aspect ratios
    aspect_ratios = {
        "1:1": (1328, 1328),
        "16:9": (1664, 928),
        "9:16": (928, 1664),
        "4:3": (1472, 1140),
        "3:4": (1140, 1472),
        "3:2": (1584, 1056),
        "2:3": (1056, 1584),
    }

    width, height = aspect_ratios["16:9"]
    generator = torch.Generator(device="cpu").manual_seed(42)

    image = pipe(
        prompt=prompt + positive_magic["en"],
        negative_prompt=negative_prompt,
        width=width,
        height=height,
        num_inference_steps=50,
        true_cfg_scale=4.0,
        generator=generator,
    ).images[0]

    image.save(f"example-{int(time.time())}.png")


if __name__ == "__main__":
    main()
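The script above does not show how step latency was measured. One possible way, sketched here as an assumption rather than the author's method, is a small timer passed to the pipeline's `callback_on_step_end` argument. `StepTimer` is a hypothetical helper; the `sync` argument is meant for something like `torch.cuda.synchronize` so that queued GPU work is included in each measurement.

```python
import time


class StepTimer:
    """Hypothetical helper: records wall-clock time between consecutive denoising steps."""

    def __init__(self, sync=None):
        self.sync = sync      # optional callable, e.g. torch.cuda.synchronize
        self.latencies = []   # seconds between consecutive step-end callbacks
        self._last = None

    def __call__(self, pipe, step, timestep, callback_kwargs):
        if self.sync is not None:
            self.sync()       # wait for pending GPU work before reading the clock
        now = time.perf_counter()
        if self._last is not None:
            self.latencies.append(now - self._last)
        self._last = now
        return callback_kwargs  # the callback must hand the kwargs dict back


# Stand-alone demonstration with a simulated three-step loop.
timer = StepTimer()
for step in range(3):
    time.sleep(0.01)          # stands in for one denoising step
    timer(None, step, None, {})

mean_latency = sum(timer.latencies) / len(timer.latencies)
```

With the real pipeline this would be used as `pipe(..., callback_on_step_end=StepTimer(torch.cuda.synchronize))`. Note that N steps yield N-1 intervals, since the first callback only sets the starting point.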


Refactor offloading logic to simplify memory management.
@gameofdimension (Contributor, Author)

@DN6 @yiyixuxu Could you please take a look at this change?
