From 7f1a42d5eea12ae818091c9ba8c1573fdb5ecd84 Mon Sep 17 00:00:00 2001 From: stevhliu Date: Mon, 30 Jun 2025 10:01:05 -0700 Subject: [PATCH 1/3] add blog post --- docs/source/en/optimization/fp16.md | 33 +++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/docs/source/en/optimization/fp16.md b/docs/source/en/optimization/fp16.md index 734f63e68d23..45a35e784e74 100644 --- a/docs/source/en/optimization/fp16.md +++ b/docs/source/en/optimization/fp16.md @@ -174,11 +174,18 @@ Feel free to open an issue if dynamic compilation doesn't work as expected for a ### Regional compilation +[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by only compiling the *small and frequently-repeated block(s)* of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. +For many diffusion architectures, this delivers the same runtime speed-ups as full-graph compilation and reduces compile time by 8–10x. -[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by compiling **only the small, frequently-repeated block(s)** of a model, typically a Transformer layer, enabling reuse of compiled artifacts for every subsequent occurrence. -For many diffusion architectures this delivers the *same* runtime speed-ups as full-graph compilation yet cuts compile time by **8–10 ×**. +There are two implementations of regional compilation. -To make this effortless, [`ModelMixin`] exposes [`ModelMixin.compile_repeated_blocks`] API, a helper that wraps `torch.compile` around any sub-modules you designate as repeatable: +- The Diffusers version, [`~ModelMixin.compile_repeated_blocks`], is more explicit and is easier to customize. +- The Accelerate version, [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78), automatically selects which regions to compile and is less customizable. It is ideal for fast experiments. + + + + +Use the [`~ModelMixin.compile_repeated_blocks`] method, a helper that wraps `torch.compile`, on any component such as the transformer model as shown below. ```py # pip install -U diffusers @@ -194,19 +201,20 @@ pipe = StableDiffusionXLPipeline.from_pretrained( pipe.unet.compile_repeated_blocks(fullgraph=True) ``` -To enable a new model with regional compilation, add a `_repeated_blocks` attribute to your model class containing the class names (as strings) of the blocks you want compiled: - +To enable regional compilation for a new model, add a `_repeated_blocks` attribute to a model class containing the class names (as strings) of the blocks you want to compile. ```py class MyUNet(ModelMixin): _repeated_blocks = ("Transformer2DModel",) # ← compiled by default ``` -For more examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705). - -**Relation to Accelerate compile_regions** There is also a separate API in [accelerate](https://huggingface.co/docs/accelerate/index) - [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78). It takes a fully automatic approach: it walks the module, picks candidate blocks, then compiles the remaining graph separately. 
That hands-off experience is handy for quick experiments, but it also leaves fewer knobs when you want to fine-tune which blocks are compiled or adjust compilation flags. +> [!TIP] +> For more examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705). + + +There is also a [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method in [Accelerate](https://huggingface.co/docs/accelerate/index) that automatically selects candidate blocks in a model to compile. The remaining graph is compiled separately. This is useful for quick experiments because there aren't as many options for you to set which blocks to compile or adjust compilation flags. ```py # pip install -U accelerate @@ -219,8 +227,11 @@ pipeline = StableDiffusionXLPipeline.from_pretrained( ).to("cuda") pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True) ``` -`compile_repeated_blocks`, by contrast, is intentionally explicit. You list the repeated blocks once (via `_repeated_blocks`) and the helper compiles exactly those, nothing more. In practice this small dose of control hits a sweet spot for diffusion models: predictable behavior, easy reasoning about cache reuse, and still a one-liner for users. +[`~ModelMixin.compile_repeated_blocks`] is intentionally explicit. List the blocks to repeat in `_repeated_blocks` and the helper only compiles those blocks. It offers predictable behavior and easy reasoning about cache reuse in one line of code. + + + ### Graph breaks @@ -296,3 +307,7 @@ An input is projected into three subspaces, represented by the projection matric ```py pipeline.fuse_qkv_projections() ``` + +## Resources + +Read the [Presenting Flux Fast: Making Flux go brrr on H100s](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/) blog post to learn more about how you can combine all of these optimizations with [TorchInductor](https://docs.pytorch.org/docs/stable/torch.compiler.html) and [AOTInductor](https://docs.pytorch.org/docs/stable/torch.compiler_aot_inductor.html) for a ~2.5x speedup. \ No newline at end of file From 46e392c989f598cd92eb38c5ab1818a43b1fc812 Mon Sep 17 00:00:00 2001 From: stevhliu Date: Tue, 1 Jul 2025 12:56:25 -0700 Subject: [PATCH 2/3] feedback --- docs/source/en/optimization/fp16.md | 28 ++++++++-------------------- 1 file changed, 8 insertions(+), 20 deletions(-) diff --git a/docs/source/en/optimization/fp16.md b/docs/source/en/optimization/fp16.md index 45a35e784e74..edbb14fae3d4 100644 --- a/docs/source/en/optimization/fp16.md +++ b/docs/source/en/optimization/fp16.md @@ -175,15 +175,7 @@ Feel free to open an issue if dynamic compilation doesn't work as expected for a ### Regional compilation [Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by only compiling the *small and frequently-repeated block(s)* of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. -For many diffusion architectures, this delivers the same runtime speed-ups as full-graph compilation and reduces compile time by 8–10x. - -There are two implementations of regional compilation. - -- The Diffusers version, [`~ModelMixin.compile_repeated_blocks`], is more explicit and is easier to customize. 
-- The Accelerate version, [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78), automatically selects which regions to compile and is less customizable. It is ideal for fast experiments. - - - +For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Use the [`~ModelMixin.compile_repeated_blocks`] method, a helper that wraps `torch.compile`, on any component such as the transformer model as shown below. @@ -192,13 +184,13 @@ Use the [`~ModelMixin.compile_repeated_blocks`] method, a helper that wraps `tor import torch from diffusers import StableDiffusionXLPipeline -pipe = StableDiffusionXLPipeline.from_pretrained( +pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, ).to("cuda") -# Compile only the repeated Transformer layers inside the UNet -pipe.unet.compile_repeated_blocks(fullgraph=True) +# compile only the repeated transformer layers inside the UNet +pipeline.unet.compile_repeated_blocks(fullgraph=True) ``` To enable regional compilation for a new model, add a `_repeated_blocks` attribute to a model class containing the class names (as strings) of the blocks you want to compile. @@ -209,10 +201,7 @@ class MyUNet(ModelMixin): ``` > [!TIP] -> For more examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705). - - - +> For more regional compilation examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705). There is also a [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method in [Accelerate](https://huggingface.co/docs/accelerate/index) that automatically selects candidate blocks in a model to compile. The remaining graph is compiled separately. This is useful for quick experiments because there aren't as many options for you to set which blocks to compile or adjust compilation flags. @@ -230,9 +219,6 @@ pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph [`~ModelMixin.compile_repeated_blocks`] is intentionally explicit. List the blocks to repeat in `_repeated_blocks` and the helper only compiles those blocks. It offers predictable behavior and easy reasoning about cache reuse in one line of code. - - - ### Graph breaks It is important to specify `fullgraph=True` in torch.compile to ensure there are no graph breaks in the underlying model. This allows you to take advantage of torch.compile without any performance degradation. For the UNet and VAE, this changes how you access the return variables. @@ -310,4 +296,6 @@ pipeline.fuse_qkv_projections() ## Resources -Read the [Presenting Flux Fast: Making Flux go brrr on H100s](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/) blog post to learn more about how you can combine all of these optimizations with [TorchInductor](https://docs.pytorch.org/docs/stable/torch.compiler.html) and [AOTInductor](https://docs.pytorch.org/docs/stable/torch.compiler_aot_inductor.html) for a ~2.5x speedup. 
\ No newline at end of file
+- Read the [Presenting Flux Fast: Making Flux go brrr on H100s](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/) blog post to learn more about how you can combine all of these optimizations with [TorchInductor](https://docs.pytorch.org/docs/stable/torch.compiler.html) and [AOTInductor](https://docs.pytorch.org/docs/stable/torch.compiler_aot_inductor.html) for a ~2.5x speedup using recipes from [flux-fast](https://github.com/huggingface/flux-fast).
+
+  These recipes support AMD hardware and [Flux.1 Kontext Dev](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev).
\ No newline at end of file

From a115aa903608f82e0193035c985bd92d8de77223 Mon Sep 17 00:00:00 2001
From: stevhliu
Date: Fri, 11 Jul 2025 10:10:27 -0700
Subject: [PATCH 3/3] feedback

---
 docs/source/en/optimization/speed-memory-optims.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/speed-memory-optims.md b/docs/source/en/optimization/speed-memory-optims.md
index 4a76d272cf90..f43e60bc7489 100644
--- a/docs/source/en/optimization/speed-memory-optims.md
+++ b/docs/source/en/optimization/speed-memory-optims.md
@@ -14,6 +14,9 @@ specific language governing permissions and limitations under the License.

 Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading).

+> [!TIP]
+> Check the [torch.compile](./fp16#torchcompile) guide to learn more about compilation and how it can be applied here. For example, regional compilation can significantly reduce compilation time without giving up any speedups.
+
 For image generation, combining quantization and [model offloading](./memory#model-offloading) can often give the best trade-off between quality, speed, and memory. Group offloading is not as effective for image generation because it is usually not possible to *fully* overlap data transfer if the compute kernel finishes faster. This results in some communication overhead between the CPU and GPU.

 For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound.
@@ -25,7 +28,7 @@ The table below provides a comparison of optimization strategy combinations and
 | quantization | 32.602 | 14.9453 |
 | quantization, torch.compile | 25.847 | 14.9448 |
 | quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
-These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the if you're interested in evaluating your own model.
+These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) if you're interested in evaluating your own model.

 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.
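Below is a minimal, illustrative sketch of the combined strategy the final patch describes (bitsandbytes quantization, regional compilation, and model CPU offloading). It is not part of the patches above; the `PipelineQuantizationConfig` usage, the FLUX.1-dev checkpoint, and the assumption that the Flux transformer declares `_repeated_blocks` are illustrative choices, so adapt them to your own pipeline.

```py
# sketch: combine quantization, regional compilation, and model CPU offloading
# assumes diffusers' pipeline-level PipelineQuantizationConfig and a transformer
# that defines `_repeated_blocks` for compile_repeated_blocks
# pip install -U diffusers transformers accelerate bitsandbytes
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# quantize the transformer and second text encoder with bitsandbytes 4-bit
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# model CPU offloading moves whole components off the GPU when they are idle
pipeline.enable_model_cpu_offload()

# regional compilation: compile only the repeated transformer blocks
pipeline.transformer.compile_repeated_blocks(fullgraph=True)

image = pipeline(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux.png")
```

As noted in the final patch, compiling a quantized model with `fullgraph=True` may require a recent PyTorch build and the latest bitsandbytes release to avoid graph breaks.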