From 9d767916dac63a2b425936c49cd149c284000d05 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Tue, 9 Jan 2024 08:08:31 -0800
Subject: [PATCH] [docs] Fast diffusion (#6470)
* edits
* fix
* feedback
---
docs/source/en/tutorials/fast_diffusion.md | 190 +++++++++------------
1 file changed, 84 insertions(+), 106 deletions(-)
diff --git a/docs/source/en/tutorials/fast_diffusion.md b/docs/source/en/tutorials/fast_diffusion.md
index cc6db4af42..50ea349331 100644
--- a/docs/source/en/tutorials/fast_diffusion.md
+++ b/docs/source/en/tutorials/fast_diffusion.md
@@ -12,17 +12,11 @@ specific language governing permissions and limitations under the License.
# Accelerate inference of text-to-image diffusion models
-Diffusion models are known to be slower than their counter parts, GANs, because of the iterative and sequential reverse diffusion process. Recent works try to address limitation with:
+Diffusion models are slower than their GAN counterparts because of the iterative and sequential reverse diffusion process. There are several techniques that can address this limitation, such as progressive timestep distillation ([LCM LoRA](../using-diffusers/inference_with_lcm_lora)), model compression ([SSD-1B](https://huggingface.co/segmind/SSD-1B)), and reusing adjacent features of the denoiser ([DeepCache](../optimization/deepcache)).
-* progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora))
-* model compression (such as [SSD-1B](https://huggingface.co/segmind/SSD-1B))
-* reusing adjacent features of the denoiser (such as [DeepCache](https://github.com/horseee/DeepCache))
+However, you don't necessarily need to use these techniques to speed up inference. With PyTorch 2 alone, you can accelerate inference of text-to-image diffusion pipelines by up to 3x. This tutorial will show you how to progressively apply the optimizations found in PyTorch 2 to reduce inference latency. You'll use the [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) pipeline in this tutorial, but these techniques are applicable to other text-to-image diffusion pipelines too.
-In this tutorial, we focus on leveraging the power of PyTorch 2 to accelerate the inference latency of text-to-image diffusion pipeline, instead. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.
-
-## Setup
-
-Make sure you're on the latest version of `diffusers`:
+Make sure you're using the latest version of Diffusers:
```bash
pip install -U diffusers
@@ -34,15 +28,23 @@ Then upgrade the other required libraries too:
pip install -U transformers accelerate peft
```
-To benefit from the fastest kernels, use PyTorch nightly. You can find the installation instructions [here](https://pytorch.org/).
+Install [PyTorch nightly](https://pytorch.org/) to benefit from the latest and fastest kernels:
-To report the numbers shown below, we used an 80GB 400W A100 with its clock rate set to the maximum.
+```bash
+pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
+```
-_This tutorial doesn't present the benchmarking code and focuses on how to perform the optimizations, instead. For the full benchmarking code, refer to: [https://github.com/huggingface/diffusion-fast](https://github.com/huggingface/diffusion-fast)._
+
+The results reported below are from an 80GB 400W A100 with its clock rate set to the maximum.
+
+If you're interested in the full benchmarking code, take a look at [huggingface/diffusion-fast](https://github.com/huggingface/diffusion-fast).
+
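The latency numbers in this tutorial were measured with a dedicated benchmarking harness. As a rough, minimal sketch of how you could time a pipeline yourself (this is not the diffusion-fast benchmarking code), you can use CUDA events with a couple of warmup calls:

```python
import torch

def measure_latency(pipe, prompt, num_inference_steps=30, warmup=2, runs=5):
    # Warm up so one-time costs (allocations, compilation) don't skew the numbers.
    for _ in range(warmup):
        pipe(prompt, num_inference_steps=num_inference_steps)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(runs):
        pipe(prompt, num_inference_steps=num_inference_steps)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / runs / 1000  # average seconds per call
```
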
## Baseline
-Let's start with a baseline. Disable the use of a reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0):
+Let's start with a baseline. Disable reduced precision and the [`scaled_dot_product_attention` (SDPA)](../optimization/torch2.0#scaled-dot-product-attention) function, which is automatically used by Diffusers:
```python
from diffusers import StableDiffusionXLPipeline
@@ -52,7 +54,7 @@ pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")
-# Run the attention ops without efficiency.
+# Run the attention ops without SDPA.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()
@@ -60,27 +62,29 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```
-This takes 7.36 seconds:
+This default setup takes 7.36 seconds.
+
-## Running inference in bfloat16
+## bfloat16
-Enable the first optimization: use a reduced precision to run the inference.
+Enable the first optimization: reduced precision, or more specifically, bfloat16. There are several benefits to using reduced precision:
+
+* Using a reduced numerical precision (such as float16 or bfloat16) for inference doesn’t affect the generation quality but significantly improves latency.
+* The benefits of using bfloat16 compared to float16 are hardware dependent, but modern GPUs tend to favor bfloat16.
+* bfloat16 is much more resilient than float16 when used with quantization, although more recent versions of the quantization library we used ([torchao](https://github.com/pytorch-labs/ao)) don't have numerical issues with float16.
```python
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
-# Run the attention ops without efficiency.
+# Run the attention ops without SDPA.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()
@@ -88,51 +92,45 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```
-bfloat16 reduces the latency from 7.36 seconds to 4.63 seconds:
+bfloat16 reduces the latency from 7.36 seconds to 4.63 seconds.
+
-_(We later ran the experiments in float16 and found out that the recent versions of torchao do not incur numerical problems from float16.)_
-**Why bfloat16?**
+In later experiments with float16, we found that recent versions of torchao do not incur numerical problems from float16.
-* Using a reduced numerical precision (such as float16, bfloat16) to run inference doesn’t affect the generation quality but significantly improves latency.
-* The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.
-* Furthermore, in our experiments, we bfloat16 to be much more resilient when used with quantization in comparison to float16.
+
-We have a [dedicated guide](../optimization/fp16) for running inference in a reduced precision.
+Take a look at the [Speed up inference](../optimization/fp16) guide to learn more about running inference with reduced precision.
-## Running attention efficiently
+## SDPA
-Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0), we can run them efficiently.
+Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0#scaled-dot-product-attention) function, they run much more efficiently. This function is used by default in Diffusers, so you don't need to make any changes to the code.
```python
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```
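Under the hood, Diffusers' attention processors call PyTorch's `scaled_dot_product_attention` function. As a standalone illustration of the primitive (the shapes below are arbitrary, not the pipeline's actual tensors):

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim)
query = key = value = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Dispatches to fused kernels (FlashAttention, memory-efficient attention) when available.
out = F.scaled_dot_product_attention(query, key, value)
```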
-`scaled_dot_product_attention` improves the latency from 4.63 seconds to 3.31 seconds.
+Scaled dot product attention improves the latency from 4.63 seconds to 3.31 seconds.
+
-## Use faster kernels with torch.compile
+## torch.compile
-Compile the UNet and the VAE to benefit from the faster kernels. First, configure a few compiler flags:
+PyTorch 2 includes `torch.compile`, which uses fast and optimized kernels. In Diffusers, the UNet and VAE are usually compiled because these are the most compute-intensive modules. First, configure a few compiler flags (refer to the [full list](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py) for more options):
```python
from diffusers import StableDiffusionXLPipeline
@@ -144,16 +142,14 @@ torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
```
-For the full list of compiler flags, refer to [this file](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py).
-
-It is also important to change the memory layout of the UNet and the VAE to “channels_last” when compiling them. This ensures maximum speed:
+It is also important to change the UNet and VAE's memory layout to "channels_last" when compiling them to ensure maximum speed.
```python
pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
```
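If you want to verify the layout change took effect, you can inspect the memory format of a 4D weight, for example the UNet's output convolution (an optional sanity check, not required for the optimization):

```python
# Should print True after the channels_last conversion above.
print(pipe.unet.conv_out.weight.is_contiguous(memory_format=torch.channels_last))
```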
-Then, compile and perform inference:
+Now compile and perform inference:
```python
# Compile the UNet and VAE.
@@ -162,59 +158,65 @@ pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-# First call to `pipe` will be slow, subsequent ones will be faster.
+# First call to `pipe` is slow, subsequent ones are faster.
image = pipe(prompt, num_inference_steps=30).images[0]
```
-`torch.compile` offers different backends and modes. As we’re aiming for maximum inference speed, we opt for the inductor backend using the “max-autotune”. “max-autotune” uses CUDA graphs and optimizes the compilation graph specifically for latency. Using CUDA graphs greatly reduces the overhead of launching GPU operations. It saves time by using a mechanism to launch multiple GPU operations through a single CPU operation.
+`torch.compile` offers different backends and modes. For maximum inference speed, use the inductor backend with the "max-autotune" mode, which uses CUDA graphs and optimizes the compilation graph specifically for latency. CUDA graphs greatly reduce the overhead of launching GPU operations because multiple GPU operations are launched through a single CPU operation.
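In code, compiling the UNet and VAE with these settings looks like this (a sketch of the calls assumed throughout the rest of this tutorial):

```python
# "max-autotune" enables CUDA graphs and tunes the generated kernels for latency;
# fullgraph=True requires a single graph with no graph breaks (discussed below).
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)
```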
-Specifying fullgraph to be True ensures that there are no graph breaks in the underlying model, ensuring the fullest potential of `torch.compile`.
-
-Using SDPA attention and compiling both the UNet and VAE reduces the latency from 3.31 seconds to 2.54 seconds.
+Using SDPA attention and compiling both the UNet and VAE cuts the latency from 3.31 seconds to 2.54 seconds.
+
-## Combine the projection matrices of attention
+### Prevent graph breaks
-Both the UNet and the VAE used in SDXL make use of Transformer-like blocks. A Transformer block consists of attention blocks and feed-forward blocks.
+Specifying `fullgraph=True` ensures there are no graph breaks in the underlying model to take full advantage of `torch.compile` without any performance degradation. For the UNet and VAE, this means changing how you access the return variables.
-In an attention block, the input is projected into three sub-spaces using three different projection matrices – Q, K, and V. In the naive implementation, these projections are performed separately on the input. But we can horizontally combine the projection matrices into a single matrix and perform the projection in one shot. This increases the size of the matmuls of the input projections and improves the impact of quantization (to be discussed next).
+```diff
+- latents = unet(
+- latents, timestep=timestep, encoder_hidden_states=prompt_embeds
+-).sample
-Enabling this kind of computation in Diffusers just takes a single line of code:
++ latents = unet(
++ latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False
++)[0]
+```
+
+### Remove GPU sync after compilation
+
+During the iterative reverse diffusion process, the `step()` function is [called](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1228) on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside `step()`, the `sigmas` variable is [indexed](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/schedulers/scheduling_euler_discrete.py#L476); if `sigmas` is placed on the GPU, the indexing causes a communication sync between the CPU and GPU. This introduces latency, and it becomes more noticeable once the denoiser has already been compiled.
+
+But if the `sigmas` array always [stays on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240), the CPU and GPU sync doesn't occur and you don't incur any extra latency. In general, CPU and GPU communication syncs should be avoided or kept to a bare minimum because they can impact inference latency.
+
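As a small, hypothetical illustration of the difference (this is not the scheduler's actual code), compare reading a value from a GPU-resident array with reading it from a CPU-resident one:

```python
import torch

sigmas_gpu = torch.linspace(1.0, 0.0, steps=31, device="cuda")
sigma = float(sigmas_gpu[5])  # reading the value back on the CPU forces a GPU-to-CPU sync

sigmas_cpu = torch.linspace(1.0, 0.0, steps=31)
sigma = float(sigmas_cpu[5])  # stays on the CPU, no device communication
```
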
+## Combine the attention block's projection matrices
+
+The UNet and VAE in SDXL use Transformer-like blocks, which consist of attention blocks and feed-forward blocks.
+
+In an attention block, the input is projected into three sub-spaces using three different projection matrices – Q, K, and V. These projections are normally performed separately on the input, but you can horizontally combine the projection matrices into a single matrix and perform the projection in one step. This increases the size of the matrix multiplications of the input projections and improves the impact of quantization.
+
+You can combine the projection matrices with just a single line of code:
```python
pipe.fuse_qkv_projections()
```
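To build some intuition for what the fusion does, here is a toy sketch with random weights (not the pipeline's actual projection layers) showing that three separate projections and one fused projection give the same result:

```python
import torch

d = 64
x = torch.randn(1, 16, d)                      # (batch, tokens, dim)
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

# Three separate matmuls.
q, k, v = x @ wq, x @ wk, x @ wv

# One larger matmul with the horizontally concatenated weights, then split.
w_qkv = torch.cat([wq, wk, wv], dim=1)         # (d, 3d)
q2, k2, v2 = (x @ w_qkv).chunk(3, dim=-1)

assert torch.allclose(q, q2, atol=1e-5) and torch.allclose(v, v2, atol=1e-5)
```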
-It provides a minor boost from 2.54 seconds to 2.52 seconds.
+This provides a minor improvement from 2.54 seconds to 2.52 seconds.
+
-Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.
+Support for [`~StableDiffusionXLPipeline.fuse_qkv_projections`] is limited and experimental. It's not available for many non-Stable Diffusion pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to this [PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to enable it for other pipelines.
## Dynamic quantization
-Aapply [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to both the UNet and the VAE. This is because quantization adds additional conversion overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization). If the matmuls are too small, these techniques may degrade performance.
-
-
-
-Through experimentation, we found that certain linear layers in the UNet and the VAE don’t benefit from dynamic int8 quantization. You can check out the full code for filtering those layers [here](https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16) (referred to as `dynamic_quant_filter_fn` below).
-
-
-
-You will leverage the ultra-lightweight pure PyTorch library [torchao](https://github.com/pytorch-labs/ao) (commit SHA: 54bcd5a10d0abbe7b0c045052029257099f83fd9) to use its user-friendly APIs for quantization.
+You can also use the ultra-lightweight PyTorch quantization library, [torchao](https://github.com/pytorch-labs/ao) (commit SHA `54bcd5a10d0abbe7b0c045052029257099f83fd9`), to apply [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE. Quantization adds conversion overhead to the model, which is hopefully made up for by the faster matmuls from dynamic quantization. If the matmuls are too small, these techniques may degrade performance.
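One way to pin torchao to that commit is to install it from source (this assumes a standard source install works at that revision):

```bash
pip install git+https://github.com/pytorch-labs/ao@54bcd5a10d0abbe7b0c045052029257099f83fd9
```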
First, configure all the compiler flags:
@@ -231,7 +233,7 @@ torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True
```
-Define the filtering functions:
+Certain linear layers in the UNet and VAE don’t benefit from dynamic int8 quantization. You can filter out those layers with the [`dynamic_quant_filter_fn`](https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16) shown below.
```python
def dynamic_quant_filter_fn(mod, *args):
@@ -269,12 +271,12 @@ def conv_filter_fn(mod, *args):
)
```
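As a simplified sketch of the kind of checks these filters perform (the names and thresholds below are illustrative, not the exact ones from diffusion-fast):

```python
import torch

def simple_dynamic_quant_filter_fn(mod, *args):
    # Only quantize reasonably large linear layers; tiny matmuls don't benefit.
    return isinstance(mod, torch.nn.Linear) and mod.in_features > 16

def simple_conv_filter_fn(mod, *args):
    # Only 1x1 (pointwise) convolutions can be swapped for linear layers.
    return isinstance(mod, torch.nn.Conv2d) and mod.kernel_size == (1, 1)
```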
-Then apply all the optimizations discussed so far:
+Finally, apply all the optimizations discussed so far:
```python
# SDPA + bfloat16.
pipe = StableDiffusionXLPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
# Combine attention projection matrices.
@@ -285,7 +287,7 @@ pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
```
-Since this quantization support is limited to linear layers only, we also turn suitable pointwise convolution layers into linear layers to maximize the benefit.
+Since dynamic quantization is limited to the linear layers, convert the appropriate pointwise convolution layers into linear layers to maximize its benefit.
```python
from torchao import swap_conv2d_1x1_to_linear
@@ -315,30 +317,6 @@ image = pipe(prompt, num_inference_steps=30).images[0]
Applying dynamic quantization improves the latency from 2.52 seconds to 2.43 seconds.
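For reference, applying the quantization itself looks roughly like this. This is a sketch based on the torchao commit pinned above; treat `apply_dynamic_quant` and the call order as assumptions and refer to diffusion-fast for the exact code:

```python
from torchao import apply_dynamic_quant, swap_conv2d_1x1_to_linear

# Swap eligible pointwise convolutions for linear layers, then quantize the linear layers.
swap_conv2d_1x1_to_linear(pipe.unet, conv_filter_fn)
swap_conv2d_1x1_to_linear(pipe.vae, conv_filter_fn)
apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)
```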
-## Misc
-
-### No graph breaks during torch.compile
-
-Ensuring that the underlying model/method can be fully compiled is crucial for performance (torch.compile with fullgraph=True). This means having no graph breaks. We did this for the UNet and VAE by changing how we access the returning variables. Consider the following example:
-
-```diff
-- latents = unet(
-- latents, timestep=timestep, encoder_hidden_states=prompt_embeds
--).sample
-
-+ latents = unet(
-+ latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False
-+)[0]
-```
-
-### Getting rid of GPU syncs after compilation
-
-During the iterative reverse diffusion process, we [call](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1228) `step()` on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside `step()`, the `sigmas` variable is [indexed](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/schedulers/scheduling_euler_discrete.py#L476). If the `sigmas` array is placed on the GPU, indexing causes a communication sync between the CPU and GPU. This causes a latency, and it becomes more evident when the denoiser has already been compiled.
-
-But if the `sigmas` array always stays on the CPU (refer to [this line](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240)), this sync doesn’t take place, hence improved latency. In general, any CPU <-> GPU communication sync should be none or be kept to a bare minimum as it can impact inference latency.