Diffusion models are known to be slower than their counterparts, GANs, because of the iterative and sequential reverse diffusion process. Recent works try to address this limitation with:

* progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora))
* model compression (such as [SSD-1B](https://huggingface.co/segmind/SSD-1B))
* reusing adjacent features of the denoiser (such as [DeepCache](https://github.com/horseee/DeepCache))

In this tutorial, we focus instead on leveraging the power of PyTorch 2 to reduce the inference latency of a text-to-image diffusion pipeline. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.

## Setup
Make sure you're on the latest version of `diffusers`:

```bash
pip install -U diffusers
```
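As a quick sanity check, you can print the installed version to confirm the upgrade took effect:

```python
import diffusers

# The version string of the currently installed diffusers package.
print(diffusers.__version__)
```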
_This tutorial doesn't present the benchmarking code and focuses on how to perform the optimizations, instead._

## Baseline

Let's start with a baseline. Disable the use of reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0):

```python
from diffusers import StableDiffusionXLPipeline

# Load the pipeline in the default full precision (float32) and move it to the GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")
```
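To make the baseline concrete, here is a minimal sketch of an unoptimized generation. The prompt and step count are illustrative choices, and the calls to `set_default_attn_processor`, which switch back to the vanilla attention processors, assume the `pipe` object loaded above:

```python
# Fall back to the vanilla (non-SDPA) attention processors so the baseline
# doesn't benefit from PyTorch 2's scaled_dot_product_attention.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

# Illustrative prompt and step count for timing a single generation.
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```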
_(We later ran the experiments in float16 and found out that the recent versions of `torchao` do not incur numerical problems from float16.)_
* The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.
* Furthermore, in our experiments, we found bfloat16 to be much more resilient when used with quantization in comparison to float16.

We have a [dedicated guide](../optimization/fp16) for running inference in reduced precision.
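As a minimal sketch of what this looks like for the pipeline used in this tutorial, reduced precision only requires passing `torch_dtype` when loading; the checkpoint is the same SDXL base model as in the baseline:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the same SDXL checkpoint, but keep weights and computation in bfloat16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
```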
## Running attention efficiently
Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0), we can run them efficiently.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# With PyTorch 2.x installed, diffusers dispatches attention to
# scaled_dot_product_attention by default, so loading the pipeline is enough.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
```
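SDPA is picked up automatically, but if you want to opt in explicitly, for example after the baseline switched to the vanilla processors, a small sketch using `AttnProcessor2_0`, the SDPA-backed processor in diffusers, looks like this:

```python
from diffusers.models.attention_processor import AttnProcessor2_0

# Explicitly select the SDPA-backed attention processor; this is already the
# default on PyTorch 2.x, so the call is only needed after overriding it.
pipe.unet.set_attn_processor(AttnProcessor2_0())
pipe.vae.set_attn_processor(AttnProcessor2_0())
```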
It provides a minor boost from 2.54 seconds to 2.52 seconds.
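For reference, the fusion is a single call on the pipeline object (a minimal sketch, assuming the `pipe` loaded in the snippets above):

```python
# Fuse the attention QKV projection matrices into a single, larger projection.
pipe.fuse_qkv_projections()
```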
<Tip warning={true}>

Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.

</Tip>