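<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
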
# Compile and offload quantized models

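Optimizing a model often involves a trade-off between [inference speed](./fp16) and [memory usage](./memory). For example, [caching](./cache) can speed up inference, but it also increases memory consumption because it has to store the outputs of intermediate attention layers. A more balanced strategy combines model quantization, [torch.compile](./fp16#torchcompile), and one of the [offloading methods](./memory#offloading).
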
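> [!TIP]
> Refer to the [torch.compile](./fp16#torchcompile) guide to learn more about compilation and how it applies here. For example, regional compilation can significantly reduce compile time without giving up any of the speedup.
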
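For image generation, combining quantization with [model offloading](./memory#model-offloading) often gives the best trade-off between quality, speed, and memory. Group offloading is less effective for image generation because the data transfer usually can't be *fully* overlapped when the compute kernels finish faster than the transfer. The part that isn't overlapped shows up as communication overhead between the CPU and GPU.
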
For video generation, combining quantization with [group offloading](./memory#group-offloading) tends to work better because video models are more compute-bound.

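To make the reasoning above concrete, the sketch below models the time spent on one group of layers with and without overlapping the transfer and the compute. It is a toy model, not how Diffusers schedules work internally: the function names and the per-group timings are hypothetical and chosen only for illustration.

```python
def offload_step_time(compute_ms, transfer_ms, overlap):
    """Time for one layer group, given its compute and CPU-to-GPU transfer times."""
    if overlap:
        # With prefetching on a separate stream, the slower of the two dominates.
        return max(compute_ms, transfer_ms)
    # Without overlap, the transfer and the compute run back to back.
    return compute_ms + transfer_ms


def exposed_transfer(compute_ms, transfer_ms):
    """Transfer time that can't be hidden behind compute, even with prefetching."""
    return max(0.0, transfer_ms - compute_ms)


# Hypothetical per-group timings in milliseconds (assumptions, not measurements).
image_like = {"compute_ms": 2.0, "transfer_ms": 5.0}   # kernels finish before the copy does
video_like = {"compute_ms": 12.0, "transfer_ms": 5.0}  # compute-bound

print(exposed_transfer(**image_like))  # 3.0 ms of transfer remains visible
print(exposed_transfer(**video_like))  # 0.0 ms, the transfer is fully hidden
```

In this simplified picture, the image-like case still pays for part of every transfer, while the compute-bound video-like case hides it completely.
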
The table below compares combinations of optimization strategies and their effect on latency and memory usage for Flux.

| combination | latency (s) | memory usage (GB) |
|---|---|---|
| quantization | 32.602 | 14.9453 |
| quantization, torch.compile | 25.847 | 14.9448 |
| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |

<small>These results were benchmarked on Flux with an RTX 4090. The transformer and text_encoder components are quantized. If you're interested in evaluating your own model, refer to the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d).</small>

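As a quick way to read the table, the snippet below computes each row's change relative to the quantization-only baseline. The numbers are copied from the table, and `relative_change` is only an illustrative helper.

```python
# Values copied from the table above: (latency in seconds, memory usage in GB).
results = {
    "quantization": (32.602, 14.9453),
    "quantization, torch.compile": (25.847, 14.9448),
    "quantization, torch.compile, model CPU offloading": (32.312, 12.2369),
}


def relative_change(new, baseline):
    """Percentage change of `new` relative to `baseline` (negative means lower)."""
    return (new - baseline) / baseline * 100


base_latency, base_memory = results["quantization"]
for name, (latency, memory) in results.items():
    latency_change = relative_change(latency, base_latency)
    memory_change = relative_change(memory, base_memory)
    print(f"{name}: latency {latency_change:+.1f}%, memory {memory_change:+.1f}%")
```

With these numbers, compiling cuts latency by roughly 20.7% with almost no change in memory, while adding model CPU offloading lowers memory by roughly 18.1% and brings latency back to within about 1% of the baseline.
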
This guide shows you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.

```bash
pip install -U bitsandbytes
```

## Quantization and torch.compile

Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage, then [compile](./fp16#torchcompile) it to accelerate inference.

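For a rough sense of how much storage quantization saves, the back-of-the-envelope sketch below estimates weight memory from the parameter count and the bits per parameter. The 12-billion-parameter figure is a hypothetical example, and the estimate ignores quantization metadata, activations, and any components left unquantized, so treat the result as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a hypothetical transformer with 12 billion parameters.
num_params = 12e9

print(weight_memory_gb(num_params, 16))  # bfloat16 weights -> 24.0
print(weight_memory_gb(num_params, 4))   # 4-bit (for example NF4) weights -> 6.0
```
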
Set the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) option `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
pipeline("""
    cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
    highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
    """
).images[0]
```

## Quantization, torch.compile, and offloading

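In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves layers or whole model components from the CPU to the GPU only when they are needed for computation.
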
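The memory effect of offloading whole components can be sketched with a toy calculation. The example below compares peak GPU memory for weights when every component stays resident with the case where only one component is on the GPU at a time. The component sizes are hypothetical, activations are ignored, and the helper names are illustrative rather than part of Diffusers.

```python
def peak_memory_resident(component_sizes_gb):
    """Every component stays on the GPU for the whole run."""
    return sum(component_sizes_gb.values())


def peak_memory_one_at_a_time(component_sizes_gb):
    """Only the component currently computing is on the GPU; the rest wait on the CPU."""
    return max(component_sizes_gb.values())


# Hypothetical weight sizes in GB (assumptions, not measurements).
sizes = {"text_encoder": 1.0, "transformer": 6.0, "vae": 0.5}

print(peak_memory_resident(sizes))       # 7.5
print(peak_memory_one_at_a_time(sizes))  # 6.0
```

The saving comes at the cost of moving each component between devices, which is why offloading usually adds latency.
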
When offloading, raise the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

<hfoptions id="offloading">
<hfoption id="model CPU offloading">

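[Model CPU offloading](./memory#model-offloading) moves an individual pipeline component, like the transformer model, to the GPU when it is needed for computation. Otherwise, it is offloaded to the CPU.
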
```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.cache_size_limit = 1000
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# model CPU offloading
pipeline.enable_model_cpu_offload()

# compile
pipeline.transformer.compile()
pipeline(
    "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
).images[0]
```

</hfoption>
<hfoption id="group offloading">

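[Group offloading](./memory#group-offloading) moves the internal layers of an individual pipeline component, like the transformer model, to the GPU for computation and offloads them when they aren't needed. At the same time, it uses the [CUDA stream](./memory#cuda-stream) feature to prefetch the next layer for execution.

By overlapping computation and data transfer, it is faster than model CPU offloading while also saving memory.
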
```py
# pip install ftfy
import torch
from diffusers import AutoModel, DiffusionPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import UMT5EncoderModel

torch._dynamo.config.cache_size_limit = 1000
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder"],
)

text_encoder = UMT5EncoderModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16
)
pipeline = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# group offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True
)
pipeline.vae.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True
)
apply_group_offloading(
    pipeline.text_encoder,
    onload_device=onload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True
)

# compile
pipeline.transformer.compile()

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

</hfoption>
</hfoptions>