[docs] CP (#12331)
* init * feedback * feedback * feedback * feedback * feedback * feedback
@@ -70,8 +70,6 @@
  title: Reduce memory usage
- local: optimization/speed-memory-optims
  title: Compiling and offloading quantized models
- local: api/parallel
  title: Parallel inference
- title: Community optimizations
  sections:
  - local: optimization/pruna
@@ -282,6 +280,8 @@
  title: Outputs
- local: api/quantization
  title: Quantization
- local: api/parallel
  title: Parallel inference
- title: Modular
  sections:
  - local: api/modular_diffusers/pipeline
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

# Parallelism

Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times. Refer to the [Distributed inference](../training/distributed_inference) guide to learn more.

## ParallelConfig

@@ -226,8 +226,64 @@ with torch.no_grad():
image[0].save("split_transformer.png")
```

## Resources
By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs.

- Take a look at this [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for a minimal example of distributed inference with Accelerate.
- For more details, check out Accelerate's [Distributed inference](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
- The `device_map` argument assigns models or an entire pipeline to devices. Refer to the [device placement](../using-diffusers/loading#device-placement) docs for more information.
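
The sketch below is a minimal, illustrative example of the selective loading and sharding pattern described above. It assumes the Flux checkpoint used earlier in this guide; the `flush` helper, prompt, and checkpoint name are assumptions for illustration, not part of the example above.

```py
# Minimal sketch: load only what each stage needs, free it afterwards,
# and shard the largest model across the available GPUs.
import gc
import torch
from diffusers import AutoModel, FluxPipeline

def flush():
    gc.collect()
    torch.cuda.empty_cache()

# Stage 1: text encoding only (no transformer or VAE loaded)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=None, vae=None, torch_dtype=torch.bfloat16
).to("cuda")
with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt="a photo of a dog", prompt_2=None, max_sequence_length=512
    )

# Free the text encoders before loading the next stage
del pipeline
flush()

# Stage 2: shard the large transformer across all visible GPUs
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", device_map="auto", torch_dtype=torch.bfloat16
)
```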

## Context parallelism

[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) splits input sequences across multiple GPUs to reduce memory usage. Each GPU processes its own slice of the sequence.

Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized attention backend. Refer to this [table](../optimization/attention_backends#available-backends) for a complete list of available backends.
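
For example, assuming a pipeline whose transformer has already been loaded, switching to the FlashAttention backend looks like this (the same call appears in the example below).

```py
# Assumes FlashAttention is installed; see the linked table for other backend names.
pipeline.transformer.set_attention_backend("flash")
```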

### Ring Attention

With [Ring Attention](https://huggingface.co/papers/2310.01889), key (K) and value (V) representations are exchanged between devices so that each sequence split sees every other token's K/V. Each GPU computes attention for its local K/V slice and passes it to the next GPU in the ring, overlapping communication with computation. No single GPU holds the full sequence, which keeps memory usage low.

Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transformer model. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
```py
import torch
from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig

try:
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    transformer = AutoModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer", torch_dtype=torch.bfloat16, parallel_config=ContextParallelConfig(ring_degree=2))
    pipeline = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", transformer=transformer, torch_dtype=torch.bfloat16, device_map="cuda")
    pipeline.transformer.set_attention_backend("flash")

    prompt = """
    cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
    highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
    """

    # Must specify generator so all ranks start with same latents (or pass your own)
    generator = torch.Generator().manual_seed(42)
    image = pipeline(prompt, num_inference_steps=50, generator=generator).images[0]

    if rank == 0:
        image.save("output.png")

except Exception as e:
    print(f"An error occurred: {e}")
    torch.distributed.breakpoint()
    raise

finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```
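
A script like this is typically run with a distributed launcher so that one process starts per GPU, for example `torchrun --nproc_per_node=2 ring_example.py` (the script name here is illustrative).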

### Ulysses Attention

[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* communication (every device exchanges data with every other device). Each GPU ends up with all tokens for only a subset of attention heads. Each GPU computes attention locally over all tokens for its heads, then performs another all-to-all to regroup results by tokens for the next layer.

[`ContextParallelConfig`] supports Ulysses Attention through the `ulysses_degree` argument. This determines how many devices to use for Ulysses Attention.

Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].
```py
pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
```
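
For a fuller picture, here is a hedged end-to-end sketch that mirrors the Ring Attention example above but applies the config after loading via [`~ModelMixin.enable_parallelism`] with `ulysses_degree`. The checkpoint, prompt, and settings are carried over from that example purely for illustration.

```py
import torch
from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig

try:
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    transformer = AutoModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer", torch_dtype=torch.bfloat16)
    pipeline = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", transformer=transformer, torch_dtype=torch.bfloat16, device_map="cuda")
    pipeline.transformer.set_attention_backend("flash")

    # Split the sequence across 2 devices with Ulysses-style all-to-all communication
    pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))

    # Same seed on every rank so all ranks start from the same latents
    generator = torch.Generator().manual_seed(42)
    image = pipeline("a cat sipping a margarita in a pool in Palm Springs", num_inference_steps=50, generator=generator).images[0]

    if rank == 0:
        image.save("output_ulysses.png")

finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```

As with the Ring Attention script, launch it with a distributed launcher such as `torchrun --nproc_per_node=2`.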