
update doc

This commit is contained in:
yiyixuxu
2025-06-25 08:52:15 +02:00
parent 48e4ff5c05
commit e49413d87d
3 changed files with 873 additions and 215 deletions


@@ -89,8 +89,8 @@
- sections:
- local: modular_diffusers/developer_guide
title: Developer Guide
- local: modular_diffusers/overview
title: Overview
- local: modular_diffusers/quicktour
title: Quicktour
- sections:
- local: using-diffusers/cogvideox
title: CogVideoX


@@ -1,213 +0,0 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Overview
The Modular Diffusers Framework consists of three main components:
## ModularPipelineBlocks
Pipeline blocks are the fundamental building blocks of the Modular Diffusers system. All pipeline blocks inherit from the base class `ModularPipelineBlocks`, including:
- [`PipelineBlock`](TODO)
- [`SequentialPipelineBlocks`](TODO)
- [`LoopSequentialPipelineBlocks`](TODO)
- [`AutoPipelineBlocks`](TODO)
To use a `ModularPipelineBlocks` that is officially supported in 🧨 Diffusers:
```py
>>> from diffusers.modular_pipelines.stable_diffusion_xl import StableDiffusionXLTextEncoderStep
>>> text_encoder_block = StableDiffusionXLTextEncoderStep()
```
Each [`ModularPipelineBlocks`] defines its requirements for components, configs, inputs, intermediate inputs, and outputs. You'll see that this text encoder block uses two text encoders, two tokenizers, as well as a guider component. It takes user inputs such as `prompt` and `negative_prompt`, and returns a list of conditional text embeddings.
```
>>> text_encoder_block
StableDiffusionXLTextEncoderStep(
Class: PipelineBlock
Description: Text Encoder step that generate text_embeddings to guide the image generation
Components:
text_encoder (`CLIPTextModel`)
text_encoder_2 (`CLIPTextModelWithProjection`)
tokenizer (`CLIPTokenizer`)
tokenizer_2 (`CLIPTokenizer`)
guider (`ClassifierFreeGuidance`)
Configs:
force_zeros_for_empty_prompt (default: True)
Inputs:
prompt=None, prompt_2=None, negative_prompt=None, negative_prompt_2=None, cross_attention_kwargs=None, clip_skip=None
Intermediates:
- outputs: prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
)
```
Pipeline blocks are essentially **"definitions"** - they define the specifications and computational steps for a pipeline. However, they do not contain any model states, and are not runnable until converted into a `ModularPipeline` object.
Read more about how to write your own `ModularPipelineBlocks` [here](TODO).
## PipelineState & BlockState
`PipelineState` and `BlockState` manage dataflow between pipeline blocks. `PipelineState` acts as the global state container that `ModularPipelineBlocks` operate on - each block gets a local view (`BlockState`) of the relevant variables it needs from `PipelineState`, performs its operations, and then updates PipelineState with any changes.
<Tip>
You typically don't need to manually create or manage these state objects. The `ModularPipeline` automatically creates and manages them for you. However, understanding their roles is important for developing custom pipeline blocks.
</Tip>
## ModularPipeline
`ModularPipeline` is the main interface to create and execute pipelines in the Modular Diffusers system.
### Modular Repo
`ModularPipeline` only works with modular repositories. You can find an example modular repo [here](https://huggingface.co/YiYiXu/modular-diffdiff).
The main differences from standard diffusers repositories are:
1. `modular_model_index.json` vs `model_index.json`
In standard `model_index.json`, each component entry is a `(library, class)` tuple:
```py
"text_encoder": [
"transformers",
"CLIPTextModel"
],
```
In `modular_model_index.json`, each component entry contains 3 elements: `(library, class, loading_specs {})`
- `library` and `class`: Information about the actual component loaded in the pipeline at the time of saving (can be `None` if not loaded)
- `loading_specs`: A dictionary containing all information required to load this component, including `repo`, `revision`, `subfolder`, `variant`, and `type_hint`
```py
"text_encoder": [
null, # library (same as model_index.json)
null, # class (same as model_index.json)
{ # loading specs map (unique to modular_model_index.json)
"repo": "stabilityai/stable-diffusion-xl-base-1.0", # can be a different repo
"revision": null,
"subfolder": "text_encoder",
"type_hint": [ # (library, class) for the expected component class
"transformers",
"CLIPTextModel"
],
"variant": null
}
],
```
2. Cross-Repository Component Loading
Unlike standard repositories, where components must live in subfolders within the same repo, modular repositories can fetch components from different repositories based on the `loading_specs` dictionary. For example, the `text_encoder` component will be fetched from the "text_encoder" folder in `stabilityai/stable-diffusion-xl-base-1.0` while other components can come from different repositories.
### Create a `ModularPipeline` from `ModularPipelineBlocks`
Each `ModularPipelineBlocks` has an `init_pipeline` method that can initialize a `ModularPipeline` object based on its component and configuration specifications.
```py
>>> pipeline = blocks.init_pipeline(pretrained_model_name_or_path)
```
<Tip>
💡 We recommend using `ModularPipeline` with a Components Manager by passing a `components_manager`:
```py
>>> from diffusers.modular_pipelines import ComponentsManager
>>> components = ComponentsManager()
>>> pipeline = blocks.init_pipeline(pretrained_model_name_or_path, components_manager=components)
```
This helps you to:
1. Detect and manage duplicated models (warns when trying to register an existing model)
2. Easily reuse components across different pipelines
3. Apply offloading strategies across multiple pipelines
You can read more about Components Manager [here](TODO)
</Tip>
Unlike `DiffusionPipeline`, you need to explicitly load model components using `load_components`:
```py
>>> import torch
>>> pipeline.load_components(torch_dtype=torch.float16)
>>> pipeline.to(device)
```
You can also partially load specific components using the `component_names` argument, for example, to load only the unet and vae:
```py
>>> pipeline.load_components(component_names=["unet", "vae"])
```
<Tip>
💡 You can inspect the pipeline's `config` attribute (which has the same structure as the `modular_model_index.json` we just walked through) to check the pipeline's "loading status": which components the pipeline expects to load and their loading specs, which components are already loaded, their actual class and loading specs, and so on.
</Tip>
### Load a `ModularPipeline` from hub
You can create a `ModularPipeline` from a Hugging Face Hub repository with the `from_pretrained` method, as long as it's a modular repo:
```py
pipeline = ModularPipeline.from_pretrained(repo_id, components_manager=..., collection=...)
```
Loading custom code is also supported:
```py
diffdiff_pipeline = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True, ...)
```
Similar to the `init_pipeline` method, the modular pipeline will not load any components automatically, so you have to call `load_components` to explicitly load the components you need.
### Execute a `ModularPipeline`
The API to run the `ModularPipeline` is very similar to how you would run a regular `DiffusionPipeline`:
```py
>>> image = pipeline(prompt="a cat", num_inference_steps=15, output="images")[0]
```
There are a few key differences though:
1. You can also pass a `PipelineState` object directly to the pipeline instead of individual arguments
2. If you do not specify the `output` argument, it returns the `PipelineState` object
3. You can pass a list as `output`, e.g. `pipeline(... output=["images", "latents"])` will return a dictionary containing both the generated image and the final denoised latents
Under the hood, `ModularPipeline`'s `__call__` method is a wrapper around the pipeline blocks' `__call__` method: it creates a `PipelineState` object and populates it with user inputs, then returns the output to the user based on the `output` argument. It also ensures that all pipeline-level config and components are exposed to all pipeline blocks by preparing and passing a `components` input.
### Save a `ModularPipeline`
To save a `ModularPipeline` and publish it to the Hub:
```py
pipeline.save_pretrained("YiYiXu/modular-loader-t2i", push_to_hub=True)
```
<Tip>
We do not automatically save custom code and share it on the Hub for you. Please read more about how to share your custom pipeline on the Hub [here](TODO: ModularPipeline/CustomCode)
</Tip>


@@ -0,0 +1,871 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Getting Started with Modular Diffusers
With Modular Diffusers, we introduce a unified pipeline system that simplifies how you work with diffusion models. Instead of creating separate pipelines for each task, Modular Diffusers lets you:
**Write Only What's New**: You won't need to rewrite the entire pipeline from scratch. You can create pipeline blocks just for your new workflow's unique aspects and reuse existing blocks for existing functionalities.
**Assemble Like LEGO®**: You can mix and match blocks in flexible ways. This allows you to write dedicated blocks for specific workflows, and then assemble different blocks into a pipeline that can be conveniently reused across multiple workflows.
In this guide, we will focus on how to use pipelines built with Modular Diffusers 🧨! We also go over the basics of pipeline blocks, how they work under the hood, and how to assemble `SequentialPipelineBlocks` and `AutoPipelineBlocks` in this [guide](TODO). For advanced users who want to build complete workflows from scratch, we provide an end-to-end example in the [Developer Guide](developer_guide.md) that covers everything from writing custom pipeline blocks to deploying your workflow as a UI node.
Let's get started! The Modular Diffusers Framework consists of three main components:
## ModularPipelineBlocks
Pipeline blocks are the fundamental building blocks of the Modular Diffusers system. All pipeline blocks inherit from the base class `ModularPipelineBlocks`, including:
- [`PipelineBlock`](TODO)
- [`SequentialPipelineBlocks`](TODO)
- [`LoopSequentialPipelineBlocks`](TODO)
- [`AutoPipelineBlocks`](TODO)
To use a `ModularPipelineBlocks` that is officially supported in 🧨 Diffusers:
```py
>>> from diffusers.modular_pipelines.stable_diffusion_xl import StableDiffusionXLTextEncoderStep
>>> text_encoder_block = StableDiffusionXLTextEncoderStep()
```
Each [`ModularPipelineBlocks`] defines its requirements for components, configs, inputs, intermediate inputs, and outputs. You'll see that this text encoder block uses two text encoders, two tokenizers, as well as a guider component. It takes user inputs such as `prompt` and `negative_prompt`, and returns text embeddings such as `prompt_embeds` and `negative_prompt_embeds`.
```
>>> text_encoder_block
StableDiffusionXLTextEncoderStep(
Class: PipelineBlock
Description: Text Encoder step that generate text_embeddings to guide the image generation
Components:
text_encoder (`CLIPTextModel`)
text_encoder_2 (`CLIPTextModelWithProjection`)
tokenizer (`CLIPTokenizer`)
tokenizer_2 (`CLIPTokenizer`)
guider (`ClassifierFreeGuidance`)
Configs:
force_zeros_for_empty_prompt (default: True)
Inputs:
prompt=None, prompt_2=None, negative_prompt=None, negative_prompt_2=None, cross_attention_kwargs=None, clip_skip=None
Intermediates:
- outputs: prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
)
```
More commonly, you can create a `SequentialPipelineBlocks` using a modular blocks preset officially supported in 🧨 Diffusers.
```py
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS
t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
```
This creates the blocks for a complete text-to-image pipeline.
```py
>>> t2i_blocks
SequentialPipelineBlocks(
Class: ModularPipelineBlocks
Description:
Components:
text_encoder (`CLIPTextModel`)
text_encoder_2 (`CLIPTextModelWithProjection`)
tokenizer (`CLIPTokenizer`)
tokenizer_2 (`CLIPTokenizer`)
guider (`ClassifierFreeGuidance`)
scheduler (`EulerDiscreteScheduler`)
unet (`UNet2DConditionModel`)
vae (`AutoencoderKL`)
image_processor (`VaeImageProcessor`)
Configs:
force_zeros_for_empty_prompt (default: True)
Blocks:
[0] text_encoder (StableDiffusionXLTextEncoderStep)
Description: Text Encoder step that generate text_embeddings to guide the image generation
[1] input (StableDiffusionXLInputStep)
Description: Input processing step that:
1. Determines `batch_size` and `dtype` based on `prompt_embeds`
2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_images_per_prompt`
All input tensors are expected to have either batch_size=1 or match the batch_size
of prompt_embeds. The tensors will be duplicated across the batch dimension to
have a final batch_size of batch_size * num_images_per_prompt.
[2] set_timesteps (StableDiffusionXLSetTimestepsStep)
Description: Step that sets the scheduler's timesteps for inference
[3] prepare_latents (StableDiffusionXLPrepareLatentsStep)
Description: Prepare latents step that prepares the latents for the text-to-image generation process
[4] prepare_add_cond (StableDiffusionXLPrepareAdditionalConditioningStep)
Description: Step that prepares the additional conditioning for the text-to-image generation process
[5] denoise (StableDiffusionXLDenoiseLoop)
Description: Denoise step that iteratively denoise the latents.
Its loop logic is defined in `StableDiffusionXLDenoiseLoopWrapper.__call__` method
At each iteration, it runs blocks defined in `blocks` sequencially:
- `StableDiffusionXLLoopBeforeDenoiser`
- `StableDiffusionXLLoopDenoiser`
- `StableDiffusionXLLoopAfterDenoiser`
[6] decode (StableDiffusionXLDecodeStep)
Description: Step that decodes the denoised latents into images
)
```
The blocks preset we used (`TEXT2IMAGE_BLOCKS`) is just a dictionary that maps names to `ModularPipelineBlocks` classes:
```py
>>> TEXT2IMAGE_BLOCKS
InsertableOrderedDict([
0: ('text_encoder', <class 'diffusers.modular_pipelines.stable_diffusion_xl.encoders.StableDiffusionXLTextEncoderStep'>),
1: ('input', <class 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLInputStep'>),
2: ('set_timesteps', <class 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLSetTimestepsStep'>),
3: ('prepare_latents', <class 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLPrepareLatentsStep'>),
4: ('prepare_add_cond', <class 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLPrepareAdditionalConditioningStep'>),
5: ('denoise', <class 'diffusers.modular_pipelines.stable_diffusion_xl.denoise.StableDiffusionXLDenoiseLoop'>),
6: ('decode', <class 'diffusers.modular_pipelines.stable_diffusion_xl.decoders.StableDiffusionXLDecodeStep'>)
])
```
When we create a `SequentialPipelineBlocks` from this preset, it instantiates each class into actual block objects. Its `blocks` attribute contains these instantiated objects:
```py
>>> t2i_blocks.blocks
InsertableOrderedDict([
0: ('text_encoder', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.encoders.StableDiffusionXLTextEncoderStep'>),
1: ('input', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLInputStep'>),
2: ('set_timesteps', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLSetTimestepsStep'>),
3: ('prepare_latents', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLPrepareLatentsStep'>),
4: ('prepare_add_cond', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.before_denoise.StableDiffusionXLPrepareAdditionalConditioningStep'>),
5: ('denoise', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.denoise.StableDiffusionXLDenoiseLoop'>),
6: ('decode', <obj 'diffusers.modular_pipelines.stable_diffusion_xl.decoders.StableDiffusionXLDecodeStep'>)
])
```
Note that both the preset and the `blocks` attribute are `InsertableOrderedDict` objects, which allows you to modify them in several ways:
**Add a block/block_class at specific positions:**
```py
# Add to preset (class)
BLOCKS.insert("block_name", BlockClass, index)
# Add to blocks attribute (instance)
t2i_blocks.blocks.insert("block_name", block_instance, index)
```
**Remove blocks:**
```py
# remove a block class from preset
BLOCKS.pop("text_encoder")
# split out a block instance on its own
text_encoder_block = t2i_blocks.blocks.pop("text_encoder")
```
**Swap/replace blocks:**
```py
# Replace in preset (class)
BLOCKS["prepare_latents"] = CustomPrepareLatents
# Replace in blocks attribute (instance)
t2i_blocks.blocks["prepare_latents"] = CustomPrepareLatents()
```
This means you can mix-and-match blocks in very flexible ways. Let's see some real examples:
**Example 1: Adding IP-Adapter to the preset**
Let's insert IP-Adapter at index 0 (before the text_encoder block) to create a text-to-image pipeline with IP-Adapter support:
```py
from diffusers.modular_pipelines.stable_diffusion_xl import StableDiffusionXLAutoIPAdapterStep
CUSTOM_BLOCKS = TEXT2IMAGE_BLOCKS.copy()
CUSTOM_BLOCKS.insert("ip_adapter", StableDiffusionXLAutoIPAdapterStep, 0)
custom_blocks = SequentialPipelineBlocks.from_blocks_dict(CUSTOM_BLOCKS)
```
**Example 2: Extracting a block from the pipeline**
You can extract a block instance from the pipeline to use it independently. A common pattern is to extract the text_encoder to process prompts once, then reuse the text embeddings to generate multiple images with different settings (schedulers, seeds, inference steps).
```py
>>> text_encoder_blocks = t2i_blocks.blocks.pop("text_encoder")
>>> text_encoder_blocks
StableDiffusionXLTextEncoderStep(
Class: PipelineBlock
Description: Text Encoder step that generate text_embeddings to guide the image generation
Components:
text_encoder (`CLIPTextModel`)
text_encoder_2 (`CLIPTextModelWithProjection`)
tokenizer (`CLIPTokenizer`)
tokenizer_2 (`CLIPTokenizer`)
guider (`ClassifierFreeGuidance`)
Configs:
force_zeros_for_empty_prompt (default: True)
Inputs:
prompt=None, prompt_2=None, negative_prompt=None, negative_prompt_2=None, cross_attention_kwargs=None, clip_skip=None
Intermediates:
- outputs: prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
)
```
The `t2i_blocks` now has fewer components and no longer includes the `text_encoder` block:
```py
>>> t2i_blocks
SequentialPipelineBlocks(
Class: ModularPipelineBlocks
Description:
Components:
scheduler (`EulerDiscreteScheduler`)
guider (`ClassifierFreeGuidance`)
unet (`UNet2DConditionModel`)
vae (`AutoencoderKL`)
image_processor (`VaeImageProcessor`)
Blocks:
[0] input (StableDiffusionXLInputStep)
Description: Input processing step that:
1. Determines `batch_size` and `dtype` based on `prompt_embeds`
2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_images_per_prompt`
All input tensors are expected to have either batch_size=1 or match the batch_size
of prompt_embeds. The tensors will be duplicated across the batch dimension to
have a final batch_size of batch_size * num_images_per_prompt.
[1] set_timesteps (StableDiffusionXLSetTimestepsStep)
Description: Step that sets the scheduler's timesteps for inference
[2] prepare_latents (StableDiffusionXLPrepareLatentsStep)
Description: Prepare latents step that prepares the latents for the text-to-image generation process
[3] prepare_add_cond (StableDiffusionXLPrepareAdditionalConditioningStep)
Description: Step that prepares the additional conditioning for the text-to-image generation process
[4] denoise (StableDiffusionXLDenoiseLoop)
Description: Denoise step that iteratively denoise the latents.
Its loop logic is defined in `StableDiffusionXLDenoiseLoopWrapper.__call__` method
At each iteration, it runs blocks defined in `blocks` sequencially:
- `StableDiffusionXLLoopBeforeDenoiser`
- `StableDiffusionXLLoopDenoiser`
- `StableDiffusionXLLoopAfterDenoiser`
[5] decode (StableDiffusionXLDecodeStep)
Description: Step that decodes the denoised latents into images
)
```
We will not cover how to write your own `ModularPipelineBlocks` in this guide, but you can learn more about it [here](TODO).
This covers the essentials of pipeline blocks! You may have noticed that we haven't discussed how to load or run pipeline blocks - that's because **pipeline blocks are not runnable by themselves**. They are essentially **"definitions"** - they define the specifications and computational steps for a pipeline, but they do not contain any model states. To actually run them, you need to convert them into a `ModularPipeline` object.
## PipelineState & BlockState
`PipelineState` and `BlockState` manage dataflow between pipeline blocks. `PipelineState` acts as the global state container that `ModularPipelineBlocks` operate on - each block gets a local view (`BlockState`) of the relevant variables it needs from `PipelineState`, performs its operations, and then updates PipelineState with any changes.
<Tip>
You typically don't need to manually create or manage these state objects. The `ModularPipeline` automatically creates and manages them for you. However, understanding their roles is important for developing custom pipeline blocks.
</Tip>
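To build intuition for this dataflow, here is a minimal, self-contained toy sketch. It is purely illustrative and does not use the real diffusers `PipelineState`/`BlockState` API; every name in it is made up for clarity.
```py
from dataclasses import dataclass, field


@dataclass
class ToyPipelineState:
    """Global state container shared by all blocks (toy stand-in for PipelineState)."""
    values: dict = field(default_factory=dict)


@dataclass
class ToyBlockState:
    """Local view of just the variables one block needs (toy stand-in for BlockState)."""
    values: dict = field(default_factory=dict)


def run_toy_text_encoder_block(state: ToyPipelineState) -> None:
    # 1. Pull only the declared inputs from the global state into a local view
    block_state = ToyBlockState({key: state.values[key] for key in ("prompt",)})
    # 2. Perform the block's computation on the local view
    block_state.values["prompt_embeds"] = f"embeddings({block_state.values['prompt']})"
    # 3. Write the outputs back so downstream blocks can consume them
    state.values.update(block_state.values)


state = ToyPipelineState(values={"prompt": "a cat"})
run_toy_text_encoder_block(state)
print(sorted(state.values))  # ['prompt', 'prompt_embeds']
```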
## ModularPipeline
`ModularPipeline` is the main interface to create and execute pipelines in the Modular Diffusers system.
### Modular Repo
`ModularPipeline` only works with modular repositories. You can find an example modular repo [here](https://huggingface.co/YiYiXu/modular-diffdiff).
Instead of using a `model_index.json` to configure component loading like `DiffusionPipeline` does, modular repositories work with a `modular_model_index.json`. Let's walk through the differences.
In standard `model_index.json`, each component entry is a `(library, class)` tuple:
```py
"text_encoder": [
"transformers",
"CLIPTextModel"
],
```
In `modular_model_index.json`, each component entry contains 3 elements: `(library, class, loading_specs {})`
- `library` and `class`: Information about the actual component loaded in the pipeline at the time of saving (can be `None` if not loaded)
- `loading_specs`: A dictionary containing all information required to load this component, including `repo`, `revision`, `subfolder`, `variant`, and `type_hint`
```py
"text_encoder": [
null, # library (same as model_index.json)
null, # class (same as model_index.json)
{ # loading specs map (unique to modular_model_index.json)
"repo": "stabilityai/stable-diffusion-xl-base-1.0", # can be a different repo
"revision": null,
"subfolder": "text_encoder",
"type_hint": [ # (library, class) for the expected component class
"transformers",
"CLIPTextModel"
],
"variant": null
}
],
```
Unlike standard repositories, where components must live in subfolders within the same repo, modular repositories can fetch components from different repositories based on the `loading_specs` dictionary. For example, the `text_encoder` component will be fetched from the "text_encoder" folder in `stabilityai/stable-diffusion-xl-base-1.0` while other components can come from different repositories.
### Creating a `ModularPipeline` from `ModularPipelineBlocks`
Each `ModularPipelineBlocks` has an `init_pipeline` method that can initialize a `ModularPipeline` object based on its component and configuration specifications.
Let's convert our `t2i_blocks` (which we created earlier) into a runnable `ModularPipeline`:
```py
# We already have this from earlier
t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
# Now convert it to a ModularPipeline
modular_repo_id = "YiYiXu/modular-loader-t2i"
t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id)
```
<Tip>
💡 We recommend using `ModularPipeline` with a Components Manager by passing a `components_manager`:
```py
>>> from diffusers.modular_pipelines import ComponentsManager
>>> components = ComponentsManager()
>>> pipeline = blocks.init_pipeline(modular_repo_id, components_manager=components)
```
This helps you to:
1. Detect and manage duplicated models (warns when trying to register an existing model)
2. Easily reuse components across different pipelines
3. Apply offloading strategies across multiple pipelines
You can read more about Components Manager [here](TODO)
</Tip>
### Creating a `ModularPipeline` with `from_pretrained`
You can create a `ModularPipeline` from a Hugging Face Hub repository with the `from_pretrained` method, as long as it's a modular repo:
```py
# YiYi TODO: this is not yet supported actually 😢, need to add support
from diffusers import ModularPipeline
pipeline = ModularPipeline.from_pretrained(repo_id, components_manager=..., collection=...)
```
Loading custom code is also supported:
```py
from diffusers import ModularPipeline
modular_repo_id = "YiYiXu/modular-diffdiff"
diffdiff_pipeline = ModularPipeline.from_pretrained(modular_repo_id, trust_remote_code=True)
```
### Loading components into a `ModularPipeline`
Unlike `DiffusionPipeline`, when you create a `ModularPipeline` instance (whether using `from_pretrained` or converting from pipeline blocks), its components aren't loaded automatically. You need to explicitly load model components using `load_components`:
```py
import torch

# This will load ALL the expected components into the pipeline
t2i_pipeline.load_components(torch_dtype=torch.float16)
t2i_pipeline.to("cuda")
```
All expected components are now loaded into the pipeline. You can also partially load specific components using the `component_names` argument. For example, to load only the unet and vae:
```py
>>> t2i_pipeline.load_components(component_names=["unet", "vae"])
```
You can inspect the pipeline's loading status through its `loader` attribute to understand what components are expected to load, which ones are already loaded, how they were loaded, and what loading specs are available. It has the same structure as the `modular_model_index.json` we discussed earlier - each component entry contains the `(library, class, loading_specs)` format. You'll need to understand that structure to properly read the loading status below.
Let's inspect the `t2i_pipeline`. You can see that all the components expected to load are listed as entries in the loader. The `guider` and `image_processor` components were created with a default config (their `library` and `class` fields are populated, meaning they are initialized, but the `repo` field in their loading specs is `null`). The `vae` and `unet` components were loaded using their respective loading specs. The rest of the components (`scheduler`, `text_encoder`, `text_encoder_2`, `tokenizer`, `tokenizer_2`) are not loaded yet (their `library` and `class` fields are `null`), but you can examine their loading specs to see where they will be loaded from when you call `load_components()`.
```py
>>> t2i_pipeline.loader
StableDiffusionXLModularLoader {
"_class_name": "StableDiffusionXLModularLoader",
"_diffusers_version": "0.34.0.dev0",
"force_zeros_for_empty_prompt": true,
"guider": [
"diffusers",
"ClassifierFreeGuidance",
{
"repo": null,
"revision": null,
"subfolder": null,
"type_hint": [
"diffusers",
"ClassifierFreeGuidance"
],
"variant": null
}
],
"image_processor": [
"diffusers",
"VaeImageProcessor",
{
"repo": null,
"revision": null,
"subfolder": null,
"type_hint": [
"diffusers",
"VaeImageProcessor"
],
"variant": null
}
],
"scheduler": [
null,
null,
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "scheduler",
"type_hint": [
"diffusers",
"EulerDiscreteScheduler"
],
"variant": null
}
],
"text_encoder": [
null,
null,
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "text_encoder",
"type_hint": [
"transformers",
"CLIPTextModel"
],
"variant": null
}
],
"text_encoder_2": [
null,
null,
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "text_encoder_2",
"type_hint": [
"transformers",
"CLIPTextModelWithProjection"
],
"variant": null
}
],
"tokenizer": [
null,
null,
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "tokenizer",
"type_hint": [
"transformers",
"CLIPTokenizer"
],
"variant": null
}
],
"tokenizer_2": [
null,
null,
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "tokenizer_2",
"type_hint": [
"transformers",
"CLIPTokenizer"
],
"variant": null
}
],
"unet": [
"diffusers",
"UNet2DConditionModel",
{
"repo": "RunDiffusion/Juggernaut-XL-v9",
"revision": null,
"subfolder": "unet",
"type_hint": [
"diffusers",
"UNet2DConditionModel"
],
"variant": "fp16"
}
],
"vae": [
"diffusers",
"AutoencoderKL",
{
"repo": "madebyollin/sdxl-vae-fp16-fix",
"revision": null,
"subfolder": null,
"type_hint": [
"diffusers",
"AutoencoderKL"
],
"variant": null
}
]
}
```
### Updating components in a `ModularPipeline`
Similar to `DiffusionPipeline`, you can load a component separately to replace the default one in the pipeline. But in the Modular Diffusers system, you need to use `ComponentSpec` to load or create components.
`ComponentSpec` defines how to create or load components and can actually create them using its `create()` method (for ConfigMixin objects) or `load()` method (wrapper around `from_pretrained()`). When a component is loaded with a ComponentSpec, it gets tagged with a unique ID that encodes its creation parameters, allowing you to always extract the original specification using `ComponentSpec.from_component()`. In Modular Diffusers, all pretrained models should be loaded using `ComponentSpec` objects.
So instead of:
```py
from diffusers import UNet2DConditionModel
import torch
unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", variant="fp16", torch_dtype=torch.float16)
```
you should do:
```py
from diffusers import ComponentSpec, UNet2DConditionModel
unet_spec = ComponentSpec(name="unet", type_hint=UNet2DConditionModel, repo="stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", variant="fp16")
unet2 = unet_spec.load(torch_dtype=torch.float16)
```
The key difference is that the second unet (the one we load with `ComponentSpec`) retains its loading specs, so you can extract and recreate it:
```py
# extract the spec from the loaded component; you can then call spec.load() to recreate it
>>> spec = ComponentSpec.from_component("unet", unet2)
>>> spec
ComponentSpec(name='unet', type_hint=<class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>, description=None, config=None, repo='stabilityai/stable-diffusion-xl-base-1.0', subfolder='unet', variant='fp16', revision=None, default_creation_method='from_pretrained')
```
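As the comment above notes, the extracted spec can be used to recreate an equivalent component later:
```py
import torch

# Recreate an equivalent unet from the extracted spec
unet_recreated = spec.load(torch_dtype=torch.float16)
```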
To replace the unet in the pipeline:
```py
t2i_pipeline.update_components(unet=unet2)
```
Not only is the `unet` component swapped, but its loading specs are also updated from "RunDiffusion/Juggernaut-XL-v9" to "stabilityai/stable-diffusion-xl-base-1.0". This means that if you save the pipeline now and load it back with `from_pretrained`, the new pipeline will load the original SDXL unet by default.
```
>>> t2i_pipeline.loader
StableDiffusionXLModularLoader {
...
"unet": [
"diffusers",
"UNet2DConditionModel",
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "unet",
"type_hint": [
"diffusers",
"UNet2DConditionModel"
],
"variant": "fp16"
}
],
...
}
```
### Run a `ModularPipeline`
The API to run the `ModularPipeline` is very similar to how you would run a regular `DiffusionPipeline`:
```py
>>> image = t2i_pipeline(prompt="a cat", num_inference_steps=15, output="images")[0]
```
There are a few key differences though:
1. You can also pass a `PipelineState` object directly to the pipeline instead of individual arguments
2. If you do not specify the `output` argument, it returns the `PipelineState` object
3. You can pass a list as `output`, e.g. `pipeline(... output=["images", "latents"])` will return a dictionary containing both the generated image and the final denoised latents
Under the hood, `ModularPipeline`'s `__call__` method is a wrapper around the pipeline blocks' `__call__` method: it creates a `PipelineState` object and populates it with user inputs, then returns the output to the user based on the `output` argument. It also ensures that all pipeline-level config and components are exposed to all pipeline blocks by preparing and passing a `components` input.
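For example, with the `t2i_pipeline` from this guide loaded, the three calling patterns described above look like this (a sketch based on the behavior described in this section):
```py
# 1. No `output` specified: the call returns the PipelineState object
state = t2i_pipeline(prompt="a cat", num_inference_steps=15)

# 2. A single `output`: returns just that value
image = t2i_pipeline(prompt="a cat", num_inference_steps=15, output="images")[0]

# 3. A list of outputs: returns a dictionary with the requested values
result = t2i_pipeline(prompt="a cat", num_inference_steps=15, output=["images", "latents"])
image, latents = result["images"][0], result["latents"]
```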
<Tip>
You can inspect the docstring of a `ModularPipeline` to check what arguments the pipeline accepts and how to specify the `output` you want. It will list all available outputs (basically everything in the intermediate pipeline state) so you can choose from the list.
```py
t2i_pipeline.doc
```
</Tip>
Here is the complete text-to-image example, putting all of these pieces together:
```py
import torch
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS
t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
modular_repo_id = "YiYiXu/modular-loader-t2i"
t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id)
t2i_pipeline.load_components(torch_dtype=torch.float16)
t2i_pipeline.to("cuda")
image = t2i_pipeline(prompt="a cat", output="images")[0]
image.save("modular_t2i_out.png")
```
## A Slightly More Advanced Workflow
We've covered the basic components of the Modular Diffusers system. Now let's tie everything together with a more practical example that demonstrates the true power of Modular Diffusers: working with multiple pipelines that share components.
```py
import torch
from diffusers.modular_pipelines import SequentialPipelineBlocks, ComponentsManager
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS, IMAGE2IMAGE_BLOCKS
# create t2i blocks, then pop out the text_encoder and decode steps so that we can use them in a standalone manner
t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS.copy())
text_blocks = t2i_blocks.blocks.pop("text_encoder")
decoder_blocks = t2i_blocks.blocks.pop("decode")
# Create the refiner blocks
# - remove image_encoder since we'll use latents from t2i
# - remove decode since we already created a separate decoder_blocks
i2i_blocks_dict = IMAGE2IMAGE_BLOCKS.copy()
i2i_blocks_dict.pop("image_encoder")
i2i_blocks_dict.pop("decode")
refiner_blocks = SequentialPipelineBlocks.from_blocks_dict(i2i_blocks_dict)
# Set up component manager and turn on the offloading
components = ComponentsManager()
components.enable_auto_cpu_offload(device="cuda")
# convert all blocks into runnable pipelines: text_node, decoder_node, t2i_pipe, refiner_pipe
t2i_repo = "YiYiXu/modular-loader-t2i"
refiner_repo = "YiYiXu/modular_refiner"
dtype = torch.float16
text_node = text_blocks.init_pipeline(t2i_repo, components_manager=components, collection="t2i")
text_node.load_components(torch_dtype=dtype)
decoder_node = decoder_blocks.init_pipeline(t2i_repo, components_manager=components, collection="t2i")
decoder_node.load_components(torch_dtype=dtype)
t2i_pipe = t2i_blocks.init_pipeline(t2i_repo, components_manager=components, collection="t2i")
t2i_pipe.load_components(torch_dtype=dtype)
# for the refiner pipeline, only the unet is unique, so we only load the unet here and reuse the other components
refiner_pipe = refiner_blocks.init_pipeline(refiner_repo, components_manager=components, collection="refiner")
refiner_pipe.load_components(component_names="unet", torch_dtype=dtype)
```
Let's inspect the components manager. You can see that five models are automatically registered: two text encoders, two UNets, and one VAE. The models are organized by collection: four models under "t2i" and one UNet under "refiner". This happens because we passed a `collection` parameter when initializing each pipeline. For example, when we created the refiner pipeline, we did `refiner_pipe = refiner_blocks.init_pipeline(refiner_repo, components_manager=components, collection="refiner")`. All models loaded by `refiner_pipe.load_components(...)` are automatically placed under the "refiner" collection.
Notice that all models are currently on CPU with the execution device "cuda:0" - this is due to the auto CPU offloading strategy we enabled with `components.enable_auto_cpu_offload(device="cuda")`.
The manager also displays useful info like the dtype and memory size of each model.
```py
>>> components
Components:
======================================================================================================================================================================================
Models:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name | Class | Device: act(exec) | Dtype | Size (GB) | Load ID | Collection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
text_encoder_2 | CLIPTextModelWithProjection | cpu(cuda:0) | torch.float16 | 1.29 | stabilityai/stable-diffusion-xl-base-1.0|text_encoder_2|null|null | t2i
text_encoder | CLIPTextModel | cpu(cuda:0) | torch.float16 | 0.23 | stabilityai/stable-diffusion-xl-base-1.0|text_encoder|null|null | t2i
unet | UNet2DConditionModel | cpu(cuda:0) | torch.float16 | 4.78 | RunDiffusion/Juggernaut-XL-v9|unet|fp16|null | t2i
unet | UNet2DConditionModel | cpu(cuda:0) | torch.float16 | 4.21 | stabilityai/stable-diffusion-xl-refiner-1.0|unet|null|null | refiner
vae | AutoencoderKL | cpu(cuda:0) | torch.float16 | 0.16 | madebyollin/sdxl-vae-fp16-fix|null|null|null | t2i
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Other Components:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name | Class | Collection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tokenizer | CLIPTokenizer | t2i
tokenizer_2 | CLIPTokenizer | t2i
scheduler | EulerDiscreteScheduler | t2i
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Additional Component Info:
==================================================
```
Now let's reuse components from the t2i pipeline in the refiner. First, let's check the loading status of the refiner pipeline to understand what components are needed:
```py
>>> refiner_pipe.loader
```
Looking at the loader output, you can see that `text_encoder` and `tokenizer` have empty loading spec maps (their `repo` fields are `null`). This is because the refiner pipeline does not use these two components, so they are not listed in the `modular_model_index.json` in `refiner_repo`. The `unet` is already correctly loaded from the refiner repository. We need to load the remaining components: `vae`, `text_encoder_2`, `tokenizer_2`, and `scheduler`. Since these components are already available in the "t2i" collection, we can reuse them instead of loading duplicates.
Now let's reuse the components from the t2i pipeline. We use `|` to select multiple components from the components manager at once:
```py
# Reuse components from t2i pipeline (select everything at once)
reuse_components = components.get("text_encoder_2|scheduler|vae|tokenizer_2", as_name_component_tuples=True)
refiner_pipe.update_components(**dict(reuse_components))
```
You'll see warnings indicating that these components already exist in the components manager:
```out
component 'text_encoder_2' already exists as 'text_encoder_2_238ae9a7-c864-4837-a8a2-f58ed753b2d0'
component 'tokenizer_2' already exists as 'tokenizer_2_b795af3d-f048-4b07-a770-9e8237a2be2d'
component 'scheduler' already exists as 'scheduler_e3435f63-266a-4427-9383-eb812e830fe8'
component 'vae' already exists as 'vae_357eee6a-4a06-46f1-be83-494f7d60ca69'
```
These warnings are expected and indicate that the components manager is correctly identifying that these components are already loaded. The system will reuse the existing components rather than creating duplicates.
Let's check the components manager again to see the updated state:
```py
>>> components
Components:
======================================================================================================================================================================================
Models:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name | Class | Device: act(exec) | Dtype | Size (GB) | Load ID | Collection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
text_encoder | CLIPTextModel | cpu(cuda:0) | torch.float16 | 0.23 | stabilityai/stable-diffusion-xl-base-1.0|text_encoder|null|null | t2i
text_encoder_2 | CLIPTextModelWithProjection | cpu(cuda:0) | torch.float16 | 1.29 | stabilityai/stable-diffusion-xl-base-1.0|text_encoder_2|null|null | t2i
| | | | | | refiner
vae | AutoencoderKL | cpu(cuda:0) | torch.float16 | 0.16 | madebyollin/sdxl-vae-fp16-fix|null|null|null | t2i
| | | | | | refiner
unet | UNet2DConditionModel | cpu(cuda:0) | torch.float16 | 4.78 | RunDiffusion/Juggernaut-XL-v9|unet|fp16|null | t2i
unet | UNet2DConditionModel | cpu(cuda:0) | torch.float16 | 4.21 | stabilityai/stable-diffusion-xl-refiner-1.0|unet|null|null | refiner
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Other Components:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name | Class | Collection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tokenizer_2 | CLIPTokenizer | t2i
| | refiner
tokenizer | CLIPTokenizer | t2i
scheduler | EulerDiscreteScheduler | t2i
| | refiner
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Additional Component Info:
==================================================
```
Notice how `text_encoder_2`, `vae`, `tokenizer_2`, and `scheduler` now appear under both "t2i" and "refiner" collections.
Now we can generate an image with the t2i pipeline and refine it.
First, run the prompt through `text_node` to get the prompt embeddings:
<Tip>
💡 Don't forget to check `text_node.doc` to find out what outputs are available and set the `output` argument accordingly.
</Tip>
```py
prompt = "A crystal orb resting on a wooden table with a yellow rubber duck, surrounded by aged scrolls and alchemy tools, illuminated by candlelight, detailed texture, high resolution image"
text_embeddings = text_node(prompt=prompt, output=["prompt_embeds","negative_prompt_embeds", "pooled_prompt_embeds", "negative_pooled_prompt_embeds"])
```
Now generate latents with the t2i pipeline and then refine them with the refiner. Note that neither `t2i_pipe` nor `refiner_pipe` has a decode step since we separated it out earlier, so we need to use `output="latents"` instead of `output="images"`.
<Tip>
💡 `t2i_pipe.blocks` shows you what steps this pipeline takes. You can see that our `t2i_pipe` no longer includes the `text_encoder` and `decode` steps since we removed them earlier when we popped them out to create separate nodes.
```py
>>> t2i_pipe.blocks
SequentialPipelineBlocks(
Class: ModularPipelineBlocks
Description:
Components:
scheduler (`EulerDiscreteScheduler`)
guider (`ClassifierFreeGuidance`)
unet (`UNet2DConditionModel`)
Blocks:
[0] input (StableDiffusionXLInputStep)
Description: Input processing step that:
1. Determines `batch_size` and `dtype` based on `prompt_embeds`
2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_images_per_prompt`
All input tensors are expected to have either batch_size=1 or match the batch_size
of prompt_embeds. The tensors will be duplicated across the batch dimension to
have a final batch_size of batch_size * num_images_per_prompt.
[1] set_timesteps (StableDiffusionXLSetTimestepsStep)
Description: Step that sets the scheduler's timesteps for inference
[2] prepare_latents (StableDiffusionXLPrepareLatentsStep)
Description: Prepare latents step that prepares the latents for the text-to-image generation process
[3] prepare_add_cond (StableDiffusionXLPrepareAdditionalConditioningStep)
Description: Step that prepares the additional conditioning for the text-to-image generation process
[4] denoise (StableDiffusionXLDenoiseLoop)
Description: Denoise step that iteratively denoise the latents.
Its loop logic is defined in `StableDiffusionXLDenoiseLoopWrapper.__call__` method
At each iteration, it runs blocks defined in `blocks` sequencially:
- `StableDiffusionXLLoopBeforeDenoiser`
- `StableDiffusionXLLoopDenoiser`
- `StableDiffusionXLLoopAfterDenoiser`
)
```
</Tip>
```py
latents = t2i_pipe(**text_embeddings, num_inference_steps=25, output="latents")
refined_latents = refiner_pipe(image_latents=latents, prompt=prompt, num_inference_steps=10, output="latents")
```
To get the final images, we need to pass the latents through our separate decoder node:
```py
image = decoder_node(latents=latents, output="images")[0]
refined_image = decoder_node(latents=refined_latents, output="images")[0]
```
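Finally, save both results to compare the base and refined outputs (the filenames are just examples):
```py
image.save("t2i_base.png")
refined_image.save("t2i_refined.png")
```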
# YiYi TODO: maybe more on controlnet/lora/ip-adapter