diff --git a/docs/source/en/using-diffusers/controlling_generation.mdx b/docs/source/en/using-diffusers/controlling_generation.mdx
index 2660903517..b4b3a9bbcc 100644
--- a/docs/source/en/using-diffusers/controlling_generation.mdx
+++ b/docs/source/en/using-diffusers/controlling_generation.mdx
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# Controlled generation
-Controlling outputs generated by diffusion models has been long pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in inputs, both images and text prompts, can drastically change outputs. In an ideal world we want to be able to control how semantics are preserved and changed.
+Controlling outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in inputs, both images and text prompts, can drastically change outputs. In an ideal world we want to be able to control how semantics are preserved and changed.
Most examples of preserving semantics reduce to being able to accurately map a change in input to a change in output. For example, adding an adjective to a subject in a prompt preserves the entire image, only modifying the changed subject. Or, image variation of a particular subject preserves the subject's pose.
@@ -41,31 +41,31 @@ Unless otherwise mentioned, these are techniques that work with existing models
13. [Model Editing](#model-editing)
14. [DiffEdit](#diffedit)
-For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.
+For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.
-| **Method** | **Inference only** | **Requires training / fine-tuning** | **Comments** |
-|:---:|:---:|:---:|:---:|
-| [Instruct Pix2Pix](#instruct-pix2pix) | ✅ | ❌ | Can additionally be fine-tuned for better performance on specific edit instructions. |
-| [Pix2Pix Zero](#pix2pixzero) | ✅ | ❌ | |
-| [Attend and Excite](#attend-and-excite) | ✅ | ❌ | |
-| [Semantic Guidance](#semantic-guidance) | ✅ | ❌ | |
-| [Self-attention Guidance](#self-attention-guidance) | ✅ | ❌ | |
-| [Depth2Image](#depth2image) | ✅ | ❌ | |
-| [MultiDiffusion Panorama](#multidiffusion-panorama) | ✅ | ❌ | |
-| [DreamBooth](#dreambooth) | ❌ | ✅ | |
-| [Textual Inversion](#textual-inversion) | ❌ | ✅ | |
-| [ControlNet](#controlnet) | ✅ | ❌ | A ControlNet can be trained/fine-tuned on a custom conditioning. |
-| [Prompt Weighting](#prompt-weighting) | ✅ | ❌ | |
-| [Custom Diffusion](#custom-diffusion) | ❌ | ✅ | |
-| [Model Editing](#model-editing) | ✅ | ❌ | |
-| [DiffEdit](#diffedit) | ✅ | ❌ | |
-| [T2I-Adapter](#t2i-adapter) | ✅ | ❌ | |
+| **Method**                                           | **Inference only** | **Requires training / fine-tuning**      | **Comments**                                                                                      |
+| :-------------------------------------------------: | :----------------: | :-------------------------------------: | :---------------------------------------------------------------------------------------------: |
+| [Instruct Pix2Pix](#instruct-pix2pix)                | ✅                 | ❌                                       | Can additionally be fine-tuned for better performance on specific edit instructions.              |
+| [Pix2Pix Zero](#pix2pixzero) | ✅ | ❌ | |
+| [Attend and Excite](#attend-and-excite) | ✅ | ❌ | |
+| [Semantic Guidance](#semantic-guidance) | ✅ | ❌ | |
+| [Self-attention Guidance](#self-attention-guidance) | ✅ | ❌ | |
+| [Depth2Image](#depth2image) | ✅ | ❌ | |
+| [MultiDiffusion Panorama](#multidiffusion-panorama) | ✅ | ❌ | |
+| [DreamBooth](#dreambooth) | ❌ | ✅ | |
+| [Textual Inversion](#textual-inversion) | ❌ | ✅ | |
+| [ControlNet](#controlnet)                            | ✅                 | ❌                                       | A ControlNet can be trained/fine-tuned on a custom conditioning.                                   |
+| [Prompt Weighting](#prompt-weighting) | ✅ | ❌ | |
+| [Custom Diffusion](#custom-diffusion) | ❌ | ✅ | |
+| [Model Editing](#model-editing) | ✅ | ❌ | |
+| [DiffEdit](#diffedit) | ✅ | ❌ | |
+| [T2I-Adapter](#t2i-adapter) | ✅ | ❌ | |
## Instruct Pix2Pix
[Paper](https://arxiv.org/abs/2211.09800)
-[Instruct Pix2Pix](../api/pipelines/stable_diffusion/pix2pix) is fine-tuned from stable diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
+[Instruct Pix2Pix](../api/pipelines/stable_diffusion/pix2pix) is fine-tuned from Stable Diffusion to support editing input images. It takes an image and a prompt describing an edit as inputs, and it outputs the edited image.
Instruct Pix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.
See [here](../api/pipelines/stable_diffusion/pix2pix) for more information on how to use it.
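
A minimal usage sketch with diffusers' `StableDiffusionInstructPix2PixPipeline` might look as follows (the image URL and parameter values below are placeholders/illustrative):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Placeholder URL; any RGB image works as the editing target.
image = load_image("https://path/to/your/image.png")

# The prompt is an edit instruction rather than a full scene description.
edited = pipe(
    "turn the sky into a stormy sunset",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
).images[0]
```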
@@ -79,13 +79,14 @@ See [here](../api/pipelines/stable_diffusion/pix2pix) for more information on ho
The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps are from the denoising process of the input image and are used to encourage semantic preservation.
Pix2Pix Zero can be used both to edit synthetic images as well as real images.
+
- To edit synthetic images, one first generates an image given a caption.
-Next, we generate image captions for the concept that shall be edited and for the new target concept. We can use a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
+ Next, we generate image captions for the concept that shall be edited and for the new target concept. We can use a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
- To edit a real image, one first generates an image caption using a model like [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies DDIM inversion on the prompt and image to generate "inverse" latents. As before, "mean" prompt embeddings for both source and target concepts are created, and finally the pix2pix-zero algorithm in combination with the "inverse" latents is used to edit the image.
-Pix2Pix Zero is the first model that allows "zero-shot" image editing. This means that the model
+Pix2Pix Zero is the first model that allows "zero-shot" image editing. This means that the model
can edit an image in less than a minute on a consumer GPU as shown [here](../api/pipelines/stable_diffusion/pix2pix_zero#usage-example).
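
A rough sketch of the synthetic-image editing path described above, assuming the `StableDiffusionPix2PixZeroPipeline` API (the captions, checkpoint, and parameter values are illustrative, and the small helper below is hypothetical):

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPix2PixZeroPipeline

pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

def mean_prompt_embeds(captions):
    # Hypothetical helper: average the text-encoder embeddings of several captions for one concept.
    ids = pipe.tokenizer(
        captions,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to("cuda")
    with torch.no_grad():
        embeds = pipe.text_encoder(ids)[0]
    return embeds.mean(dim=0, keepdim=True)

# Captions describing the source and target concepts (e.g. produced with Flan-T5); values are illustrative.
source_embeds = mean_prompt_embeds(["a photo of a cat", "a cat sitting on the street"])
target_embeds = mean_prompt_embeds(["a photo of a dog", "a dog sitting on the street"])

image = pipe(
    "a high resolution painting of a cat in the style of van gogh",
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
).images[0]
```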
@@ -99,7 +100,7 @@ See [here](../api/pipelines/stable_diffusion/pix2pix_zero) for more information
[Paper](https://arxiv.org/abs/2301.13826)
-[Attend and Excite](../api/pipelines/stable_diffusion/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.
+[Attend and Excite](../api/pipelines/stable_diffusion/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.
A set of token indices is given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each subject token is guaranteed to reach a minimum attention threshold in at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.
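
A minimal sketch with the `StableDiffusionAttendAndExcitePipeline` (the token indices and parameter values are illustrative and depend on the tokenization of the prompt):

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Indices of the subject tokens to reinforce ("cat" and "frog" here).
image = pipe(
    prompt,
    token_indices=[2, 5],
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
```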
@@ -115,7 +116,7 @@ SEGA allows applying or removing one or more concepts from an image. The strengt
Similar to how classifier-free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple conceptual prompts can be applied simultaneously, and each can either add or remove its concept depending on whether the guidance is applied positively or negatively.
-Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
+Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
See [here](../api/pipelines/semantic_stable_diffusion) for more information on how to use it.
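
A minimal sketch with the `SemanticStableDiffusionPipeline` (the editing prompts and guidance values below are illustrative):

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    editing_prompt=["smiling, smile", "glasses, wearing glasses"],
    reverse_editing_direction=[False, True],  # add the first concept, remove the second
    edit_guidance_scale=[5, 5],
    edit_warmup_steps=[10, 10],
)
image = out.images[0]
```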
@@ -133,7 +134,7 @@ See [here](../api/pipelines/stable_diffusion/self_attention_guidance) for more i
[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth)
-[Depth2Image](../pipelines/stable_diffusion_2#depthtoimage) is fine-tuned from Stable Diffusion to better preserve semantics for text guided image variation.
+[Depth2Image](../pipelines/stable_diffusion_2#depthtoimage) is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.
It conditions on a monocular depth estimate of the original image.
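
A minimal sketch with the `StableDiffusionDepth2ImgPipeline` (the image URL and `strength` value are placeholders/illustrative); if no depth map is passed explicitly, one is estimated from the input image:

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("https://path/to/your/image.png")  # placeholder URL

# The depth map is estimated internally from init_image when none is provided.
image = pipe(
    prompt="two tigers",
    image=init_image,
    negative_prompt="bad, deformed, ugly",
    strength=0.7,
).images[0]
```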
@@ -176,53 +177,55 @@ See [here](../training/text_inversion) for more information on how to use it.
[Paper](https://arxiv.org/abs/2302.05543)
-[ControlNet](../api/pipelines/stable_diffusion/controlnet) is an auxiliary network which adds an extra condition.
-There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles,
+[ControlNet](../api/pipelines/controlnet) is an auxiliary network which adds an extra condition.
+There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles,
depth maps, and semantic segmentations.
-See [here](../api/pipelines/stable_diffusion/controlnet) for more information on how to use it.
+See [here](../api/pipelines/controlnet) for more information on how to use it.
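
For example, a canny-edge ControlNet can be combined with Stable Diffusion roughly like this (the edge-map URL is a placeholder for a pre-computed conditioning image):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# A canny-edge ControlNet; other conditionings use different checkpoints.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("https://path/to/canny_edges.png")  # placeholder edge map
image = pipe("a futuristic city at night", image=canny_image).images[0]
```
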
## Prompt Weighting
-Prompt weighting is a simple technique that puts more attention weight on certain parts of the text
-input.
+Prompt weighting is a simple technique that puts more attention weight on certain parts of the text
+input.
For a more detailed explanation and examples, see [here](../using-diffusers/weighted_prompts).
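
One common way to do this with diffusers is via the third-party `compel` library, which turns a weighted prompt string into `prompt_embeds` (the sketch below assumes `compel` is installed; the prompt is illustrative):

```python
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" upweights the word "ball" relative to the rest of the prompt (compel syntax).
prompt_embeds = compel_proc("a red cat playing with a ball++")
image = pipe(prompt_embeds=prompt_embeds).images[0]
```
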
-## Custom Diffusion
+## Custom Diffusion
-[Custom Diffusion](../training/custom_diffusion) only fine-tunes the cross-attention maps of a pre-trained
-text-to-image diffusion model. It also allows for additionally performing textual inversion. It supports
+[Custom Diffusion](../training/custom_diffusion) only fine-tunes the cross-attention maps of a pre-trained
+text-to-image diffusion model. It can additionally perform textual inversion. It supports
multi-concept training by design. Like DreamBooth and Textual Inversion, Custom Diffusion is also used to
-teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the
+teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the
concept(s) of interest.
-
-For more details, check out our [official doc](../training/custom_diffusion).
-## Model Editing
+For more details, check out our [official doc](../training/custom_diffusion).
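
As a rough, hypothetical sketch of using a trained Custom Diffusion checkpoint at inference time (the path, weight file names, and the `<new1>` modifier token are placeholders that depend on how training was run):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load the fine-tuned cross-attention weights and the learned token (placeholder names).
pipe.unet.load_attn_procs("path-to-custom-diffusion-output", weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion("path-to-custom-diffusion-output", weight_name="<new1>.bin")

image = pipe("<new1> cat swimming in a pool", num_inference_steps=50, guidance_scale=6.0).images[0]
```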
+
+## Model Editing
[Paper](https://arxiv.org/abs/2303.08084)
The [text-to-image model editing pipeline](../api/pipelines/stable_diffusion/model_editing) helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image
diffusion model might make about the subjects present in the input prompt. For example, if you prompt Stable Diffusion to generate images for "A pack of roses", the roses in the generated images
-are more likely to be red. This pipeline helps you change that assumption.
+are more likely to be red. This pipeline helps you change that assumption.
For more details, check out the [official doc](../api/pipelines/stable_diffusion/model_editing).
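
A minimal sketch with the `StableDiffusionModelEditingPipeline`, assuming its `edit_model` API (the prompts are illustrative):

```python
from diffusers import StableDiffusionModelEditingPipeline

pipe = StableDiffusionModelEditingPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

# Edit the model's implicit assumption that roses are red before sampling.
pipe.edit_model("A pack of roses", "A pack of blue roses")

image = pipe("A field of roses").images[0]
```
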
-## DiffEdit
+## DiffEdit
[Paper](https://arxiv.org/abs/2210.11427)
-[DiffEdit](../api/pipelines/stable_diffusion/diffedit) allows for semantic editing of input images along with
-input prompts while preserving the original input images as much as possible.
+[DiffEdit](../api/pipelines/stable_diffusion/diffedit) allows for semantic editing of input images, guided by
+input prompts, while preserving the original input images as much as possible.
For more details, check out the [official doc](../api/pipelines/stable_diffusion/diffedit).
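
A rough sketch of the three-step DiffEdit workflow (mask generation, inversion, editing) with the `StableDiffusionDiffEditPipeline`; the checkpoint, image URL, and prompts are illustrative, and the argument names are assumed from the pipeline's API:

```python
import torch
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

init_image = load_image("https://path/to/fruit_bowl.png")  # placeholder URL

source_prompt = "a bowl of fruits"
target_prompt = "a bowl of pears"

# 1. Contrast the two prompts to find the region that needs to change.
mask = pipe.generate_mask(image=init_image, source_prompt=source_prompt, target_prompt=target_prompt)
# 2. Invert the input image into latents conditioned on the source prompt.
inv_latents = pipe.invert(prompt=source_prompt, image=init_image).latents
# 3. Denoise with the target prompt, editing only the masked region.
image = pipe(prompt=target_prompt, mask_image=mask, image_latents=inv_latents).images[0]
```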
+
## T2I-Adapter
[Paper](https://arxiv.org/abs/2302.08453)
[T2I-Adapter](../api/pipelines/stable_diffusion/adapter) is an auxiliary network which adds an extra condition.
-There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch,
+There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch,
depth maps, and semantic segmentations.
See [here](../api/pipelines/stable_diffusion/adapter) for more information on how to use it.
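
For example, a canny-edge adapter can be combined with Stable Diffusion roughly like this (the adapter checkpoint name and edge-map URL are assumptions/placeholders):

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# A canny-edge adapter; other conditionings use different checkpoints (name assumed).
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

edge_map = load_image("https://path/to/canny_edges.png")  # placeholder pre-computed edge map
image = pipe("a cozy cabin in the woods", image=edge_map).images[0]
```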