[Docs] Fix typos and update files at API's Pipelines page 2 (#5748)

* Fix typos, update, add Copyright info, and trim trailing whitespace
* Update docs/source/en/api/pipelines/text_to_video_zero.md

  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* 1 second is not a long video, but 6 seconds is
* Update text_to_video_zero.md
* Update text_to_video_zero.md
* Update text_to_video_zero.md
* Update wuerstchen.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2026-01-27 17:22:53 +03:00 · 2023-11-15 21:54:55 +03:00
parent 3ad4207d1f
commit ecbe27a07f
27 changed files with 219 additions and 190 deletions
--- a/docs/source/en/api/pipelines/paradigms.md
+++ b/docs/source/en/api/pipelines/paradigms.md
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.

 The abstract from the paper is:

-*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
+*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*

 The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). ❤️

@@ -26,17 +26,14 @@ This pipeline improves sampling speed by running denoising steps in parallel, at
 Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth
 sampling may be even slower than sequential sampling.

-The two parameters to play with are `parallel` (batch size) and `tolerance`. 
- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 
-(for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size
-may not fit in memory, and lower batch size gives less parallelism. 
- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. 
-If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`.
+The two parameters to play with are `parallel` (batch size) and `tolerance`.
+- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 (for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size may not fit in memory, and a lower batch size gives less parallelism.
+- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`.

 For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`]
 by setting `parallel=80` and `tolerance=0.1`.
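
To make the roles of parallel evaluation and `tolerance` more concrete, here is a minimal sketch of a Picard iteration on a toy one-dimensional problem. It is not the pipeline's implementation: the helper names (`sequential_solve`, `picard_solve`), the toy drift function, and the step size are all assumptions chosen for illustration. Each sweep evaluates the drift at every timestep independently (the part a multi-GPU setup could batch), then refines the whole trajectory until successive guesses differ by less than the tolerance.

```python
def sequential_solve(f, x0, steps, h):
    """Reference solver: each step needs the result of the previous one."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return xs


def picard_solve(f, x0, steps, h, tolerance=1e-9):
    """Toy Picard iteration: guess the whole trajectory, then refine it."""
    xs = [x0] * (steps + 1)  # initial guess: every future state equals x0
    for sweep in range(1, steps + 1):
        # These evaluations do not depend on each other, so they could run in parallel.
        drifts = [h * f(x) for x in xs[:-1]]
        new_xs = [x0]
        total = 0.0
        for drift in drifts:
            total += drift
            new_xs.append(x0 + total)
        # Largest change between the previous guess and the refined one.
        error = max(abs(a - b) for a, b in zip(new_xs, xs))
        xs = new_xs
        if error < tolerance:
            break
    return xs, sweep


def decay(x):
    """Hypothetical drift used only for this illustration."""
    return -x


sequential = sequential_solve(decay, 1.0, 20, 0.05)
parallel_result, sweeps = picard_solve(decay, 1.0, 20, 0.05)
print(f"sweeps used: {sweeps} (sequential solver needs 20 dependent steps)")
```

In this toy setting, a looser tolerance stops the refinement after fewer sweeps but leaves a larger gap to the sequential result, which mirrors the speed versus quality tradeoff described above.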

-🤗 Diffusers offers [distributed inference support](../training/distributed_inference) for generating multiple prompts
+🤗 Diffusers offers [distributed inference support](../../training/distributed_inference) for generating multiple prompts
 in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs.

 <Tip>
--- a/docs/source/en/api/pipelines/pix2pix_zero.md
+++ b/docs/source/en/api/pipelines/pix2pix_zero.md
@@ -20,7 +20,7 @@ The abstract from the paper is:

 You can find additional information about Pix2Pix Zero on the [project page](https://pix2pixzero.github.io/),  [original codebase](https://github.com/pix2pixzero/pix2pix-zero), and try it out in a [demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo).

-## Tips 
+## Tips

 * The pipeline can be conditioned on real input images. Check out the code examples below to know more.
 * The pipeline exposes two arguments namely `source_embeds` and `target_embeds`
@@ -29,12 +29,11 @@ you wanted to translate from "cat" to "dog". In this case, the edit direction wi
 this in the pipeline, you simply have to set the embeddings related to the phrases including "cat" to
 `source_embeds` and "dog" to `target_embeds`. Refer to the code example below for more details.
 * When you're using this pipeline from a prompt, specify the _source_ concept in the prompt. Taking
-the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gough".
+the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gogh".
 * If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
    * Swap the `source_embeds` and `target_embeds`.
-    * Change the input prompt to include "dog".  
-* To learn more about how the source and target embeddings are generated, refer to the [original 
-paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings.
+    * Change the input prompt to include "dog".
+* To learn more about how the source and target embeddings are generated, refer to the [original paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings.
 * Note that the quality of the outputs generated with this pipeline is dependent on how good the `source_embeds` and `target_embeds` are. Please, refer to [this discussion](#generating-source-and-target-embeddings) for some suggestions on the topic.
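
As a rough illustration of the tips above, the sketch below treats the edit direction as the difference between the averaged target and source embeddings. That formula is an assumption made for this example rather than something stated in this document, and the helper names and three-dimensional vectors are hypothetical stand-ins for real text-encoder outputs.

```python
def mean_embedding(embeddings):
    """Average a list of equal-length embedding vectors component-wise."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def edit_direction(source_embeds, target_embeds):
    """Assumed form of the edit direction: mean(target) - mean(source)."""
    source_mean = mean_embedding(source_embeds)
    target_mean = mean_embedding(target_embeds)
    return [t - s for s, t in zip(source_mean, target_mean)]


# Hypothetical embeddings for two "cat" captions and two "dog" captions.
cat_embeds = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dog_embeds = [[2.0, 1.0, 2.0], [2.0, 3.0, 2.0]]

cat_to_dog = edit_direction(cat_embeds, dog_embeds)
dog_to_cat = edit_direction(dog_embeds, cat_embeds)
print(cat_to_dog, dog_to_cat)
```

Under this assumption, swapping the two arguments negates the direction, which is why reversing the edit requires swapping `source_embeds` and `target_embeds`.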

 ## Available Pipelines:
@@ -79,23 +78,22 @@ for url in [src_embs_url, target_embs_url]:
 src_embeds = torch.load(src_embs_url.split("/")[-1])
 target_embeds = torch.load(target_embs_url.split("/")[-1])

-images = pipeline(
+image = pipeline(
    prompt,
    source_embeds=src_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
-).images
-images[0].save("edited_image_dog.png")
+).images[0]
+image
 ```

 ### Based on an input image

 When the pipeline is conditioned on an input image, we first obtain an inverted
-noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then 
-the inverted noise is used to start the generation process. 
+noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then the inverted noise is used to start the generation process.

-First, let's load our pipeline: 
+First, let's load our pipeline:

 ```py
 import torch
@@ -119,25 +117,25 @@ pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler
 pipeline.enable_model_cpu_offload()
 ```

-Then, we load an input image for conditioning and obtain a suitable caption for it: 
+Then, we load an input image for conditioning and obtain a suitable caption for it:

 ```py
-import requests
-from PIL import Image
+from diffusers.utils import load_image

 img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
-raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
+raw_image = load_image(img_url).resize((512, 512))
 caption = pipeline.generate_caption(raw_image)
+caption
 ```

-Then we employ the generated caption and the input image to get the inverted noise: 
+Then we employ the generated caption and the input image to get the inverted noise:

-```py 
+```py
 generator = torch.manual_seed(0)
 inv_latents = pipeline.invert(caption, image=raw_image, generator=generator).latents
 ```

-Now, generate the image with edit directions: 
+Now, generate the image with edit directions:

 ```py
 # See the "Generating source and target embeddings" section below to
@@ -159,16 +157,16 @@ image = pipeline(
    latents=inv_latents,
    negative_prompt=caption,
 ).images[0]
-image.save("edited_image.png")
+image
 ```

-## Generating source and target embeddings 
+## Generating source and target embeddings

 The authors originally used the [GPT-3 API](https://openai.com/api/) to generate the source and target captions for discovering
 edit directions. However, we can also leverage open source and public models for the same purpose.
 Below, we provide an end-to-end example with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model
 for generating captions and [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for
-computing embeddings on the generated captions.  
+computing embeddings on the generated captions.

 **1. Load the generation model**:

@@ -180,7 +178,7 @@ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
 model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
 ```

-**2. Construct a starting prompt**: 
+**2. Construct a starting prompt**:

 ```py
 source_concept = "cat"
@@ -193,11 +191,11 @@ target_text = f"Provide a caption for images containing a {target_concept}. "
 "The captions should be in English and should be no longer than 150 characters."
 ```

-Here, we're interested in the "cat -> dog" direction. 
+Here, we're interested in the "cat -> dog" direction.

 **3. Generate captions**:

-We can use a utility like so for this purpose. 
+We can use a utility like so for this purpose.

 ```py
 def generate_captions(input_prompt):
@@ -214,17 +212,18 @@ And then we just call it to generate our captions:
 ```py
 source_captions = generate_captions(source_text)
 target_captions = generate_captions(target_text)
+print(source_captions, target_captions, sep='\n')
 ```

 We encourage you to play around with the different parameters supported by the
 `generate()` method ([documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin.generate)) for the generation quality you are looking for.

-**4. Load the embedding model**: 
+**4. Load the embedding model**:

 Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.

-```py 
-from diffusers import StableDiffusionPix2PixZeroPipeline 
+```py
+from diffusers import StableDiffusionPix2PixZeroPipeline

 pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
@@ -236,8 +235,8 @@ text_encoder = pipeline.text_encoder

 **5. Compute embeddings**:

-```py 
-import torch 
+```py
+import torch

 def embed_captions(sentences, tokenizer, text_encoder, device="cuda"):
    with torch.no_grad():
@@ -261,23 +260,29 @@ target_embeddings = embed_captions(target_captions, tokenizer, text_encoder)

 And you're done! [Here](https://colab.research.google.com/drive/1tz2C1EdfZYAPlzXXbTnf-5PRBiR8_R1F?usp=sharing) is a Colab Notebook that you can use to interact with the entire process.

-Now, you can use these embeddings directly while calling the pipeline: 
+Now, you can use these embeddings directly while calling the pipeline:

 ```py
 from diffusers import DDIMScheduler

 pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)

-images = pipeline(
+image = pipeline(
    prompt,
    source_embeds=source_embeddings,
    target_embeds=target_embeddings,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
-).images
-images[0].save("edited_image_dog.png")
+).images[0]
+image
 ```

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## StableDiffusionPix2PixZeroPipeline
 [[autodoc]] StableDiffusionPix2PixZeroPipeline
 	- __call__
--- a/docs/source/en/api/pipelines/pixart.md
+++ b/docs/source/en/api/pipelines/pixart.md
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# PixArt
+# PixArt-α

 ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png)

@@ -24,13 +24,20 @@ You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github

 Some notes about this pipeline:

-* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit.md).
-* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. 
+* It uses a Transformer backbone (instead of a UNet) for denoising. As such, it has an architecture similar to [DiT](./dit).
+* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
 * It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py).
 * It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## PixArtAlphaPipeline

 [[autodoc]] PixArtAlphaPipeline
 	- all
-	- __call__
+	- __call__
+	
--- a/docs/source/en/api/pipelines/pndm.md
+++ b/docs/source/en/api/pipelines/pndm.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # PNDM

-[Pseudo Numerical methods for Diffusion Models on manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
+[Pseudo Numerical Methods for Diffusion Models on Manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.

 The abstract from the paper is:

@@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 	- __call__

 ## ImagePipelineOutput
-[[autodoc]] pipelines.ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/score_sde_ve.md
+++ b/docs/source/en/api/pipelines/score_sde_ve.md
@@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 	- __call__

 ## ImagePipelineOutput
-[[autodoc]] pipelines.ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/self_attention_guidance.md
+++ b/docs/source/en/api/pipelines/self_attention_guidance.md
@@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 	- all

 ## StableDiffusionOutput
-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/semantic_stable_diffusion.md
+++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.md
@@ -12,12 +12,12 @@ specific language governing permissions and limitations under the License.

 # Semantic Guidance

-Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Diffusion using Semantic Dimensions](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
+Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
 Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition.

 The abstract from the paper is:

-*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.*
+*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.*

 <Tip>

--- a/docs/source/en/api/pipelines/shap_e.md
+++ b/docs/source/en/api/pipelines/shap_e.md
@@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License.

 # Shap-E

-The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai).
+The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai).

 The abstract from the paper is:

@@ -34,4 +34,4 @@ See the [reuse components across pipelines](../../using-diffusers/loading#reuse-
 	- __call__

 ## ShapEPipelineOutput
-[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput
+[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput
--- a/docs/source/en/api/pipelines/spectrogram_diffusion.md
+++ b/docs/source/en/api/pipelines/spectrogram_diffusion.md
@@ -34,4 +34,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 	- __call__

 ## AudioPipelineOutput
-[[autodoc]] pipelines.AudioPipelineOutput
+[[autodoc]] pipelines.AudioPipelineOutput
--- a/docs/source/en/api/pipelines/stable_diffusion/adapter.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/adapter.md
@@ -20,7 +20,7 @@ Using the pretrained models we can provide control images (for example, a depth

 The abstract of the paper is the following:

-*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate structure control is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and small T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, and achieve rich control and editing effects. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*
+*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*

 This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ .

@@ -33,7 +33,7 @@ This model was contributed by the community contributor [HimariO](https://github

 ## Usage example with the base model of StableDiffusion-1.4/1.5

-In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
+In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
 All adapters use the same pipeline.

 1. Images are first converted into the appropriate *control image* format.
@@ -42,7 +42,7 @@ All adapters use the same pipeline.
 Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).

 ```python
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid

 image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
 ```
@@ -83,20 +83,21 @@ Finally, pass the prompt and control image to the pipeline

 ```py
 # fix the random seed, so you will get the same result as the example
-generator = torch.manual_seed(7)
+generator = torch.Generator("cuda").manual_seed(7)

 out_image = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
    generator=generator,
 ).images[0]
+make_image_grid([image, color_palette, out_image], rows=1, cols=3)
 ```

 ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)

 ## Usage example with the base model of StableDiffusion-XL

-In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
+In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
 All adapters use the same pipeline.

 1. Images are first downloaded into the appropriate *control image* format.
@@ -105,7 +106,7 @@ All adapters use the same pipeline.
 Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).

 ```python
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid

 sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
 ```
@@ -121,10 +122,9 @@ from diffusers import (
    StableDiffusionXLAdapterPipeline,
    DDPMScheduler
 )
-from diffusers.models.unet_2d_condition import UNet2DConditionModel

 model_id = "stabilityai/stable-diffusion-xl-base-1.0"
-adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0",torch_dtype=torch.float16, adapter_type="full_adapter_xl")
+adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl")
 scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

 pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
@@ -141,12 +141,13 @@ Finally, pass the prompt and control image to the pipeline
 generator = torch.Generator().manual_seed(42)

 sketch_image_out = pipe(
-    prompt="a photo of a dog in real world, high quality", 
-    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality", 
-    image=sketch_image, 
-    generator=generator, 
+    prompt="a photo of a dog in real world, high quality",
+    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
+    image=sketch_image,
+    generator=generator,
    guidance_scale=7.5
 ).images[0]
+make_image_grid([sketch_image, sketch_image_out], rows=1, cols=2)
 ```

 ![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)
@@ -159,7 +160,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu

 | Model Name | Control Image Overview| Control Image Example | Generated Image Example |
 |---|---|---|---|
-|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | A image with 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
+|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | An image with 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
 |[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"/></a>|
 |[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"/></a>|
 |[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)<br/> *Trained with Midas depth estimation*  | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"/></a>|
@@ -181,9 +182,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
 Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.

 ```py
-import torch
-from PIL import Image
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid

 cond_keypose = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
@@ -191,7 +190,7 @@ cond_keypose = load_image(
 cond_depth = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
 )
-cond = [[cond_keypose, cond_depth]]
+cond = [cond_keypose, cond_depth]

 prompt = ["A man walking in an office room with a nice view"]
 ```
@@ -202,12 +201,13 @@ The two control images look as such:
 ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png)


-`MultiAdapter` combines keypose and depth adapters. 
+`MultiAdapter` combines keypose and depth adapters.

 `adapter_conditioning_scale` balances the relative influence of the different adapters.

 ```py
-from diffusers import StableDiffusionAdapterPipeline, MultiAdapter
+import torch
+from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter

 adapters = MultiAdapter(
    [
@@ -221,19 +221,20 @@ pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    adapter=adapters,
-)
+).to("cuda")

-images = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8])
+image = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images[0]
+make_image_grid([cond_keypose, cond_depth, image], rows=1, cols=3)
 ```

 ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_depth_sample_output.png)


-## T2I Adapter vs ControlNet
+## T2I-Adapter vs ControlNet

-T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet). 
-T2i-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process. 
-However, T2I-Adapter performs slightly worse than ControlNet. 
+T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet).
+T2I-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process.
+However, T2I-Adapter performs slightly worse than ControlNet.

 ## StableDiffusionAdapterPipeline
 [[autodoc]] StableDiffusionAdapterPipeline
--- a/docs/source/en/api/pipelines/stable_diffusion/depth2img.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.md
@@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License.

 # Depth-to-image

-The Stable Diffusion model can also infer depth based on an image using [MiDas](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. 
+The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.

 <Tip>

-Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

 If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!

@@ -37,4 +37,4 @@ If you're interested in using one of the official checkpoints for a task, explor

 ## StableDiffusionPipelineOutput

-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.md
@@ -23,7 +23,7 @@ text-to-image Stable Diffusion checkpoints, such as

 <Tip>

-Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

 If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!

@@ -54,4 +54,4 @@ If you're interested in using one of the official checkpoints for a task, explor

 ## FlaxStableDiffusionPipelineOutput

-[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md
@@ -16,7 +16,7 @@ The Stable Diffusion latent upscaler model was created by [Katherine Crowson](ht

 <Tip>

-Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

 If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!

@@ -35,4 +35,4 @@ If you're interested in using one of the official checkpoints for a task, explor

 ## StableDiffusionPipelineOutput

-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/stable_diffusion/overview.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/overview.md
@@ -34,7 +34,7 @@ The table below summarizes the available Stable Diffusion pipelines, their suppo
            Supported tasks
            </th>
            <th class="px-4 py-2 font-medium text-gray-900 text-left">
-            Space
+            🤗 Space
            </th>
        </tr>
        </thead>
@@ -165,4 +165,4 @@ img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
 inpaint = StableDiffusionInpaintPipeline(**text2img.components)

 # now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline
-```
+```
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
@@ -14,12 +14,12 @@ specific language governing permissions and limitations under the License.

 Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/).

-*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. 
+*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels.
 These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).*

 For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release).

-The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler.
+The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img), so check out its API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as few as 20 steps.

 Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image:

@@ -35,7 +35,7 @@ Here are some examples for how to use Stable Diffusion 2 for each task:

 <Tip>

-Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

 If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!

@@ -55,30 +55,21 @@ pipe = pipe.to("cuda")

 prompt = "High quality photo of an astronaut riding a horse in space"
 image = pipe(prompt, num_inference_steps=25).images[0]
-image.save("astronaut.png")
+image
 ```

 ## Inpainting

 ```py
-import PIL
-import requests
 import torch
-from io import BytesIO
-
 from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
-
-
-def download_image(url):
-    response = requests.get(url)
-    return PIL.Image.open(BytesIO(response.content)).convert("RGB")
-
+from diffusers.utils import load_image, make_image_grid

 img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
 mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

-init_image = download_image(img_url).resize((512, 512))
-mask_image = download_image(mask_url).resize((512, 512))
+init_image = load_image(img_url).resize((512, 512))
+mask_image = load_image(mask_url).resize((512, 512))

 repo_id = "stabilityai/stable-diffusion-2-inpainting"
 pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
@@ -88,17 +79,14 @@ pipe = pipe.to("cuda")

 prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
 image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
-
-image.save("yellow_cat.png")
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
 ```

 ## Super-resolution

 ```py
-import requests
-from PIL import Image
-from io import BytesIO
 from diffusers import StableDiffusionUpscalePipeline
+from diffusers.utils import load_image, make_image_grid
 import torch

 # load model and scheduler
@@ -108,22 +96,19 @@ pipeline = pipeline.to("cuda")

 # let's download an  image
 url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
-response = requests.get(url)
-low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
+low_res_img = load_image(url)
 low_res_img = low_res_img.resize((128, 128))
 prompt = "a white cat"
 upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
-upscaled_image.save("upsampled_cat.png")
+make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2)
 ```

 ## Depth-to-image

 ```py
 import torch
-import requests
-from PIL import Image
-
 from diffusers import StableDiffusionDepth2ImgPipeline
+from diffusers.utils import load_image, make_image_grid

 pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
@@ -132,8 +117,9 @@ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(


 url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-init_image = Image.open(requests.get(url, stream=True).raw)
+init_image = load_image(url)
 prompt = "two tigers"
-n_propmt = "bad, deformed, ugly, bad anotomy"
-image = pipe(prompt=prompt, image=init_image, negative_prompt=n_propmt, strength=0.7).images[0]
-```
+negative_prompt = "bad, deformed, ugly, bad anatomy"
+image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
@@ -23,7 +23,7 @@ The abstract from the paper is:
 - Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers:
 	- set `use_karras_sigmas=True` or `lu_lambdas=True` to improve image quality
 	- set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE)
- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't for for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
+- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't work for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
 - SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders.
 - SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
 - SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters.
@@ -32,7 +32,7 @@ The abstract from the paper is:

 To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide.

-Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! 
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints!

 </Tip>

--- a/docs/source/en/api/pipelines/stable_diffusion/text2img.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.md
@@ -20,7 +20,7 @@ The abstract from the paper is:

 <Tip>

-Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

 If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!

@@ -56,4 +56,4 @@ If you're interested in using one of the official checkpoints for a task, explor

 ## FlaxStableDiffusionPipelineOutput

-[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/stable_diffusion/upscale.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.md
@@ -16,7 +16,7 @@ The Stable Diffusion upscaler diffusion model was created by the researchers and

 <Tip>

-Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

 If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!

@@ -34,4 +34,4 @@ If you're interested in using one of the official checkpoints for a task, explor

 ## StableDiffusionPipelineOutput

-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/stable_unclip.md
+++ b/docs/source/en/api/pipelines/stable_unclip.md
@@ -22,12 +22,10 @@ The abstract from the paper is:

 ## Tips

-Stable unCLIP takes  `noise_level` as input during inference which determines how much noise is added 
-to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, 
-we do not add any additional noise to the image embeddings (`noise_level = 0`).
+Stable unCLIP takes `noise_level` as input during inference, which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`).

 ### Text-to-Image Generation
-Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha)
+Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha):

 ```python
 import torch
@@ -60,12 +58,12 @@ pipe = StableUnCLIPPipeline.from_pretrained(
 pipe = pipe.to("cuda")
 wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"

-images = pipe(prompt=wave_prompt).images
-images[0].save("waves.png")
+image = pipe(prompt=wave_prompt).images[0]
+image
 ```
 <Tip warning={true}>

-For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embedding, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use. 
+For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embedding, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use.

 </Tip>

@@ -90,12 +88,19 @@ images[0].save("variation_image.png")

 Optionally, you can also pass a prompt to `pipe` such as:

-```python 
+```python
 prompt = "A fantasy landscape, trending on artstation"

-images = pipe(init_image, prompt=prompt).images
-images[0].save("variation_image_two.png")
+image = pipe(init_image, prompt=prompt).images[0]
+image
 ```
+
+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## StableUnCLIPPipeline

 [[autodoc]] StableUnCLIPPipeline
@@ -108,7 +113,6 @@ images[0].save("variation_image_two.png")
 	- enable_xformers_memory_efficient_attention
 	- disable_xformers_memory_efficient_attention

-
 ## StableUnCLIPImg2ImgPipeline

 [[autodoc]] StableUnCLIPImg2ImgPipeline
@@ -120,6 +124,6 @@ images[0].save("variation_image_two.png")
 	- disable_vae_slicing
 	- enable_xformers_memory_efficient_attention
 	- disable_xformers_memory_efficient_attention
-    
+
 ## ImagePipelineOutput
-[[autodoc]] pipelines.ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/stochastic_karras_ve.md
+++ b/docs/source/en/api/pipelines/stochastic_karras_ve.md
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.

 The abstract from the paper:

-*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.*
+*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.*

 <Tip>

@@ -30,4 +30,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 	- __call__

 ## ImagePipelineOutput
-[[autodoc]] pipelines.ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/text_to_video.md
+++ b/docs/source/en/api/pipelines/text_to_video.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 <Tip warning={true}>

-🧪 This pipeline is for research purposes only. 
+🧪 This pipeline is for research purposes only.

 </Tip>

@@ -26,13 +26,13 @@ The abstract from the paper is:

 You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense).

-## Usage example 
+## Usage example

 ### `text-to-video-ms-1.7b`

 Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

-```python 
+```python
 import torch
 from diffusers import DiffusionPipeline
 from diffusers.utils import export_to_video
@@ -88,7 +88,7 @@ video_path = export_to_video(video_frames)
 video_path
 ```

-Here are some sample outputs: 
+Here are some sample outputs:

 <table>
    <tr>
@@ -118,8 +118,9 @@ which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zero

 ```py
 import torch
-from diffusers import DiffusionPipeline
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
 from diffusers.utils import export_to_video
+from PIL import Image

 pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
 pipe.enable_model_cpu_offload()
@@ -152,7 +153,7 @@ video_path = export_to_video(video_frames)
 video_path
 ```

-Here are some sample outputs: 
+Here are some sample outputs:

 <table>
    <tr>
@@ -166,6 +167,12 @@ Here are some sample outputs:
    </tr>
 </table>

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## TextToVideoSDPipeline
 [[autodoc]] TextToVideoSDPipeline
 	- all
--- a/docs/source/en/api/pipelines/text_to_video_zero.md
+++ b/docs/source/en/api/pipelines/text_to_video_zero.md
@@ -12,12 +12,7 @@ specific language governing permissions and limitations under the License.

 # Text2Video-Zero

-[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by
-Levon Khachatryan,
-Andranik Movsisyan,
-Vahram Tadevosyan,
-Roberto Henschel,
-[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).
+[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).

 Text2Video-Zero enables zero-shot video generation using either:
 1. A textual prompt
@@ -35,16 +30,15 @@ Our key modifications include (i) enriching the latent codes of the generated fr
 Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
 As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.*

-You can find additional information about Text-to-Video Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero).
+You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero).

 ## Usage example

 ### Text-To-Video

-To generate a video from prompt, run the following python command
+To generate a video from a prompt, run the following Python code:
 ```python
 import torch
-import imageio
 from diffusers import TextToVideoZeroPipeline

 model_id = "runwayml/stable-diffusion-v1-5"
@@ -63,18 +57,17 @@ You can change these parameters in the pipeline call:
 * Video length:
    * `video_length`, the number of frames video_length to be generated. Default: `video_length=8`

-We an also generate longer videos by doing the processing in a chunk-by-chunk manner:
+We can also generate longer videos by doing the processing in a chunk-by-chunk manner:
 ```python
 import torch
-import imageio
 from diffusers import TextToVideoZeroPipeline
 import numpy as np

 model_id = "runwayml/stable-diffusion-v1-5"
 pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
 seed = 0
-video_length = 8
-chunk_size = 4
+video_length = 24  # 24 ÷ 4 fps = 6 seconds
+chunk_size = 8
 prompt = "A panda is playing guitar on times square"

 # Generate the video chunk-by-chunk
@@ -122,7 +115,7 @@ To generate a video from prompt with additional pose control
    frame_count = 8
    pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
    ```
-    To extract pose from actual video, read [ControlNet documentation](./stable_diffusion/controlnet).
+    To extract poses from an actual video, read the [ControlNet documentation](controlnet).

 3. Run `StableDiffusionControlNetPipeline` with our custom attention processor

@@ -152,13 +145,12 @@ To generate a video from prompt with additional pose control

 ### Text-To-Video with Edge Control

-To generate a video from prompt with additional pose control,
-follow the steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).
+To generate a video from a prompt with additional Canny edge control, follow the same steps described above for pose-guided generation, using the [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).


 ### Video Instruct-Pix2Pix

-To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/pix2pix)):
+To perform text-guided video editing (with [InstructPix2Pix](pix2pix)):

 1. Download a demo video

@@ -196,12 +188,12 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/
    ```


-### DreamBooth specialization 
+### DreamBooth specialization

 Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control**
-can run with custom [DreamBooth](../training/dreambooth) models, as shown below for
+can run with custom [DreamBooth](../../training/dreambooth) models, as shown below for
 [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
-[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model
+[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model:

 1. Download a demo video

@@ -250,6 +242,11 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below

 You can find some of the available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth).

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>

 ## TextToVideoZeroPipeline
 [[autodoc]] TextToVideoZeroPipeline
@@ -257,4 +254,4 @@ You can filter out some available DreamBooth-trained models with [this link](htt
 	- __call__

 ## TextToVideoPipelineOutput
-[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
+[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
--- a/docs/source/en/api/pipelines/unclip.md
+++ b/docs/source/en/api/pipelines/unclip.md
@@ -9,13 +9,13 @@ specific language governing permissions and limitations under the License.

 # unCLIP

-[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo]((https://github.com/kakaobrain/karlo)).
+[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo).

 The abstract from the paper is:

 *Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*

-You can find lucidrains DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).
+You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).

 <Tip>

--- a/docs/source/en/api/pipelines/unidiffuser.md
+++ b/docs/source/en/api/pipelines/unidiffuser.md
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.

 The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

-The abstract from the [paper](https://arxiv.org/abs/2303.06555) is:
+The abstract from the paper is:

 *This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).*

@@ -54,7 +54,7 @@ image.save("unidiffuser_joint_sample_image.png")
 print(text)
 ```

-This is also called "joint" generation in the UniDiffusers paper, since we are sampling from the joint image-text distribution.
+This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.

 Note that the generation task is inferred from the inputs used when calling the pipeline.
 It is also possible to specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]:
@@ -65,7 +65,7 @@ pipe.set_joint_mode()
 sample = pipe(num_inference_steps=20, guidance_scale=8.0)
 ```

-When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting the infer the mode.
+When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode.
 You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode.
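The set/reset behaviour described above can be sketched with a toy stand-in. `ToyModePipeline` and `resolve_mode` are hypothetical names used only for illustration, and the inference rules are a simplification rather than the actual `UniDiffuserPipeline` logic; only the method names `set_joint_mode` and `reset_mode` mirror the real pipeline.

```python
from typing import Optional


class ToyModePipeline:
    """Hypothetical stand-in that mimics only the mode bookkeeping."""

    def __init__(self) -> None:
        # None means "infer the mode from the inputs on every call".
        self.mode: Optional[str] = None

    def set_joint_mode(self) -> None:
        # Pin the mode; later calls stop inferring it.
        self.mode = "joint"

    def reset_mode(self) -> None:
        # Go back to inferring the mode from the inputs.
        self.mode = None

    def resolve_mode(self, prompt: Optional[str] = None, image: Optional[object] = None) -> str:
        # A manually set mode always wins over inference.
        if self.mode is not None:
            return self.mode
        # Simplified inference (assumption): prompt -> text2img,
        # image -> img2text, no inputs -> joint.
        if prompt is not None:
            return "text2img"
        if image is not None:
            return "img2text"
        return "joint"


toy = ToyModePipeline()
inferred = toy.resolve_mode(prompt="an elephant under the sea")
toy.set_joint_mode()
pinned = toy.resolve_mode(prompt="an elephant under the sea")
toy.reset_mode()
after_reset = toy.resolve_mode(prompt="an elephant under the sea")
print(inferred, pinned, after_reset)
```

The point of the sketch is that a manually set mode is sticky: the same inputs resolve differently until the mode is reset.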

 You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively):
@@ -100,7 +100,7 @@ prompt = "an elephant under the sea"

 sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
 t2i_image = sample.images[0]
-t2i_image.save("unidiffuser_text2img_sample_image.png")
+t2i_image
 ```

 The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`].
@@ -133,7 +133,7 @@ The `img2text` mode requires that an input `image` be supplied. You can set the

 ### Image Variation

-The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and the perform a text-to-image generation on the outputs of the first generation.
+The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the outputs of the first generation.
 This produces a new image which is semantically similar to the input image:

 ```python
@@ -147,7 +147,7 @@ model_id_or_path = "thu-ml/unidiffuser-v1"
 pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
 pipe.to(device)

-# Image variation can be performed with a image-to-text generation followed by a text-to-image generation:
+# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
 # 1. Image-to-text generation
 image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
 init_image = load_image(image_url).resize((512, 512))
@@ -164,7 +164,6 @@ final_image.save("unidiffuser_image_variation_sample.png")

 ### Text Variation

-
 Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation:

 ```python
@@ -191,10 +190,16 @@ final_prompt = sample.text[0]
 print(final_prompt)
 ```

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## UniDiffuserPipeline
 [[autodoc]] UniDiffuserPipeline
 	- all
 	- __call__

 ## ImageTextPipelineOutput
-[[autodoc]] pipelines.ImageTextPipelineOutput
+[[autodoc]] pipelines.ImageTextPipelineOutput
--- a/docs/source/en/api/pipelines/value_guided_sampling.md
+++ b/docs/source/en/api/pipelines/value_guided_sampling.md
@@ -22,11 +22,17 @@ This pipeline is based on the [Planning with Diffusion for Flexible Behavior Syn

 The abstract from the paper is:

-*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility*.
+*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.*

-You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb). 
+You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb).

 The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning).

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## ValueGuidedRLPipeline
-[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
+[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
--- a/docs/source/en/api/pipelines/versatile_diffusion.md
+++ b/docs/source/en/api/pipelines/versatile_diffusion.md
@@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License.

 # Versatile Diffusion

-Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi .
+Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi.

 The abstract from the paper is:

-*The recent advances in diffusion models have set an impressive milestone in many generation tasks. Trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest in academia and industry. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-flow network, dubbed Versatile Diffusion (VD), that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model. Moreover, we generalize VD to a unified multi-flow multimodal diffusion framework with grouped layers, swappable streams, and other propositions that can process modalities beyond images and text. Through our experiments, we demonstrate that VD and its underlying framework have the following merits: a) VD handles all subtasks with competitive quality; b) VD initiates novel extensions and applications such as disentanglement of style and semantic, image-text dual-guided generation, etc.; c) Through these experiments and applications, VD provides more semantic insights of the generated outputs.*
+*Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable the crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) The success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research.*

 ## Tips

--- a/docs/source/en/api/pipelines/wuerstchen.md
+++ b/docs/source/en/api/pipelines/wuerstchen.md
@@ -1,15 +1,27 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
 # Würstchen

 <img src="https://github.com/dome272/Wuerstchen/assets/61938694/0617c863-165a-43ee-9303-2a17299a0cf9">

-[Würstchen: Efficient Pretraining of Text-to-Image Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville.
+[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher Pal, and Marc Aubreville.

 The abstract from the paper is:

-*We introduce Würstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden, typically associated with state-of-the-art models, while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, thereby rendering real-time applications more viable. One of the key advantages of our method lies in its modest training requirements of only 9,200 GPU hours, slashing the usual costs significantly without compromising the end performance. In a comparison against the state-of-the-art, we found the approach to yield strong competitiveness. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, hence democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.*
+*We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.*

 ## Würstchen Overview
-Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637) ). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference.
+Würstchen is a diffusion model whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32 images. Other works usually use a relatively small compression, in the range of 4x to 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, while also allowing cheaper and faster inference.
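To get a rough feel for what the compression factors above mean, the sketch below counts the spatial positions a latent grid would have for a 1024x1024 image at 8x and at 42x spatial compression. The helper `latent_positions` is a hypothetical name; the calculation ignores channel counts and rounding to integer latent sizes, so it is only a back-of-the-envelope estimate, not the model's actual latent shape.

```python
def latent_positions(image_side: int, compression: float) -> float:
    """Approximate number of spatial positions in a square latent grid."""
    side = image_side / compression  # latent side length, not rounded
    return side * side


positions_8x = latent_positions(1024, 8)    # typical small compression
positions_42x = latent_positions(1024, 42)  # compression quoted for Würstchen
reduction = positions_8x / positions_42x    # equals (42 / 8) ** 2

print(f"8x compression:  {positions_8x:.0f} positions")
print(f"42x compression: {positions_42x:.0f} positions")
print(f"reduction:       {reduction:.1f}x fewer positions")
```

Under these assumptions, the 42x latent has roughly 27.6 times fewer spatial positions than an 8x latent, which gives a sense of why working in that space is cheaper.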

 ## Würstchen v2 comes to Diffusers

@@ -21,7 +33,7 @@ After the initial paper release, we have improved numerous things in the archite
 - Better quality


-We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are: 
+We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:

 - v2-base
 - v2-aesthetic
@@ -45,7 +57,7 @@ pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dty

 caption = "Anthropomorphic cat dressed as a fire fighter"
 images = pipe(
-    caption, 
+    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
@@ -90,7 +102,8 @@ decoder_output = decoder_pipeline(
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
-).images
+).images[0]
+decoder_output
 ```

 ## Speed-Up Inference
@@ -113,6 +126,7 @@ after 1024x1024 is 1152x1152

 The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen).

+
 ## WuerstchenCombinedPipeline

 [[autodoc]] WuerstchenCombinedPipeline
@@ -139,8 +153,8 @@ The original codebase, as well as experimental ideas, can be found at [dome272/W

 ```bibtex
      @misc{pernias2023wuerstchen,
-            title={Wuerstchen: Efficient Pretraining of Text-to-Image Models}, 
-            author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
+            title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
+            author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
            year={2023},
            eprint={2306.00637},
            archivePrefix={arXiv},