diff --git a/docs/source/en/api/activations.md b/docs/source/en/api/activations.md index 684238420c..e4f4567cac 100644 --- a/docs/source/en/api/activations.md +++ b/docs/source/en/api/activations.md @@ -1,3 +1,15 @@ + + # Activation functions Customized activation functions for supporting various models in ๐Ÿค— Diffusers. @@ -12,4 +24,4 @@ Customized activation functions for supporting various models in ๐Ÿค— Diffusers. ## ApproximateGELU -[[autodoc]] models.activations.ApproximateGELU \ No newline at end of file +[[autodoc]] models.activations.ApproximateGELU diff --git a/docs/source/en/api/attnprocessor.md b/docs/source/en/api/attnprocessor.md index 0b11c1f5bc..f6ee09f124 100644 --- a/docs/source/en/api/attnprocessor.md +++ b/docs/source/en/api/attnprocessor.md @@ -1,3 +1,15 @@ + + # Attention Processor An attention processor is a class for applying different types of attention mechanisms. diff --git a/docs/source/en/api/image_processor.md b/docs/source/en/api/image_processor.md index 7fc66f5ee6..fb446c944c 100644 --- a/docs/source/en/api/image_processor.md +++ b/docs/source/en/api/image_processor.md @@ -12,9 +12,9 @@ specific language governing permissions and limitations under the License. # VAE Image Processor -The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]'s to prepare image inputs for VAE encoding and post-processing outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. +The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and post-processing outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. -All pipelines with [`VaeImageProcessor`] accepts PIL Image, PyTorch tensor, or NumPy arrays as image inputs and returns outputs based on the `output_type` argument by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="pt"`). This allows you to take the generated latents from one pipeline and pass it to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between different pipelines. +All pipelines with [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or NumPy arrays as image inputs and return outputs based on the `output_type` argument by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="latent"`). This allows you to take the generated latents from one pipeline and pass it to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between different pipelines. ## VaeImageProcessor @@ -24,4 +24,4 @@ All pipelines with [`VaeImageProcessor`] accepts PIL Image, PyTorch tensor, or N The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs. 
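The latent hand-off described in the `image_processor.md` hunk above (`output_type="latent"`) is easiest to see with two pipelines chained together. A minimal sketch, assuming the Stable Diffusion text-to-image pipeline feeding the latent upscaler; the model IDs and prompt are illustrative:

```py
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

# Ask the first pipeline for latents instead of decoded images.
latents = pipe(prompt, output_type="latent").images

# Hand the latents straight to the second pipeline; only its output is decoded to PIL.
image = upscaler(prompt=prompt, image=latents, num_inference_steps=20, guidance_scale=0).images[0]
```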
-[[autodoc]] image_processor.VaeImageProcessorLDM3D \ No newline at end of file +[[autodoc]] image_processor.VaeImageProcessorLDM3D diff --git a/docs/source/en/api/internal_classes_overview.md b/docs/source/en/api/internal_classes_overview.md index 421a22d5ce..5c8d2cc0e3 100644 --- a/docs/source/en/api/internal_classes_overview.md +++ b/docs/source/en/api/internal_classes_overview.md @@ -1,3 +1,15 @@ + + # Overview The APIs in this section are more experimental and prone to breaking changes. Most of them are used internally for development, but they may also be useful to you if you're interested in building a diffusion model with some custom parts or if you're interested in some of our helper utilities for working with ๐Ÿค— Diffusers. diff --git a/docs/source/en/api/loaders.md b/docs/source/en/api/loaders.md index 5c7c3ef660..d81b0eb1ab 100644 --- a/docs/source/en/api/loaders.md +++ b/docs/source/en/api/loaders.md @@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License. # Loaders -Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusion model to generate images in a specific style without training or finetuning the entire model. The adapter weights are typically only a tiny fraction of the pretrained model's which making them very portable. ๐Ÿค— Diffusers provides an easy-to-use `LoaderMixin` API to load adapter weights. +Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusion model to generate images in a specific style without training or finetuning the entire model. The adapter weights are very portable because they're typically only a tiny fraction of the pretrained model weights. ๐Ÿค— Diffusers provides an easy-to-use `LoaderMixin` API to load adapter weights. -๐Ÿงช The `LoaderMixins` are highly experimental and prone to future changes. To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`. +๐Ÿงช The `LoaderMixin`s are highly experimental and prone to future changes. To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`. diff --git a/docs/source/en/api/logging.md b/docs/source/en/api/logging.md index cc2d012691..b31b7c1175 100644 --- a/docs/source/en/api/logging.md +++ b/docs/source/en/api/logging.md @@ -51,7 +51,7 @@ logger.warning("WARN") All methods of the logging module are documented below. The main methods are [`logging.get_verbosity`] to get the current level of verbosity in the logger and -[`logging.set_verbosity`] to set the verbosity to the level of your choice. +[`logging.set_verbosity`] to set the verbosity to the level of your choice. In order from the least verbose to the most verbose: diff --git a/docs/source/en/api/models/asymmetricautoencoderkl.md b/docs/source/en/api/models/asymmetricautoencoderkl.md index c7b3ee9b51..1e102943c5 100644 --- a/docs/source/en/api/models/asymmetricautoencoderkl.md +++ b/docs/source/en/api/models/asymmetricautoencoderkl.md @@ -1,3 +1,15 @@ + + # AsymmetricAutoencoderKL Improved larger variational autoencoder (VAE) model with KL loss for inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua. 
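For the `loaders.md` hunk above, the `LoaderMixin` API the prose refers to boils down to a handful of `load_*` methods on the pipeline. A minimal sketch using textual inversion (the concept repo and its `<cat-toy>` token are the usual documentation example; LoRA adapters load the same way through `load_lora_weights`):

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Pull a learned concept embedding into the text encoder.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The new token can now be used directly in prompts.
image = pipe("a <cat-toy> sitting on a park bench").images[0]
```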
@@ -6,7 +18,7 @@ The abstract from the paper is: *StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN* -Evaluation results can be found in section 4.1 of the original paper. +Evaluation results can be found in section 4.1 of the original paper. ## Available checkpoints @@ -16,30 +28,23 @@ Evaluation results can be found in section 4.1 of the original paper. 
## Example Usage ```python -from io import BytesIO -from PIL import Image -import requests from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline +from diffusers.utils import load_image, make_image_grid -def download_image(url: str) -> Image.Image: - response = requests.get(url) - return Image.open(BytesIO(response.content)).convert("RGB") - - -prompt = "a photo of a person" +prompt = "a photo of a person with beard" img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png" mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png" -image = download_image(img_url).resize((256, 256)) -mask_image = download_image(mask_url).resize((256, 256)) +original_image = load_image(img_url).resize((512, 512)) +mask_image = load_image(mask_url).resize((512, 512)) pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting") pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5") pipe.to("cuda") -image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0] -image.save("image.jpeg") +image = pipe(prompt=prompt, image=original_image, mask_image=mask_image).images[0] +make_image_grid([original_image, mask_image, image], rows=1, cols=3) ``` ## AsymmetricAutoencoderKL diff --git a/docs/source/en/api/models/autoencoder_tiny.md b/docs/source/en/api/models/autoencoder_tiny.md index 9b97b6e8e9..1d19539bff 100644 --- a/docs/source/en/api/models/autoencoder_tiny.md +++ b/docs/source/en/api/models/autoencoder_tiny.md @@ -1,6 +1,18 @@ + + # Tiny AutoEncoder -Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can quickly decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly. +Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can quickly decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly. To use with Stable Diffusion v-2.1: @@ -16,7 +28,7 @@ pipe = pipe.to("cuda") prompt = "slice of delicious New York-style berry cheesecake" image = pipe(prompt, num_inference_steps=25).images[0] -image.save("cheesecake.png") +image ``` To use with Stable Diffusion XL 1.0 @@ -33,7 +45,7 @@ pipe = pipe.to("cuda") prompt = "slice of delicious New York-style berry cheesecake" image = pipe(prompt, num_inference_steps=25).images[0] -image.save("cheesecake_sdxl.png") +image ``` ## AutoencoderTiny @@ -42,4 +54,4 @@ image.save("cheesecake_sdxl.png") ## AutoencoderTinyOutput -[[autodoc]] models.autoencoder_tiny.AutoencoderTinyOutput \ No newline at end of file +[[autodoc]] models.autoencoder_tiny.AutoencoderTinyOutput diff --git a/docs/source/en/api/models/autoencoderkl.md b/docs/source/en/api/models/autoencoderkl.md index bc709c422d..f42a4d2941 100644 --- a/docs/source/en/api/models/autoencoderkl.md +++ b/docs/source/en/api/models/autoencoderkl.md @@ -1,3 +1,15 @@ + + # AutoencoderKL The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. 
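The `autoencoder_tiny.md` hunks above only show the changed lines of the examples, so the full swap-in pattern is worth sketching once: replace `pipe.vae` with the tiny distilled autoencoder. The base model ID below is an assumption about the elided context (for SDXL, the `madebyollin/taesdxl` weights are used instead):

```py
import torch
from diffusers import DiffusionPipeline, AutoencoderTiny

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
# Swap the full-size VAE for the tiny distilled one to get near-instant decoding.
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image
```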
The model is used in ๐Ÿค— Diffusers to encode images into latents and to decode latent representations into images. @@ -14,7 +26,7 @@ from the original format using [`FromOriginalVAEMixin.from_single_file`] as foll ```py from diffusers import AutoencoderKL -url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be local file +url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be a local file model = AutoencoderKL.from_single_file(url) ``` diff --git a/docs/source/en/api/models/controlnet.md b/docs/source/en/api/models/controlnet.md index 58359723a0..12bc0110f2 100644 --- a/docs/source/en/api/models/controlnet.md +++ b/docs/source/en/api/models/controlnet.md @@ -1,10 +1,22 @@ + + # ControlNet -The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. +The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. The abstract from the paper is: -*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.* +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. 
Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* ## Loading from the original format diff --git a/docs/source/en/api/models/overview.md b/docs/source/en/api/models/overview.md index 9887c6f757..ab8d9d4e78 100644 --- a/docs/source/en/api/models/overview.md +++ b/docs/source/en/api/models/overview.md @@ -1,8 +1,20 @@ + + # Models -๐Ÿค— Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\). +๐Ÿค— Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\). -All models are built from the base [`ModelMixin`] class which is a [`torch.nn.module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub. +All models are built from the base [`ModelMixin`] class which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub. ## ModelMixin [[autodoc]] ModelMixin @@ -13,4 +25,4 @@ All models are built from the base [`ModelMixin`] class which is a [`torch.nn.mo ## PushToHubMixin -[[autodoc]] utils.PushToHubMixin \ No newline at end of file +[[autodoc]] utils.PushToHubMixin diff --git a/docs/source/en/api/models/prior_transformer.md b/docs/source/en/api/models/prior_transformer.md index 1d2b799ed3..0b849c3006 100644 --- a/docs/source/en/api/models/prior_transformer.md +++ b/docs/source/en/api/models/prior_transformer.md @@ -1,7 +1,18 @@ + + # Prior Transformer -The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents -](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process. +The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process. The abstract from the paper is: @@ -13,4 +24,4 @@ The abstract from the paper is: ## PriorTransformerOutput -[[autodoc]] models.prior_transformer.PriorTransformerOutput \ No newline at end of file +[[autodoc]] models.prior_transformer.PriorTransformerOutput diff --git a/docs/source/en/api/models/transformer2d.md b/docs/source/en/api/models/transformer2d.md index 4ad2b00b6f..0f891edd75 100644 --- a/docs/source/en/api/models/transformer2d.md +++ b/docs/source/en/api/models/transformer2d.md @@ -1,3 +1,15 @@ + + # Transformer2D A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs. 
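Regarding the `overview.md` hunk above (every model derives from `ModelMixin`, which is a `torch.nn.Module` plus Hub-aware save/load), a minimal sketch with an illustrative repo ID and subfolder:

```py
from diffusers import UNet2DConditionModel

# from_pretrained works against the Hub (here, the UNet subfolder of a pipeline repo)...
model = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# ...and save_pretrained/from_pretrained round-trip through a local directory.
model.save_pretrained("./sd15-unet")
reloaded = UNet2DConditionModel.from_pretrained("./sd15-unet")
```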
diff --git a/docs/source/en/api/models/transformer_temporal.md b/docs/source/en/api/models/transformer_temporal.md index d67cf717f9..c936270b79 100644 --- a/docs/source/en/api/models/transformer_temporal.md +++ b/docs/source/en/api/models/transformer_temporal.md @@ -1,3 +1,15 @@ + + # Transformer Temporal A Transformer model for video-like data. @@ -8,4 +20,4 @@ A Transformer model for video-like data. ## TransformerTemporalModelOutput -[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput \ No newline at end of file +[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput diff --git a/docs/source/en/api/models/unet-motion.md b/docs/source/en/api/models/unet-motion.md index 07d4df64c3..cbc8c30ff6 100644 --- a/docs/source/en/api/models/unet-motion.md +++ b/docs/source/en/api/models/unet-motion.md @@ -1,3 +1,15 @@ + + # UNetMotionModel The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet model. diff --git a/docs/source/en/api/models/unet.md b/docs/source/en/api/models/unet.md index 9a488a3231..66508b469a 100644 --- a/docs/source/en/api/models/unet.md +++ b/docs/source/en/api/models/unet.md @@ -1,6 +1,18 @@ + + # UNet1DModel -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 1D UNet model. +The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 1D UNet model. The abstract from the paper is: @@ -10,4 +22,4 @@ The abstract from the paper is: [[autodoc]] UNet1DModel ## UNet1DOutput -[[autodoc]] models.unet_1d.UNet1DOutput \ No newline at end of file +[[autodoc]] models.unet_1d.UNet1DOutput diff --git a/docs/source/en/api/models/unet2d-cond.md b/docs/source/en/api/models/unet2d-cond.md index a669b02a7f..ea385ff924 100644 --- a/docs/source/en/api/models/unet2d-cond.md +++ b/docs/source/en/api/models/unet2d-cond.md @@ -1,6 +1,18 @@ + + # UNet2DConditionModel -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. 
It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model. +The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model. The abstract from the paper is: @@ -16,4 +28,4 @@ The abstract from the paper is: [[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel ## FlaxUNet2DConditionOutput -[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput \ No newline at end of file +[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput diff --git a/docs/source/en/api/models/unet2d.md b/docs/source/en/api/models/unet2d.md index 29e8163f64..7669d4a5d7 100644 --- a/docs/source/en/api/models/unet2d.md +++ b/docs/source/en/api/models/unet2d.md @@ -1,6 +1,18 @@ + + # UNet2DModel -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet model. +The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet model. The abstract from the paper is: @@ -10,4 +22,4 @@ The abstract from the paper is: [[autodoc]] UNet2DModel ## UNet2DOutput -[[autodoc]] models.unet_2d.UNet2DOutput \ No newline at end of file +[[autodoc]] models.unet_2d.UNet2DOutput diff --git a/docs/source/en/api/models/unet3d-cond.md b/docs/source/en/api/models/unet3d-cond.md index 83dbb514c8..4eea0a6d1c 100644 --- a/docs/source/en/api/models/unet3d-cond.md +++ b/docs/source/en/api/models/unet3d-cond.md @@ -1,6 +1,18 @@ + + # UNet3DConditionModel -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. 
There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 3D UNet conditional model. +The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in ๐Ÿค— Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in ๐Ÿค— Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 3D UNet conditional model. The abstract from the paper is: @@ -10,4 +22,4 @@ The abstract from the paper is: [[autodoc]] UNet3DConditionModel ## UNet3DConditionOutput -[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput \ No newline at end of file +[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput diff --git a/docs/source/en/api/models/vq.md b/docs/source/en/api/models/vq.md index cdb6761468..c288b163b2 100644 --- a/docs/source/en/api/models/vq.md +++ b/docs/source/en/api/models/vq.md @@ -1,3 +1,15 @@ + + # VQModel The VQ-VAE model was introduced in [Neural Discrete Representation Learning](https://huggingface.co/papers/1711.00937) by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in ๐Ÿค— Diffusers to decode latent representations into images. Unlike [`AutoencoderKL`], the [`VQModel`] works in a quantized latent space. @@ -12,4 +24,4 @@ The abstract from the paper is: ## VQEncoderOutput -[[autodoc]] models.vq_model.VQEncoderOutput \ No newline at end of file +[[autodoc]] models.vq_model.VQEncoderOutput diff --git a/docs/source/en/api/normalization.md b/docs/source/en/api/normalization.md index 7e09976b15..ccc643ac5e 100644 --- a/docs/source/en/api/normalization.md +++ b/docs/source/en/api/normalization.md @@ -1,3 +1,15 @@ + + # Normalization layers Customized normalization layers for supporting various models in ๐Ÿค— Diffusers. @@ -10,6 +22,10 @@ Customized normalization layers for supporting various models in ๐Ÿค— Diffusers. [[autodoc]] models.normalization.AdaLayerNormZero +## AdaLayerNormSingle + +[[autodoc]] models.normalization.AdaLayerNormSingle + ## AdaGroupNorm -[[autodoc]] models.normalization.AdaGroupNorm \ No newline at end of file +[[autodoc]] models.normalization.AdaGroupNorm diff --git a/docs/source/en/api/outputs.md b/docs/source/en/api/outputs.md index ec64d36498..30bad5646e 100644 --- a/docs/source/en/api/outputs.md +++ b/docs/source/en/api/outputs.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # Outputs -All models outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries. +All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries. 
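To make the tuple/dictionary behaviour in the `outputs.md` hunk above concrete, a compact sketch assuming the standard `ImagePipelineOutput` returned by `DDPMPipeline` (the checkpoint is illustrative):

```py
from diffusers import DDPMPipeline

pipeline = DDPMPipeline.from_pretrained("google/ddpm-cifar10-32")
outputs = pipeline(num_inference_steps=25)  # an ImagePipelineOutput, i.e. a BaseOutput subclass

image = outputs.images[0]        # attribute access
image = outputs["images"][0]     # dictionary-style access by key
(images,) = outputs.to_tuple()   # plain tuple of the non-None fields
```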
For example: @@ -64,4 +64,4 @@ To check a specific pipeline or model output, refer to its corresponding API doc ## ImageTextPipelineOutput -[[autodoc]] ImageTextPipelineOutput \ No newline at end of file +[[autodoc]] ImageTextPipelineOutput diff --git a/docs/source/en/api/schedulers/cm_stochastic_iterative.md b/docs/source/en/api/schedulers/cm_stochastic_iterative.md index a1d5f64036..c112c89a12 100644 --- a/docs/source/en/api/schedulers/cm_stochastic_iterative.md +++ b/docs/source/en/api/schedulers/cm_stochastic_iterative.md @@ -1,10 +1,22 @@ + + # CMStochasticIterativeScheduler [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and onestep scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps. The abstract from the paper is: -*Diffusion models have made significant breakthroughs in image, audio, and video generation, but they depend on an iterative generation process that causes slow sampling speed and caps their potential for real-time applications. To overcome this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either as a way to distill pre-trained diffusion models, or as standalone generative models. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step generation. For example, we achieve the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained as standalone generative models, consistency models also outperform single-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64x64 and LSUN 256x256.* +*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.* The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models). 
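As a companion to the consistency-models abstract rewrite above, a sketch of one-step sampling with the scheduler this page documents. `ConsistencyModelPipeline` uses `CMStochasticIterativeScheduler` by default; the checkpoint ID is quoted from memory, so treat it as an assumption:

```py
import torch
from diffusers import ConsistencyModelPipeline

# Distilled ImageNet-64 consistency model (checkpoint ID assumed).
pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
).to("cuda")

# One-step generation; the default scheduler is CMStochasticIterativeScheduler.
image = pipe(num_inference_steps=1).images[0]
```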
@@ -12,4 +24,4 @@ The original codebase can be found at [openai/consistency_models](https://github [[autodoc]] CMStochasticIterativeScheduler ## CMStochasticIterativeSchedulerOutput -[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput diff --git a/docs/source/en/api/schedulers/ddim.md b/docs/source/en/api/schedulers/ddim.md index c5b79cb95f..422b74cff3 100644 --- a/docs/source/en/api/schedulers/ddim.md +++ b/docs/source/en/api/schedulers/ddim.md @@ -16,13 +16,11 @@ specific language governing permissions and limitations under the License. The abstract from the paper is: -*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, -yet they require simulating a Markov chain for many steps to produce a sample. +*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models -with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. +with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. -We empirically demonstrate that DDIMs can produce high quality samples 10ร— to 50ร— faster in terms of wall-clock time compared to DDPMs, allow us to trade off -computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* +We empirically demonstrate that DDIMs can produce high quality samples 10ร— to 50ร— faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/). @@ -57,13 +55,14 @@ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spaci 4. 
rescale classifier-free guidance to prevent over-exposure ```py -image = pipeline(prompt, guidance_rescale=0.7).images[0] +image = pipe(prompt, guidance_rescale=0.7).images[0] ``` For example: ```py from diffusers import DiffusionPipeline, DDIMScheduler +import torch pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16) pipe.scheduler = DDIMScheduler.from_config( @@ -72,7 +71,8 @@ pipe.scheduler = DDIMScheduler.from_config( pipe.to("cuda") prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k" -image = pipeline(prompt, guidance_rescale=0.7).images[0] +image = pipe(prompt, guidance_rescale=0.7).images[0] +image ``` ## DDIMScheduler diff --git a/docs/source/en/api/schedulers/ddim_inverse.md b/docs/source/en/api/schedulers/ddim_inverse.md index 52c6d7c859..9b28b9dc59 100644 --- a/docs/source/en/api/schedulers/ddim_inverse.md +++ b/docs/source/en/api/schedulers/ddim_inverse.md @@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License. # DDIMInverseScheduler `DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. -The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794.pdf). +The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794). ## DDIMInverseScheduler [[autodoc]] DDIMInverseScheduler diff --git a/docs/source/en/api/schedulers/ddpm.md b/docs/source/en/api/schedulers/ddpm.md index c006850e5d..5402d8863d 100644 --- a/docs/source/en/api/schedulers/ddpm.md +++ b/docs/source/en/api/schedulers/ddpm.md @@ -16,10 +16,10 @@ specific language governing permissions and limitations under the License. The abstract from the paper is: -*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.* +*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. 
Our implementation is available at [this https URL](https://github.com/hojonathanho/diffusion).* ## DDPMScheduler [[autodoc]] DDPMScheduler ## DDPMSchedulerOutput -[[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput diff --git a/docs/source/en/api/schedulers/deis.md b/docs/source/en/api/schedulers/deis.md index 563ede9f0d..fc05dd39ee 100644 --- a/docs/source/en/api/schedulers/deis.md +++ b/docs/source/en/api/schedulers/deis.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # DEISMultistepScheduler -Diffusion Exponential Integrator Sampler (DEIS) is proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high order solver for diffusion ordinary differential equations (ODEs). +Diffusion Exponential Integrator Sampler (DEIS) is proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high order solver for diffusion ordinary differential equations (ODEs). This implementation modifies the polynomial fitting formula in log-rho space instead of the original linear `t` space in the DEIS paper. The modification enjoys closed-form coefficients for exponential multistep update instead of replying on the numerical solver. @@ -20,8 +20,6 @@ The abstract from the paper is: *The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU to generate 50k images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID, and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at [this https URL](https://github.com/qsh-zh/deis).* -The original codebase can be found at [qsh-zh/deis](https://github.com/qsh-zh/deis). - ## Tips It is recommended to set `solver_order` to 2 or 3, while `solver_order=1` is equivalent to [`DDIMScheduler`]. 
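The `solver_order` tip in the `deis.md` hunk above is applied through the usual scheduler-swap pattern. A minimal sketch (model ID and prompt are illustrative):

```py
import torch
from diffusers import DiffusionPipeline, DEISMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# solver_order=2 or 3 is recommended; solver_order=1 would reduce to DDIM behaviour.
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config, solver_order=2)
pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=20).images[0]
```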
@@ -33,4 +31,4 @@ diffusion models, you can set `thresholding=True` to use the dynamic thresholdin [[autodoc]] DEISMultistepScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/dpm_discrete.md b/docs/source/en/api/schedulers/dpm_discrete.md index a8a95a1040..eea09915c6 100644 --- a/docs/source/en/api/schedulers/dpm_discrete.md +++ b/docs/source/en/api/schedulers/dpm_discrete.md @@ -20,4 +20,4 @@ The original codebase can be found at [crowsonkb/k-diffusion](https://github.com [[autodoc]] KDPM2DiscreteScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/dpm_discrete_ancestral.md b/docs/source/en/api/schedulers/dpm_discrete_ancestral.md index 61c68f1cb5..5f8ae193c5 100644 --- a/docs/source/en/api/schedulers/dpm_discrete_ancestral.md +++ b/docs/source/en/api/schedulers/dpm_discrete_ancestral.md @@ -20,4 +20,4 @@ The original codebase can be found at [crowsonkb/k-diffusion](https://github.com [[autodoc]] KDPM2AncestralDiscreteScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/dpm_sde.md b/docs/source/en/api/schedulers/dpm_sde.md index 1eb8b6b666..1486ba3d27 100644 --- a/docs/source/en/api/schedulers/dpm_sde.md +++ b/docs/source/en/api/schedulers/dpm_sde.md @@ -18,4 +18,4 @@ The `DPMSolverSDEScheduler` is inspired by the stochastic sampler from the [Eluc [[autodoc]] DPMSolverSDEScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/euler.md b/docs/source/en/api/schedulers/euler.md index f1b6ed1146..9274328337 100644 --- a/docs/source/en/api/schedulers/euler.md +++ b/docs/source/en/api/schedulers/euler.md @@ -19,4 +19,4 @@ The Euler scheduler (Algorithm 2) is from the [Elucidating the Design Space of D [[autodoc]] EulerDiscreteScheduler ## EulerDiscreteSchedulerOutput -[[autodoc]] schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput diff --git a/docs/source/en/api/schedulers/euler_ancestral.md b/docs/source/en/api/schedulers/euler_ancestral.md index f0e817b49b..c78a407d2e 100644 --- a/docs/source/en/api/schedulers/euler_ancestral.md +++ b/docs/source/en/api/schedulers/euler_ancestral.md @@ -18,4 +18,4 @@ A scheduler that uses ancestral sampling with Euler method steps. 
This is a fast [[autodoc]] EulerAncestralDiscreteScheduler ## EulerAncestralDiscreteSchedulerOutput -[[autodoc]] schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput diff --git a/docs/source/en/api/schedulers/heun.md b/docs/source/en/api/schedulers/heun.md index 725c1a67f4..abfde24a16 100644 --- a/docs/source/en/api/schedulers/heun.md +++ b/docs/source/en/api/schedulers/heun.md @@ -18,4 +18,4 @@ The Heun scheduler (Algorithm 1) is from the [Elucidating the Design Space of Di [[autodoc]] HeunDiscreteScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/ipndm.md b/docs/source/en/api/schedulers/ipndm.md index 68a1d58dec..b812064934 100644 --- a/docs/source/en/api/schedulers/ipndm.md +++ b/docs/source/en/api/schedulers/ipndm.md @@ -18,4 +18,4 @@ specific language governing permissions and limitations under the License. [[autodoc]] IPNDMScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/lcm.md b/docs/source/en/api/schedulers/lcm.md index fb55e52ac1..5223072fd1 100644 --- a/docs/source/en/api/schedulers/lcm.md +++ b/docs/source/en/api/schedulers/lcm.md @@ -1,3 +1,15 @@ + + # Latent Consistency Model Multistep Scheduler ## Overview diff --git a/docs/source/en/api/schedulers/lms_discrete.md b/docs/source/en/api/schedulers/lms_discrete.md index 5fe90dc4e7..46d95da5fc 100644 --- a/docs/source/en/api/schedulers/lms_discrete.md +++ b/docs/source/en/api/schedulers/lms_discrete.md @@ -18,4 +18,4 @@ specific language governing permissions and limitations under the License. [[autodoc]] LMSDiscreteScheduler ## LMSDiscreteSchedulerOutput -[[autodoc]] schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput diff --git a/docs/source/en/api/schedulers/multistep_dpm_solver.md b/docs/source/en/api/schedulers/multistep_dpm_solver.md index 3dffa54d44..ce6bde5544 100644 --- a/docs/source/en/api/schedulers/multistep_dpm_solver.md +++ b/docs/source/en/api/schedulers/multistep_dpm_solver.md @@ -21,7 +21,7 @@ samples, and it can generate quite good samples even in 10 steps. It is recommended to set `solver_order` to 2 for guide sampling, and `solver_order=3` for unconditional sampling. -Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. 
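For the `multistep_dpm_solver.md` tips above, the same scheduler-swap pattern applies; `thresholding` stays off for a latent-space model like Stable Diffusion. A sketch with an illustrative model ID and prompt:

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# solver_order=2 for guided sampling; "dpmsolver++" is the recommended algorithm type.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="dpmsolver++", solver_order=2
)
pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=20).images[0]
```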
@@ -32,4 +32,4 @@ The SDE variant of DPMSolver and DPM-Solver++ is also supported, but only for th [[autodoc]] DPMSolverMultistepScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md b/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md index b63519b41f..6a286f3d0c 100644 --- a/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md +++ b/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md @@ -14,11 +14,11 @@ specific language governing permissions and limitations under the License. `DPMSolverMultistepInverse` is the inverted scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. -The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794.pdf) and notebook implementation of the [`DiffEdit`] latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb). +The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794) and notebook implementation of the [`DiffEdit`] latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb). ## Tips -Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. diff --git a/docs/source/en/api/schedulers/overview.md b/docs/source/en/api/schedulers/overview.md index 20981b7a2a..ef17e43e72 100644 --- a/docs/source/en/api/schedulers/overview.md +++ b/docs/source/en/api/schedulers/overview.md @@ -61,4 +61,4 @@ The different schedulers in this class, depending on the ordinary differential e ## PushToHubMixin -[[autodoc]] utils.PushToHubMixin \ No newline at end of file +[[autodoc]] utils.PushToHubMixin diff --git a/docs/source/en/api/schedulers/pndm.md b/docs/source/en/api/schedulers/pndm.md index bf0e6661e4..33717662ae 100644 --- a/docs/source/en/api/schedulers/pndm.md +++ b/docs/source/en/api/schedulers/pndm.md @@ -18,4 +18,4 @@ specific language governing permissions and limitations under the License. 
[[autodoc]] PNDMScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/repaint.md b/docs/source/en/api/schedulers/repaint.md index e68b002163..b3910ad710 100644 --- a/docs/source/en/api/schedulers/repaint.md +++ b/docs/source/en/api/schedulers/repaint.md @@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. The abstract from the paper is: -*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. Github Repository: git.io/RePaint*. +*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. GitHub Repository: [this http URL](http://git.io/RePaint).* The original implementation can be found at [andreas128/RePaint](https://github.com/andreas128/). @@ -24,4 +24,4 @@ The original implementation can be found at [andreas128/RePaint](https://github. 
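A rough sketch of how the RePaint scheduler above is typically exercised, reusing the test images already referenced earlier in this diff. `RePaintPipeline` and the call arguments (`jump_length`, `jump_n_sample`, and friends) are written from memory here and should be treated as assumptions rather than the canonical example:

```py
from diffusers import RePaintPipeline, RePaintScheduler
from diffusers.utils import load_image

img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
original_image = load_image(img_url).resize((256, 256))
mask_image = load_image(mask_url).resize((256, 256))

# Repo ID and argument names below are assumptions based on the usual RePaint setup.
scheduler = RePaintScheduler.from_pretrained("google/ddpm-ema-celebahq-256")
pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler).to("cuda")

output = pipe(
    image=original_image,
    mask_image=mask_image,
    num_inference_steps=250,
    jump_length=10,
    jump_n_sample=10,
)
inpainted = output.images[0]
```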
[[autodoc]] RePaintScheduler ## RePaintSchedulerOutput -[[autodoc]] schedulers.scheduling_repaint.RePaintSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_repaint.RePaintSchedulerOutput diff --git a/docs/source/en/api/schedulers/score_sde_ve.md b/docs/source/en/api/schedulers/score_sde_ve.md index 84e077316d..5b930f192d 100644 --- a/docs/source/en/api/schedulers/score_sde_ve.md +++ b/docs/source/en/api/schedulers/score_sde_ve.md @@ -16,10 +16,10 @@ specific language governing permissions and limitations under the License. The abstract from the paper is: -*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model*. +*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. 
In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* ## ScoreSdeVeScheduler [[autodoc]] ScoreSdeVeScheduler ## SdeVeOutput -[[autodoc]] schedulers.scheduling_sde_ve.SdeVeOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_sde_ve.SdeVeOutput diff --git a/docs/source/en/api/schedulers/score_sde_vp.md b/docs/source/en/api/schedulers/score_sde_vp.md index 0f70a42484..204cba8777 100644 --- a/docs/source/en/api/schedulers/score_sde_vp.md +++ b/docs/source/en/api/schedulers/score_sde_vp.md @@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. The abstract from the paper is: -*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model*. +*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. 
We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* diff --git a/docs/source/en/api/schedulers/singlestep_dpm_solver.md b/docs/source/en/api/schedulers/singlestep_dpm_solver.md index b5e1a317e1..8962a3e40d 100644 --- a/docs/source/en/api/schedulers/singlestep_dpm_solver.md +++ b/docs/source/en/api/schedulers/singlestep_dpm_solver.md @@ -23,7 +23,7 @@ The original implementation can be found at [LuChengTHU/dpm-solver](https://gith It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. -Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. @@ -32,4 +32,4 @@ Stable Diffusion. [[autodoc]] DPMSolverSinglestepScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/stochastic_karras_ve.md b/docs/source/en/api/schedulers/stochastic_karras_ve.md index 4e37cce815..eb954d7e5e 100644 --- a/docs/source/en/api/schedulers/stochastic_karras_ve.md +++ b/docs/source/en/api/schedulers/stochastic_karras_ve.md @@ -12,10 +12,10 @@ specific language governing permissions and limitations under the License. # KarrasVeScheduler -`KarrasVeScheduler` is a stochastic sampler tailored o variance-expanding (VE) models. It is based on the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) and [Score-based generative modeling through stochastic differential equations](https://huggingface.co/papers/2011.13456) papers. +`KarrasVeScheduler` is a stochastic sampler tailored to variance-expanding (VE) models. It is based on the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) and [Score-based generative modeling through stochastic differential equations](https://huggingface.co/papers/2011.13456) papers.
## KarrasVeScheduler [[autodoc]] KarrasVeScheduler ## KarrasVeOutput -[[autodoc]] schedulers.scheduling_karras_ve.KarrasVeOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_karras_ve.KarrasVeOutput diff --git a/docs/source/en/api/schedulers/unipc.md b/docs/source/en/api/schedulers/unipc.md index 56c6fd5bac..df514ca4a6 100644 --- a/docs/source/en/api/schedulers/unipc.md +++ b/docs/source/en/api/schedulers/unipc.md @@ -19,19 +19,17 @@ UniPC is by design model-agnostic, supporting pixel-space/latent-space DPMs on u The abstract from the paper is: -*Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM usually requires hundreds of model evaluations, which is computationally expensive. Despite recent progress in designing high-order solvers for DPMs, there still exists room for further speedup, especially in extremely few steps (e.g., 5~10 steps). Inspired by the predictor-corrector for ODE solvers, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256times256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC*. - -The original codebase can be found at [wl-zhao/UniPC](https://github.com/wl-zhao/UniPC). +*Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., <10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function evaluations. Code is available at [this https URL](https://github.com/wl-zhao/UniPC).* ## Tips It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
-Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `predict_x0=True` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. ## UniPCMultistepScheduler [[autodoc]] UniPCMultistepScheduler ## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/docs/source/en/api/schedulers/vq_diffusion.md b/docs/source/en/api/schedulers/vq_diffusion.md index 5d31a3e3c6..09928583f6 100644 --- a/docs/source/en/api/schedulers/vq_diffusion.md +++ b/docs/source/en/api/schedulers/vq_diffusion.md @@ -22,4 +22,4 @@ The abstract from the paper is: [[autodoc]] VQDiffusionScheduler ## VQDiffusionSchedulerOutput -[[autodoc]] schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput \ No newline at end of file +[[autodoc]] schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput diff --git a/docs/source/en/api/utilities.md b/docs/source/en/api/utilities.md index abc3841605..77ada08348 100644 --- a/docs/source/en/api/utilities.md +++ b/docs/source/en/api/utilities.md @@ -1,3 +1,15 @@ + + # Utilities Utility and helper functions for working with ๐Ÿค— Diffusers. @@ -24,4 +36,4 @@ Utility and helper functions for working with ๐Ÿค— Diffusers. ## make_image_grid -[[autodoc]] utils.pil_utils.make_image_grid +[[autodoc]] utils.make_image_grid
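
As a usage illustration of the Tips in `singlestep_dpm_solver.md` above, here is a minimal sketch of the configuration they describe, assuming library defaults for every argument the tips do not mention:

```py
from diffusers import DPMSolverSinglestepScheduler

# Guided sampling: solver_order=2 is the recommended setting (use 3 for
# unconditional sampling). For a pixel-space diffusion model, dynamic
# thresholding is enabled by combining algorithm_type="dpmsolver++" with
# thresholding=True; leave thresholding off for latent-space models such
# as Stable Diffusion.
scheduler = DPMSolverSinglestepScheduler(
    solver_order=2,
    algorithm_type="dpmsolver++",
    thresholding=True,
)
```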
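
The Tips in `unipc.md` map onto the scheduler's `solver_order`, `predict_x0`, and `thresholding` arguments in the same way. The sketch below swaps UniPC into an existing pipeline through `from_config`; the checkpoint id is only an illustrative choice, and thresholding stays disabled because Stable Diffusion operates in latent space:

```py
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

# Any Stable Diffusion checkpoint works here; this id is just an example.
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Reuse the pipeline's scheduler config and override only the UniPC settings.
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config,
    solver_order=2,      # 2 for guided sampling, 3 for unconditional sampling
    predict_x0=True,     # data-prediction formulation
    thresholding=False,  # dynamic thresholding is for pixel-space models only
)

image = pipe("an astronaut riding a horse").images[0]
```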