diff --git a/docs/source/en/using-diffusers/kandinsky.md b/docs/source/en/using-diffusers/kandinsky.md
index 05be2e1ee2..0fbec32a52 100644
--- a/docs/source/en/using-diffusers/kandinsky.md
+++ b/docs/source/en/using-diffusers/kandinsky.md
@@ -20,6 +20,8 @@ The Kandinsky models are a series of multilingual text-to-image generation model
 [Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.
 
+[Kandinsky 3](../api/pipelines/kandinsky3) simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGan-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.
+
 This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.
 
 Before you begin, make sure you have the following libraries installed:
 
@@ -33,6 +35,10 @@ Before you begin, make sure you have the following libraries installed:
 
 Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.
 
+
+
+Kandinsky 3 has a more concise architecture and it doesn't require a prior model. This means its usage is identical to other diffusion models like [Stable Diffusion XL](sdxl).
+
 
 ## Text-to-image
@@ -91,6 +97,23 @@ image
+
+
+
+Kandinsky 3 doesn't require a prior model so you can directly load the [`Kandinsky3Pipeline`] and pass a prompt to generate an image:
+
+```py
+from diffusers import Kandinsky3Pipeline
+import torch
+
+pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+image = pipeline(prompt).images[0]
+image
+```
+
@@ -161,6 +184,20 @@ prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kan
 pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
 ```
+
+
+
+Kandinsky 3 doesn't require a prior model so you can directly load the image-to-image pipeline:
+
+```py
+from diffusers import Kandinsky3Img2ImgPipeline
+from diffusers.utils import load_image
+import torch
+
+pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+```
+
@@ -218,6 +255,14 @@ make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], r
+
+
+
+```py
+image = pipeline(prompt, negative_prompt=negative_prompt, image=image, strength=0.75, num_inference_steps=25).images[0]
+image
+```
+
diff --git a/src/diffusers/pipelines/kandinsky3/pipeline_kandinsky3.py b/src/diffusers/pipelines/kandinsky3/pipeline_kandinsky3.py
index 4d14fc637b..fcf7ddcb99 100644
--- a/src/diffusers/pipelines/kandinsky3/pipeline_kandinsky3.py
+++ b/src/diffusers/pipelines/kandinsky3/pipeline_kandinsky3.py
@@ -110,7 +110,7 @@ class Kandinsky3Pipeline(DiffusionPipeline, LoraLoaderMixin):
         Encodes the prompt into text encoder hidden states.
 
         Args:
-            prompt (`str` or `List[str]`, *optional*):
+            prompt (`str` or `List[str]`, *optional*):
                 prompt to be encoded
             device: (`torch.device`, *optional*):
                 torch device to place the resulting embeddings on
@@ -365,7 +365,7 @@ class Kandinsky3Pipeline(DiffusionPipeline, LoraLoaderMixin):
             prompt (`str` or `List[str]`, *optional*):
                 The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
                 instead.
-            num_inference_steps (`int`, *optional*, defaults to 50):
+            num_inference_steps (`int`, *optional*, defaults to 25):
                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                 expense of slower inference.
             timesteps (`List[int]`, *optional*):
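For reference, the two Kandinsky 3 image-to-image hunks above load the pipeline in one place and call it in another. A minimal end-to-end sketch of how they combine follows; it is not part of the diff, and the source image URL, prompt, and negative prompt are illustrative assumptions rather than lines added by this patch:

```py
# Sketch only: stitches together the Kandinsky 3 image-to-image hunks above.
# The image URL, prompt, and negative prompt are placeholder assumptions.
import torch
from diffusers import Kandinsky3Img2ImgPipeline
from diffusers.utils import load_image

pipeline = Kandinsky3Img2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

# Any RGB starting image works here; this URL is only an example.
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/frog.png"
init_image = load_image(url)

prompt = "A red cartoon frog, 4k"
negative_prompt = "low quality, bad quality"

# Same call as the final hunk: strength controls how much the input image is altered.
image = pipeline(
    prompt, negative_prompt=negative_prompt, image=init_image, strength=0.75, num_inference_steps=25
).images[0]
image
```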