-"A photo of a banana-shaped couch in a living room"
+A cute cat lounges on a leaf in a pool during a peaceful summer afternoon, in lofi art style, illustration.

-"A vibrant yellow banana-shaped couch sits in a cozy living room, its curve cradling a pile of colorful cushions. on the wooden floor, a patterned rug adds a touch of eclectic charm, and a potted plant sits in the corner, reaching towards the sunlight filtering through the windows"
+A cute cat lounges on a floating leaf in a sparkling pool during a peaceful summer afternoon. Clear reflections ripple across the water, with sunlight casting soft, smooth highlights. The illustration is detailed and polished, with elegant lines and harmonious colors, evoking a relaxing, serene, and whimsical lofi mood, anime-inspired and visually comforting.
-## Prompt enhancing with GPT2
-
-Prompt enhancing is a technique for quickly improving prompt quality without spending too much effort constructing one. It uses a model like GPT2 pretrained on Stable Diffusion text prompts to automatically enrich a prompt with additional important keywords to generate high-quality images.
-
-The technique works by curating a list of specific keywords and forcing the model to generate those words to enhance the original prompt. This way, your prompt can be "a cat" and GPT2 can enhance the prompt to "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic".
+Be specific and add context. Use photography terms like lens type, focal length, camera angle, and depth of field, as in the sketch after the tip below.
> [!TIP]
-> You should also use a [*offset noise*](https://www.crosslabs.org//blog/diffusion-with-offset-noise) LoRA to improve the contrast in bright and dark images and create better lighting overall. This [LoRA](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_offset_example-lora_1.0.safetensors) is available from [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0).
-
-Start by defining certain styles and a list of words (you can check out a more comprehensive list of [words](https://hf.co/LykosAI/GPT-Prompt-Expansion-Fooocus-v2/blob/main/positive.txt) and [styles](https://github.com/lllyasviel/Fooocus/tree/main/sdxl_styles) used by Fooocus) to enhance a prompt with.
-
-```py
-import torch
-from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
-from diffusers import StableDiffusionXLPipeline
-
-styles = {
- "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
- "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
- "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
- "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
- "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
- "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
-}
-
-words = [
- "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
- "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
- "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
- "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
- "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
- "elegant", "awesome", "amazing", "dynamic", "trendy",
-]
-```
-
-You may have noticed in the `words` list, there are certain words that can be paired together to create something more meaningful. For example, the words "high" and "quality" can be combined to create "high quality". Let's pair these words together and remove the words that can't be paired.
-
-```py
-word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]
-
-def find_and_order_pairs(s, pairs):
- words = s.split()
- found_pairs = []
- for pair in pairs:
- pair_words = pair.split()
- if pair_words[0] in words and pair_words[1] in words:
- found_pairs.append(pair)
- words.remove(pair_words[0])
- words.remove(pair_words[1])
-
- for word in words[:]:
- for pair in pairs:
- if word in pair.split():
- words.remove(word)
- break
- ordered_pairs = ", ".join(found_pairs)
- remaining_s = ", ".join(words)
- return ordered_pairs, remaining_s
-```
-
-Next, implement a custom [`~transformers.LogitsProcessor`] class that assigns tokens in the `words` list a value of 0 and assigns tokens not in the `words` list a negative value so they aren't picked during generation. This way, generation is biased towards words in the `words` list. After a word from the list is used, it is also assigned a negative value so it isn't picked again.
-
-```py
-class CustomLogitsProcessor(LogitsProcessor):
- def __init__(self, bias):
- super().__init__()
- self.bias = bias
-
- def __call__(self, input_ids, scores):
- if len(input_ids.shape) == 2:
- last_token_id = input_ids[0, -1]
- self.bias[last_token_id] = -1e10
- return scores + self.bias
-
-word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
-bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
-bias[word_ids] = 0
-processor = CustomLogitsProcessor(bias)
-processor_list = LogitsProcessorList([processor])
-```
-
-Combine the prompt and the `cinematic` style prompt defined in the `styles` dictionary earlier.
-
-```py
-prompt = "a cat basking in the sun on a roof in Turkey"
-style = "cinematic"
-
-prompt = styles[style].format(prompt=prompt)
-prompt
-"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
-```
-
-Load a GPT2 tokenizer and model from the [Gustavosta/MagicPrompt-Stable-Diffusion](https://huggingface.co/Gustavosta/MagicPrompt-Stable-Diffusion) checkpoint (this specific checkpoint is trained to generate prompts) to enhance the prompt.
-
-```py
-tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
-model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
- "cuda"
-)
-model.eval()
-
-inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
-token_count = inputs["input_ids"].shape[1]
-max_new_tokens = 50 - token_count
-
-generation_config = GenerationConfig(
- penalty_alpha=0.7,
- top_k=50,
- eos_token_id=model.config.eos_token_id,
- pad_token_id=model.config.eos_token_id,
- pad_token=model.config.pad_token_id,
- do_sample=True,
-)
-
-with torch.no_grad():
- generated_ids = model.generate(
- input_ids=inputs["input_ids"],
- attention_mask=inputs["attention_mask"],
- max_new_tokens=max_new_tokens,
- generation_config=generation_config,
- logits_processor=proccesor_list,
- )
-```
-
-Then you can combine the input prompt and the generated prompt. Feel free to take a look at what the generated prompt (`generated_part`) is, the word pairs that were found (`pairs`), and the remaining words (`words`). This is all packed together in the `enhanced_prompt`.
-
-```py
-output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
-input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
-pairs, words = find_and_order_pairs(generated_part, word_pairs)
-formatted_generated_part = pairs + ", " + words
-enhanced_prompt = input_part + ", " + formatted_generated_part
-enhanced_prompt
-["cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic"]
-```
-
-Finally, load a pipeline and the offset noise LoRA with a *low weight* to generate an image with the enhanced prompt.
-
-```py
-pipeline = StableDiffusionXLPipeline.from_pretrained(
- "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
-).to("cuda")
-
-pipeline.load_lora_weights(
- "stabilityai/stable-diffusion-xl-base-1.0",
- weight_name="sd_xl_offset_example-lora_1.0.safetensors",
- adapter_name="offset",
-)
-pipeline.set_adapters(["offset"], adapter_weights=[0.2])
-
-image = pipeline(
- enhanced_prompt,
- width=1152,
- height=896,
- guidance_scale=7.5,
- num_inference_steps=25,
-).images[0]
-image
-```
-"a cat basking in the sun on a roof in Turkey"
-
-"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
+> Try a [prompt enhancer](https://huggingface.co/models?sort=downloads&search=prompt+enhancer) to help improve your prompt structure.
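For example, here is a minimal sketch of what a specific, context-rich prompt looks like in practice. The checkpoint and prompt are only examples; any text-to-image pipeline works the same way.

```py
import torch
from diffusers import DiffusionPipeline

# example checkpoint; swap in any text-to-image model you like
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# a vague prompt like "a cat by a pool" leaves most decisions to the model;
# adding subject, setting, style, and camera details narrows the result
prompt = (
    "A cute cat lounging on a floating leaf in a sparkling pool, peaceful summer afternoon, "
    "lofi illustration style, 35mm lens, shallow depth of field, low camera angle, soft backlight"
)
image = pipeline(prompt).images[0]
image
```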
## Prompt weighting
-Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. A prompt can include several concepts, which gets turned into contextualized text embeddings. The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion [blog post](https://huggingface.co/blog/stable_diffusion) to learn more about how it works).
+Prompt weighting makes some concepts in a prompt stronger and others weaker. It works by scaling the text embedding vectors that correspond to each concept, giving you control over how much influence each one has on the generated image.
-Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt embeddings is to use [Stable Diffusion Long Prompt Weighted Embedding](https://github.com/xhinker/sd_embed) (sd_embed). Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a [prompt_embeds](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.prompt_embeds) (and optionally [negative_prompt_embeds](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.negative_prompt_embeds)) parameter, such as [`StableDiffusionPipeline`], [`StableDiffusionControlNetPipeline`], and [`StableDiffusionXLPipeline`].
+Diffusers supports this through the `prompt_embeds` and `pooled_prompt_embeds` arguments, which accept the scaled text embedding vectors directly. Use the [sd_embed](https://github.com/xhinker/sd_embed) library to generate these embeddings; it also supports prompts longer than the text encoder's 77-token limit.
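To make the mechanism concrete, the snippet below is a hand-rolled sketch of embedding scaling for a single-encoder Stable Diffusion 1.5 pipeline. It only illustrates the idea and is not how sd_embed is implemented; the library also parses the weight syntax and chunks long prompts for you.

```py
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cute cat lounging on a floating leaf in a pool"
tokens = pipeline.tokenizer(
    prompt, padding="max_length", max_length=pipeline.tokenizer.model_max_length, return_tensors="pt"
).input_ids.to("cuda")

with torch.no_grad():
    prompt_embeds = pipeline.text_encoder(tokens)[0]  # contextualized embeddings, shape [1, 77, 768]

# upweight the embedding vectors at the positions of the token "cat" by 1.5x
cat_id = pipeline.tokenizer.convert_tokens_to_ids("cat</w>")
prompt_embeds[tokens == cat_id] *= 1.5

image = pipeline(prompt_embeds=prompt_embeds).images[0]
```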
-> [!TIP]
-> If your favorite pipeline doesn't have a `prompt_embeds` parameter, please open an [issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can add it!
-
-This guide will show you how to weight your prompts with sd_embed.
-
-Before you begin, make sure you have the latest version of sd_embed installed:
-
-```bash
-pip install git+https://github.com/xhinker/sd_embed.git@main
-```
-
-For this example, let's use [`StableDiffusionXLPipeline`].
+> [!NOTE]
+> The sd_embed library only supports Stable Diffusion, Stable Diffusion XL, Stable Diffusion 3, Stable Cascade, and Flux. Prompt weighting doesn't necessarily help for newer models like Flux, which already have very good prompt adherence.
```py
-from diffusers import StableDiffusionXLPipeline, UniPCMultistepScheduler
-import torch
-
-pipe = StableDiffusionXLPipeline.from_pretrained("Lykon/dreamshaper-xl-1-0", torch_dtype=torch.float16)
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.to("cuda")
+!uv pip install git+https://github.com/xhinker/sd_embed.git@main
```
-To upweight or downweight a concept, surround the text with parentheses. More parentheses applies a heavier weight on the text. You can also append a numerical multiplier to the text to indicate how much you want to increase or decrease its weights by.
+Format weighted text with numerical multipliers or parentheses. More parentheses mean stronger weighting.
| format | multiplier |
|---|---|
-| `(hippo)` | increase by 1.1x |
-| `((hippo))` | increase by 1.21x |
-| `(hippo:1.5)` | increase by 1.5x |
-| `(hippo:0.5)` | decrease by 4x |
+| `(cat)` | increase by 1.1x |
+| `((cat))` | increase by 1.21x |
+| `(cat:1.5)` | increase by 1.5x |
+| `(cat:0.5)` | decrease by 2x |
-Create a prompt and use a combination of parentheses and numerical multipliers to upweight various text.
+Create a weighted prompt and pass it to [get_weighted_text_embeddings_sdxl](https://github.com/xhinker/sd_embed/blob/4a47f71150a22942fa606fb741a1c971d95ba56f/src/sd_embed/embedding_funcs.py#L405) to generate embeddings.
+
+> [!TIP]
+> The function also returns negative embeddings that you can pass to `negative_prompt_embeds` and `negative_pooled_prompt_embeds`, as shown in the negative prompt sketch further below.
```py
+import torch
+from diffusers import DiffusionPipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sdxl
-prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
-This imaginative creature features the distinctive, bulky body of a hippo,
-but with a texture and appearance resembling a golden-brown, crispy waffle.
-The creature might have elements like waffle squares across its skin and a syrup-like sheen.
-It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
-possibly including oversized utensils or plates in the background.
-The image should evoke a sense of playful absurdity and culinary fantasy.
-"""
-
-neg_prompt = """\
-skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
-(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
-extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
-(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
-bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
-(normal quality:2),lowres,((monochrome)),((grayscale))
-"""
-```
-
-Use the `get_weighted_text_embeddings_sdxl` function to generate the prompt embeddings and the negative prompt embeddings. It'll also generated the pooled and negative pooled prompt embeddings since you're using the SDXL model.
-
-> [!TIP]
-> You can safely ignore the error message below about the token index length exceeding the models maximum sequence length. All your tokens will be used in the embedding process.
->
-> ```
-> Token indices sequence length is longer than the specified maximum sequence length for this model
-> ```
-
-```py
-(
- prompt_embeds,
- prompt_neg_embeds,
- pooled_prompt_embeds,
- negative_pooled_prompt_embeds
-) = get_weighted_text_embeddings_sdxl(
- pipe,
- prompt=prompt,
- neg_prompt=neg_prompt
+pipeline = DiffusionPipeline.from_pretrained(
+ "Lykon/dreamshaper-xl-1-0", torch_dtype=torch.bfloat16, device_map="cuda"
)
-image = pipe(
- prompt_embeds=prompt_embeds,
- negative_prompt_embeds=prompt_neg_embeds,
- pooled_prompt_embeds=pooled_prompt_embeds,
- negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
- num_inference_steps=30,
- height=1024,
- width=1024 + 512,
- guidance_scale=4.0,
- generator=torch.Generator("cuda").manual_seed(2)
-).images[0]
-image
+prompt = """
+A (cute cat:1.4) lounges on a (floating leaf:1.2) in a (sparkling pool:1.1) during a peaceful summer afternoon.
+Gentle ripples reflect pastel skies, while (sunlight:1.1) casts soft highlights. The illustration is smooth and polished
+with elegant, sketchy lines and subtle gradients, evoking a ((whimsical, nostalgic, dreamy lofi atmosphere:2.0)),
+(anime-inspired:1.6), calming, comforting, and visually serene.
+"""
+
+prompt_embeds, _, pooled_prompt_embeds, *_ = get_weighted_text_embeddings_sdxl(pipeline, prompt=prompt)
+```
+
+Pass the embeddings to `prompt_embeds` and `pooled_prompt_embeds` to generate your image.
+
+```py
+image = pipeline(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds).images[0]
```
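To weight a negative prompt as well, pass `neg_prompt` to the same function. It returns the negative embeddings alongside the positive ones, which you can forward to `negative_prompt_embeds` and `negative_pooled_prompt_embeds`. A sketch, reusing the pipeline and prompt from above with an example negative prompt:

```py
neg_prompt = "(worst quality:2), (low quality:2), lowres, blurry, ((monochrome)), ((grayscale))"

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = get_weighted_text_embeddings_sdxl(pipeline, prompt=prompt, neg_prompt=neg_prompt)

image = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
).images[0]
```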
-

+
-> [!TIP]
-> Refer to the [sd_embed](https://github.com/xhinker/sd_embed) repository for additional details about long prompt weighting for FLUX.1, Stable Cascade, and Stable Diffusion 1.5.
-
-### Textual inversion
-
-[Textual inversion](../training/text_inversion) is a technique for learning a specific concept from some images which you can use to generate new images conditioned on that concept.
-
-Create a pipeline and use the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] function to load the textual inversion embeddings (feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer) for 100+ trained concepts):
-
-```py
-import torch
-from diffusers import StableDiffusionPipeline
-
-pipe = StableDiffusionPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5",
- torch_dtype=torch.float16,
-).to("cuda")
-pipe.load_textual_inversion("sd-concepts-library/midjourney-style")
-```
-
-Add the `<midjourney-style>` text to the prompt to trigger the textual inversion.
-
-```py
-from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15
-
-prompt = """ A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
-This imaginative creature features the distinctive, bulky body of a hippo,
-but with a texture and appearance resembling a golden-brown, crispy waffle.
-The creature might have elements like waffle squares across its skin and a syrup-like sheen.
-It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
-possibly including oversized utensils or plates in the background.
-The image should evoke a sense of playful absurdity and culinary fantasy.
-"""
-
-neg_prompt = """\
-skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
-(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
-extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
-(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
-bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
-(normal quality:2),lowres,((monochrome)),((grayscale))
-"""
-```
-
-Use the `get_weighted_text_embeddings_sd15` function to generate the prompt embeddings and the negative prompt embeddings.
-
-```py
-(
- prompt_embeds,
- prompt_neg_embeds,
-) = get_weighted_text_embeddings_sd15(
- pipe,
- prompt=prompt,
- neg_prompt=neg_prompt
-)
-
-image = pipe(
- prompt_embeds=prompt_embeds,
- negative_prompt_embeds=prompt_neg_embeds,
- height=768,
- width=896,
- guidance_scale=4.0,
- generator=torch.Generator("cuda").manual_seed(2)
-).images[0]
-image
-```
-
-### DreamBooth
-
-[DreamBooth](../training/dreambooth) is a technique for generating contextualized images of a subject given just a few images of the subject to train on. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use [`~DiffusionPipeline.from_pretrained`] to load the DreamBooth model (feel free to browse the [Stable Diffusion Dreambooth Concepts Library](https://huggingface.co/sd-dreambooth-library) for 100+ trained models):
-
-```py
-import torch
-from diffusers import DiffusionPipeline, UniPCMultistepScheduler
-
-pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-```
-
-Depending on the model you use, you'll need to incorporate the model's unique identifier into your prompt. For example, the `dndcoverart-v1` model uses the identifier `dndcoverart`:
-
-```py
-from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15
-
-prompt = """dndcoverart of A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
-This imaginative creature features the distinctive, bulky body of a hippo,
-but with a texture and appearance resembling a golden-brown, crispy waffle.
-The creature might have elements like waffle squares across its skin and a syrup-like sheen.
-It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
-possibly including oversized utensils or plates in the background.
-The image should evoke a sense of playful absurdity and culinary fantasy.
-"""
-
-neg_prompt = """\
-skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
-(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
-extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
-(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
-bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
-(normal quality:2),lowres,((monochrome)),((grayscale))
-"""
-
-(
- prompt_embeds
- , prompt_neg_embeds
-) = get_weighted_text_embeddings_sd15(
- pipe
- , prompt = prompt
- , neg_prompt = neg_prompt
-)
-```
-
-
+Prompt weighting works with [Textual inversion](./textual_inversion_inference) and [DreamBooth](./dreambooth) adapters too.
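For example, here is a minimal sketch combining a weighted prompt with a textual inversion concept using `get_weighted_text_embeddings_sd15`. The concept, its `<midjourney-style>` trigger token, and the prompt are assumptions for illustration.

```py
import torch
from diffusers import StableDiffusionPipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# load a textual inversion concept; its trigger token is assumed to be <midjourney-style>
pipeline.load_textual_inversion("sd-concepts-library/midjourney-style")

# include the trigger token and weight the rest of the prompt as usual
prompt = "a <midjourney-style> illustration of a (cute cat:1.4) lounging on a floating leaf in a pool"
neg_prompt = "(worst quality:2), (low quality:2), lowres, blurry"

prompt_embeds, negative_prompt_embeds = get_weighted_text_embeddings_sd15(
    pipeline, prompt=prompt, neg_prompt=neg_prompt
)
image = pipeline(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
```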
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/write_own_pipeline.md b/docs/source/en/using-diffusers/write_own_pipeline.md
index 930b0fe21f..e34727b5da 100644
--- a/docs/source/en/using-diffusers/write_own_pipeline.md
+++ b/docs/source/en/using-diffusers/write_own_pipeline.md
@@ -280,5 +280,5 @@ This is really what 🧨 Diffusers is designed for: to make it intuitive and eas
For your next steps, feel free to:
-* Learn how to [build and contribute a pipeline](../using-diffusers/contribute_pipeline) to 🧨 Diffusers. We can't wait and see what you'll come up with!
+* Learn how to [build and contribute a pipeline](../conceptual/contribution) to 🧨 Diffusers. We can't wait and see what you'll come up with!
* Explore [existing pipelines](../api/pipelines/overview) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately.
diff --git a/docs/source/ko/conceptual/ethical_guidelines.md b/docs/source/ko/conceptual/ethical_guidelines.md
index b8c55048bf..63fc4a7741 100644
--- a/docs/source/ko/conceptual/ethical_guidelines.md
+++ b/docs/source/ko/conceptual/ethical_guidelines.md
@@ -14,51 +14,47 @@ specific language governing permissions and limitations under the License.
## 서문 [[preamble]]
-[Diffusers](https://huggingface.co/docs/diffusers/index)는 사전 훈련된 diffusion 모델을 제공하며 추론 및 훈련을 위한 모듈식 툴박스로 사용됩니다.
+[Diffusers](https://huggingface.co/docs/diffusers/index)는 사전 훈련된 diffusion 모델을 제공하며, 추론과 훈련을 위한 모듈형 툴박스로 활용됩니다.
-이 기술의 실제 적용과 사회에 미칠 수 있는 부정적인 영향을 고려하여 Diffusers 라이브러리의 개발, 사용자 기여 및 사용에 윤리 지침을 제공하는 것이 중요하다고 생각합니다.
-
-이이 기술을 사용함에 따른 위험은 여전히 검토 중이지만, 몇 가지 예를 들면: 예술가들에 대한 저작권 문제; 딥 페이크의 악용; 부적절한 맥락에서의 성적 콘텐츠 생성; 동의 없는 사칭; 소수자 집단의 억압을 영속화하는 유해한 사회적 편견 등이 있습니다.
-
-우리는 위험을 지속적으로 추적하고 커뮤니티의 응답과 소중한 피드백에 따라 다음 지침을 조정할 것입니다.
+이 기술의 실제 적용 사례와 사회에 미칠 수 있는 잠재적 부정적 영향을 고려할 때, Diffusers 라이브러리의 개발, 사용자 기여, 사용에 윤리 지침을 제공하는 것이 중요하다고 생각합니다.
+이 기술 사용과 관련된 위험은 여전히 검토 중이지만, 예를 들면: 예술가의 저작권 문제, 딥페이크 악용, 부적절한 맥락에서의 성적 콘텐츠 생성, 비동의 사칭, 소수자 집단 억압을 영속화하는 유해한 사회적 편견 등이 있습니다.
+우리는 이러한 위험을 지속적으로 추적하고, 커뮤니티의 반응과 소중한 피드백에 따라 아래 지침을 조정할 것입니다.
## 범위 [[scope]]
-Diffusers 커뮤니티는 프로젝트의 개발에 다음과 같은 윤리 지침을 적용하며, 특히 윤리적 문제와 관련된 민감한 주제에 대한 커뮤니티의 기여를 조정하는 데 도움을 줄 것입니다.
-
+Diffusers 커뮤니티는 프로젝트 개발에 다음 윤리 지침을 적용하며, 특히 윤리적 문제와 관련된 민감한 주제에 대해 커뮤니티의 기여를 조정하는 데 도움을 줄 것입니다.
## 윤리 지침 [[ethical-guidelines]]
-다음 윤리 지침은 일반적으로 적용되지만, 민감한 윤리적 문제와 관련하여 기술적 선택을 할 때 이를 우선적으로 적용할 것입니다. 나아가, 해당 기술의 최신 동향과 관련된 새로운 위험이 발생함에 따라 이러한 윤리 원칙을 조정할 것을 약속드립니다.
+다음 윤리 지침은 일반적으로 적용되지만, 윤리적으로 민감한 문제와 관련된 기술적 선택을 할 때 우선적으로 적용됩니다. 또한, 해당 기술의 최신 동향과 관련된 새로운 위험이 발생함에 따라 이러한 윤리 원칙을 지속적으로 조정할 것을 약속합니다.
-- **투명성**: 우리는 PR을 관리하고, 사용자에게 우리의 선택을 설명하며, 기술적 의사결정을 내릴 때 투명성을 유지할 것을 약속합니다.
+- **투명성**: 우리는 PR 관리, 사용자에게 선택의 이유 설명, 기술적 의사결정 과정에서 투명성을 유지할 것을 약속합니다.
-- **일관성**: 우리는 프로젝트 관리에서 사용자들에게 동일한 수준의 관심을 보장하고 기술적으로 안정되고 일관된 상태를 유지할 것을 약속합니다.
+- **일관성**: 프로젝트 관리에서 모든 사용자에게 동일한 수준의 관심을 보장하고, 기술적으로 안정적이고 일관된 상태를 유지할 것을 약속합니다.
-- **간결성**: Diffusers 라이브러리를 사용하고 활용하기 쉽게 만들기 위해, 프로젝트의 목표를 간결하고 일관성 있게 유지할 것을 약속합니다.
+- **간결성**: Diffusers 라이브러리를 쉽게 사용하고 활용할 수 있도록, 프로젝트의 목표를 간결하고 일관성 있게 유지할 것을 약속합니다.
-- **접근성**: Diffusers 프로젝트는 기술적 전문 지식 없어도 프로젝트 운영에 참여할 수 있는 기여자의 진입장벽을 낮춥니다. 이를 통해 연구 결과물이 커뮤니티에 더 잘 접근할 수 있게 됩니다.
+- **접근성**: Diffusers 프로젝트는 기술적 전문지식이 없어도 기여할 수 있도록 진입장벽을 낮춥니다. 이를 통해 연구 결과물이 커뮤니티에 더 잘 접근될 수 있습니다.
-- **재현성**: 우리는 Diffusers 라이브러리를 통해 제공되는 업스트림(upstream) 코드, 모델 및 데이터셋의 재현성에 대해 투명하게 공개할 것을 목표로 합니다.
-
-- **책임**: 우리는 커뮤니티와 팀워크를 통해, 이 기술의 잠재적인 위험과 위험을 예측하고 완화하는 데 대한 공동 책임을 가지고 있습니다.
+- **재현성**: 우리는 Diffusers 라이브러리를 통해 제공되는 업스트림 코드, 모델, 데이터셋의 재현성에 대해 투명하게 공개하는 것을 목표로 합니다.
+- **책임**: 커뮤니티와 팀워크를 통해, 이 기술의 잠재적 위험을 예측하고 완화하는 데 공동 책임을 집니다.
## 구현 사례: 안전 기능과 메커니즘 [[examples-of-implementations-safety-features-and-mechanisms]]
-팀은 diffusion 기술과 관련된 잠재적인 윤리 및 사회적 위험에 대처하기 위한 기술적 및 비기술적 도구를 제공하고자 하고 있습니다. 또한, 커뮤니티의 참여는 이러한 기능의 구현하고 우리와 함께 인식을 높이는 데 매우 중요합니다.
+팀은 diffusion 기술과 관련된 잠재적 윤리 및 사회적 위험에 대응하기 위해 기술적·비기술적 도구를 제공하고자 노력하고 있습니다. 또한, 커뮤니티의 참여는 이러한 기능 구현과 인식 제고에 매우 중요합니다.
-- [**커뮤니티 탭**](https://huggingface.co/docs/hub/repositories-pull-requests-discussions): 이를 통해 커뮤니티는 프로젝트에 대해 토론하고 더 나은 협력을 할 수 있습니다.
+- [**커뮤니티 탭**](https://huggingface.co/docs/hub/repositories-pull-requests-discussions): 커뮤니티가 프로젝트에 대해 토론하고 더 나은 협업을 할 수 있도록 지원합니다.
-- **편향 탐색 및 평가**: Hugging Face 팀은 Stable Diffusion 모델의 편향성을 대화형으로 보여주는 [space](https://huggingface.co/spaces/society-ethics/DiffusionBiasExplorer)을 제공합니다. 이런 의미에서, 우리는 편향 탐색 및 평가를 지원하고 장려합니다.
+- **편향 탐색 및 평가**: Hugging Face 팀은 Stable Diffusion 모델의 편향성을 대화형으로 보여주는 [space](https://huggingface.co/spaces/society-ethics/DiffusionBiasExplorer)를 제공합니다. 우리는 이러한 편향 탐색과 평가를 지원하고 장려합니다.
- **배포에서의 안전 유도**
- - [**안전한 Stable Diffusion**](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_safe): 이는 필터되지 않은 웹 크롤링 데이터셋으로 훈련된 Stable Diffusion과 같은 모델이 부적절한 변질에 취약한 문제를 완화합니다. 관련 논문: [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105).
+ - [**안전한 Stable Diffusion**](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_safe): 필터링되지 않은 웹 크롤링 데이터셋으로 훈련된 Stable Diffusion과 같은 모델이 부적절하게 변질되는 문제를 완화합니다. 관련 논문: [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105).
- - [**안전 검사기**](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py): 이미지가 생성된 후에 이미자가 임베딩 공간에서 일련의 하드코딩된 유해 개념의 클래스일 확률을 확인하고 비교합니다. 유해 개념은 역공학을 방지하기 위해 의도적으로 숨겨져 있습니다.
+ - [**안전 검사기**](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py): 생성된 이미지가 임베딩 공간에서 하드코딩된 유해 개념 클래스와 일치할 확률을 확인하고 비교합니다. 유해 개념은 역공학을 방지하기 위해 의도적으로 숨겨져 있습니다.
-- **Hub에서의 단계적인 배포**: 특히 민감한 상황에서는 일부 리포지토리에 대한 접근을 제한해야 합니다. 이 단계적인 배포는 중간 단계로, 리포지토리 작성자가 사용에 대한 더 많은 통제력을 갖게 합니다.
+- **Hub에서의 단계적 배포**: 특히 민감한 상황에서는 일부 리포지토리에 대한 접근을 제한할 수 있습니다. 단계적 배포는 리포지토리 작성자가 사용에 대해 더 많은 통제권을 갖도록 하는 중간 단계입니다.
-- **라이선싱**: [OpenRAILs](https://huggingface.co/blog/open_rail)와 같은 새로운 유형의 라이선싱을 통해 자유로운 접근을 보장하면서도 더 책임 있는 사용을 위한 일련의 제한을 둘 수 있습니다.
+- **라이선싱**: [OpenRAILs](https://huggingface.co/blog/open_rail)와 같은 새로운 유형의 라이선스를 통해 자유로운 접근을 보장하면서도 보다 책임 있는 사용을 위한 일련의 제한을 둘 수 있습니다.
diff --git a/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py b/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py
index a46490e8b3..5aa33190d4 100644
--- a/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py
+++ b/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py
@@ -25,6 +25,10 @@
# "Jinja2",
# "peft>=0.11.1",
# "sentencepiece",
+# "torchvision",
+# "datasets",
+# "bitsandbytes",
+# "prodigyopt",
# ]
# ///
diff --git a/examples/dreambooth/train_dreambooth_lora_flux.py b/examples/dreambooth/train_dreambooth_lora_flux.py
index bd3a974a17..3b6ab814f2 100644
--- a/examples/dreambooth/train_dreambooth_lora_flux.py
+++ b/examples/dreambooth/train_dreambooth_lora_flux.py
@@ -25,6 +25,10 @@
# "Jinja2",
# "peft>=0.11.1",
# "sentencepiece",
+# "torchvision",
+# "datasets",
+# "bitsandbytes",
+# "prodigyopt",
# ]
# ///
diff --git a/examples/dreambooth/train_dreambooth_lora_flux_kontext.py b/examples/dreambooth/train_dreambooth_lora_flux_kontext.py
index 03c05a05e0..fc6df87768 100644
--- a/examples/dreambooth/train_dreambooth_lora_flux_kontext.py
+++ b/examples/dreambooth/train_dreambooth_lora_flux_kontext.py
@@ -14,6 +14,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+# /// script
+# dependencies = [
+# "diffusers @ git+https://github.com/huggingface/diffusers.git",
+# "torch>=2.0.0",
+# "accelerate>=0.31.0",
+# "transformers>=4.41.2",
+# "ftfy",
+# "tensorboard",
+# "Jinja2",
+# "peft>=0.11.1",
+# "sentencepiece",
+# "torchvision",
+# "datasets",
+# "bitsandbytes",
+# "prodigyopt",
+# ]
+# ///
+
import argparse
import copy
import itertools
diff --git a/examples/dreambooth/train_dreambooth_lora_qwen_image.py b/examples/dreambooth/train_dreambooth_lora_qwen_image.py
index feec4da712..56de160d6f 100644
--- a/examples/dreambooth/train_dreambooth_lora_qwen_image.py
+++ b/examples/dreambooth/train_dreambooth_lora_qwen_image.py
@@ -13,6 +13,24 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
+# /// script
+# dependencies = [
+# "diffusers @ git+https://github.com/huggingface/diffusers.git",
+# "torch>=2.0.0",
+# "accelerate>=0.31.0",
+# "transformers>=4.41.2",
+# "ftfy",
+# "tensorboard",
+# "Jinja2",
+# "peft>=0.11.1",
+# "sentencepiece",
+# "torchvision",
+# "datasets",
+# "bitsandbytes",
+# "prodigyopt",
+# ]
+# ///
+
import argparse
import copy
import itertools
@@ -1320,7 +1338,7 @@ def main(args):
batch["pixel_values"] = batch["pixel_values"].to(
accelerator.device, non_blocking=True, dtype=vae.dtype
)
- latents_cache.append(vae.encode(batch["pixel_values"]).latent_dist)
+ latents_cache.append(vae.encode(batch["pixel_values"]).latent_dist)
if train_dataset.custom_instance_prompts:
with offload_models(text_encoding_pipeline, device=accelerator.device, offload=args.offload):
prompt_embeds, prompt_embeds_mask = compute_text_embeddings(
diff --git a/examples/dreambooth/train_dreambooth_lora_sana.py b/examples/dreambooth/train_dreambooth_lora_sana.py
index b188a80916..2b0c1ee669 100644
--- a/examples/dreambooth/train_dreambooth_lora_sana.py
+++ b/examples/dreambooth/train_dreambooth_lora_sana.py
@@ -25,6 +25,10 @@
# "Jinja2",
# "peft>=0.14.0",
# "sentencepiece",
+# "torchvision",
+# "datasets",
+# "bitsandbytes",
+# "prodigyopt",
# ]
# ///
diff --git a/examples/text_to_image/requirements.txt b/examples/text_to_image/requirements.txt
index c3ffa42f0e..be05fe3fcd 100644
--- a/examples/text_to_image/requirements.txt
+++ b/examples/text_to_image/requirements.txt
@@ -5,4 +5,4 @@ datasets>=2.19.1
ftfy
tensorboard
Jinja2
-peft==0.7.0
+peft>=0.17.0
diff --git a/examples/text_to_image/requirements_sdxl.txt b/examples/text_to_image/requirements_sdxl.txt
index 64cbc9205f..4dacc26ce4 100644
--- a/examples/text_to_image/requirements_sdxl.txt
+++ b/examples/text_to_image/requirements_sdxl.txt
@@ -5,4 +5,4 @@ ftfy
tensorboard
Jinja2
datasets
-peft==0.7.0
\ No newline at end of file
+peft>=0.17.0
\ No newline at end of file
diff --git a/scripts/convert_ltx_to_diffusers.py b/scripts/convert_ltx_to_diffusers.py
index 256312cc72..19e5602039 100644
--- a/scripts/convert_ltx_to_diffusers.py
+++ b/scripts/convert_ltx_to_diffusers.py
@@ -369,6 +369,15 @@ def get_spatial_latent_upsampler_config(version: str) -> Dict[str, Any]:
"spatial_upsample": True,
"temporal_upsample": False,
}
+ elif version == "0.9.8":
+ config = {
+ "in_channels": 128,
+ "mid_channels": 512,
+ "num_blocks_per_stage": 4,
+ "dims": 3,
+ "spatial_upsample": True,
+ "temporal_upsample": False,
+ }
else:
raise ValueError(f"Unsupported version: {version}")
return config
@@ -402,7 +411,7 @@ def get_args():
"--version",
type=str,
default="0.9.0",
- choices=["0.9.0", "0.9.1", "0.9.5", "0.9.7"],
+ choices=["0.9.0", "0.9.1", "0.9.5", "0.9.7", "0.9.8"],
help="Version of the LTX model",
)
return parser.parse_args()
diff --git a/setup.py b/setup.py
index 372a568595..8d346ddfec 100644
--- a/setup.py
+++ b/setup.py
@@ -145,6 +145,7 @@ _deps = [
"black",
"phonemizer",
"opencv-python",
+ "timm",
]
# this is a lookup table with items like:
@@ -218,7 +219,7 @@ class DepsTableUpdateCommand(Command):
extras = {}
extras["quality"] = deps_list("urllib3", "isort", "ruff", "hf-doc-builder")
extras["docs"] = deps_list("hf-doc-builder")
-extras["training"] = deps_list("accelerate", "datasets", "protobuf", "tensorboard", "Jinja2", "peft")
+extras["training"] = deps_list("accelerate", "datasets", "protobuf", "tensorboard", "Jinja2", "peft", "timm")
extras["test"] = deps_list(
"compel",
"GitPython",
diff --git a/src/diffusers/__init__.py b/src/diffusers/__init__.py
index 8867250ded..95d559ff75 100644
--- a/src/diffusers/__init__.py
+++ b/src/diffusers/__init__.py
@@ -386,10 +386,14 @@ else:
_import_structure["modular_pipelines"].extend(
[
"FluxAutoBlocks",
+ "FluxKontextAutoBlocks",
+ "FluxKontextModularPipeline",
"FluxModularPipeline",
"QwenImageAutoBlocks",
"QwenImageEditAutoBlocks",
"QwenImageEditModularPipeline",
+ "QwenImageEditPlusAutoBlocks",
+ "QwenImageEditPlusModularPipeline",
"QwenImageModularPipeline",
"StableDiffusionXLAutoBlocks",
"StableDiffusionXLModularPipeline",
@@ -1048,10 +1052,14 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
else:
from .modular_pipelines import (
FluxAutoBlocks,
+ FluxKontextAutoBlocks,
+ FluxKontextModularPipeline,
FluxModularPipeline,
QwenImageAutoBlocks,
QwenImageEditAutoBlocks,
QwenImageEditModularPipeline,
+ QwenImageEditPlusAutoBlocks,
+ QwenImageEditPlusModularPipeline,
QwenImageModularPipeline,
StableDiffusionXLAutoBlocks,
StableDiffusionXLModularPipeline,
diff --git a/src/diffusers/dependency_versions_table.py b/src/diffusers/dependency_versions_table.py
index bfc4e9818b..6e5ac630ab 100644
--- a/src/diffusers/dependency_versions_table.py
+++ b/src/diffusers/dependency_versions_table.py
@@ -52,4 +52,5 @@ deps = {
"black": "black",
"phonemizer": "phonemizer",
"opencv-python": "opencv-python",
+ "timm": "timm",
}
diff --git a/src/diffusers/hooks/context_parallel.py b/src/diffusers/hooks/context_parallel.py
index 83406d4969..915fe453b9 100644
--- a/src/diffusers/hooks/context_parallel.py
+++ b/src/diffusers/hooks/context_parallel.py
@@ -17,7 +17,10 @@ from dataclasses import dataclass
from typing import Dict, List, Type, Union
import torch
-import torch.distributed._functional_collectives as funcol
+
+
+if torch.distributed.is_available():
+ import torch.distributed._functional_collectives as funcol
from ..models._modeling_parallel import (
ContextParallelConfig,
diff --git a/src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py b/src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py
index 7b0f9889a5..dc5e775f67 100644
--- a/src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py
+++ b/src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py
@@ -18,7 +18,6 @@ import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...utils import logging
diff --git a/src/diffusers/models/autoencoders/autoencoder_kl_qwenimage.py b/src/diffusers/models/autoencoders/autoencoder_kl_qwenimage.py
index 87ac406592..9872cf0968 100644
--- a/src/diffusers/models/autoencoders/autoencoder_kl_qwenimage.py
+++ b/src/diffusers/models/autoencoders/autoencoder_kl_qwenimage.py
@@ -23,7 +23,6 @@ from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FromOriginalModelMixin
diff --git a/src/diffusers/models/autoencoders/autoencoder_kl_wan.py b/src/diffusers/models/autoencoders/autoencoder_kl_wan.py
index e6e58c1cce..f95c4cf374 100644
--- a/src/diffusers/models/autoencoders/autoencoder_kl_wan.py
+++ b/src/diffusers/models/autoencoders/autoencoder_kl_wan.py
@@ -17,7 +17,6 @@ from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FromOriginalModelMixin
diff --git a/src/diffusers/models/controlnets/controlnet_xs.py b/src/diffusers/models/controlnets/controlnet_xs.py
index bcb4e25986..f5c69b9a46 100644
--- a/src/diffusers/models/controlnets/controlnet_xs.py
+++ b/src/diffusers/models/controlnets/controlnet_xs.py
@@ -16,7 +16,6 @@ from math import gcd
from typing import Any, Dict, List, Optional, Tuple, Union
import torch
-import torch.utils.checkpoint
from torch import Tensor, nn
from ...configuration_utils import ConfigMixin, register_to_config
diff --git a/src/diffusers/models/transformers/stable_audio_transformer.py b/src/diffusers/models/transformers/stable_audio_transformer.py
index 969e6db122..ac9b3fca41 100644
--- a/src/diffusers/models/transformers/stable_audio_transformer.py
+++ b/src/diffusers/models/transformers/stable_audio_transformer.py
@@ -18,7 +18,6 @@ from typing import Dict, Optional, Union
import numpy as np
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...utils import logging
diff --git a/src/diffusers/models/transformers/transformer_ltx.py b/src/diffusers/models/transformers/transformer_ltx.py
index 9f3840690d..685c73c07c 100644
--- a/src/diffusers/models/transformers/transformer_ltx.py
+++ b/src/diffusers/models/transformers/transformer_ltx.py
@@ -353,7 +353,9 @@ class LTXVideoTransformerBlock(nn.Module):
norm_hidden_states = self.norm1(hidden_states)
num_ada_params = self.scale_shift_table.shape[0]
- ada_values = self.scale_shift_table[None, None] + temb.reshape(batch_size, temb.size(1), num_ada_params, -1)
+ ada_values = self.scale_shift_table[None, None].to(temb.device) + temb.reshape(
+ batch_size, temb.size(1), num_ada_params, -1
+ )
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = ada_values.unbind(dim=2)
norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
diff --git a/src/diffusers/models/transformers/transformer_wan.py b/src/diffusers/models/transformers/transformer_wan.py
index 25c055fb56..dd75fb124f 100644
--- a/src/diffusers/models/transformers/transformer_wan.py
+++ b/src/diffusers/models/transformers/transformer_wan.py
@@ -682,12 +682,12 @@ class WanTransformer3DModel(
# 5. Output norm, projection & unpatchify
if temb.ndim == 3:
# batch_size, seq_len, inner_dim (wan 2.2 ti2v)
- shift, scale = (self.scale_shift_table.unsqueeze(0) + temb.unsqueeze(2)).chunk(2, dim=2)
+ shift, scale = (self.scale_shift_table.unsqueeze(0).to(temb.device) + temb.unsqueeze(2)).chunk(2, dim=2)
shift = shift.squeeze(2)
scale = scale.squeeze(2)
else:
# batch_size, inner_dim
- shift, scale = (self.scale_shift_table + temb.unsqueeze(1)).chunk(2, dim=1)
+ shift, scale = (self.scale_shift_table.to(temb.device) + temb.unsqueeze(1)).chunk(2, dim=1)
# Move the shift and scale tensors to the same device as hidden_states.
# When using multi-GPU inference via accelerate these will be on the
diff --git a/src/diffusers/models/transformers/transformer_wan_vace.py b/src/diffusers/models/transformers/transformer_wan_vace.py
index e5a9c7e0a6..30c38c244a 100644
--- a/src/diffusers/models/transformers/transformer_wan_vace.py
+++ b/src/diffusers/models/transformers/transformer_wan_vace.py
@@ -103,7 +103,7 @@ class WanVACETransformerBlock(nn.Module):
control_hidden_states = control_hidden_states + hidden_states
shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
- self.scale_shift_table + temb.float()
+ self.scale_shift_table.to(temb.device) + temb.float()
).chunk(6, dim=1)
# 1. Self-attention
@@ -361,7 +361,7 @@ class WanVACETransformer3DModel(
hidden_states = hidden_states + control_hint * scale
# 6. Output norm, projection & unpatchify
- shift, scale = (self.scale_shift_table + temb.unsqueeze(1)).chunk(2, dim=1)
+ shift, scale = (self.scale_shift_table.to(temb.device) + temb.unsqueeze(1)).chunk(2, dim=1)
# Move the shift and scale tensors to the same device as hidden_states.
# When using multi-GPU inference via accelerate these will be on the
diff --git a/src/diffusers/models/unets/unet_2d_condition.py b/src/diffusers/models/unets/unet_2d_condition.py
index 33bda8cb1e..f04d3dfa01 100644
--- a/src/diffusers/models/unets/unet_2d_condition.py
+++ b/src/diffusers/models/unets/unet_2d_condition.py
@@ -16,7 +16,6 @@ from typing import Any, Dict, List, Optional, Tuple, Union
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import PeftAdapterMixin, UNet2DConditionLoadersMixin
diff --git a/src/diffusers/models/unets/unet_3d_condition.py b/src/diffusers/models/unets/unet_3d_condition.py
index b5151f3c9a..6a119185b8 100644
--- a/src/diffusers/models/unets/unet_3d_condition.py
+++ b/src/diffusers/models/unets/unet_3d_condition.py
@@ -18,7 +18,6 @@ from typing import Any, Dict, List, Optional, Tuple, Union
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import UNet2DConditionLoadersMixin
diff --git a/src/diffusers/models/unets/unet_i2vgen_xl.py b/src/diffusers/models/unets/unet_i2vgen_xl.py
index 7148723a84..3dba8edca7 100644
--- a/src/diffusers/models/unets/unet_i2vgen_xl.py
+++ b/src/diffusers/models/unets/unet_i2vgen_xl.py
@@ -16,7 +16,6 @@ from typing import Any, Dict, Optional, Tuple, Union
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import UNet2DConditionLoadersMixin
diff --git a/src/diffusers/models/unets/unet_kandinsky3.py b/src/diffusers/models/unets/unet_kandinsky3.py
index 423669a22f..27241ce2e6 100644
--- a/src/diffusers/models/unets/unet_kandinsky3.py
+++ b/src/diffusers/models/unets/unet_kandinsky3.py
@@ -16,7 +16,6 @@ from dataclasses import dataclass
from typing import Dict, Tuple, Union
import torch
-import torch.utils.checkpoint
from torch import nn
from ...configuration_utils import ConfigMixin, register_to_config
diff --git a/src/diffusers/models/unets/unet_motion_model.py b/src/diffusers/models/unets/unet_motion_model.py
index 26616e53bd..18d5eb917f 100644
--- a/src/diffusers/models/unets/unet_motion_model.py
+++ b/src/diffusers/models/unets/unet_motion_model.py
@@ -18,7 +18,6 @@ from typing import Any, Dict, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, FrozenDict, register_to_config
from ...loaders import FromOriginalModelMixin, PeftAdapterMixin, UNet2DConditionLoadersMixin
diff --git a/src/diffusers/modular_pipelines/__init__.py b/src/diffusers/modular_pipelines/__init__.py
index 65c22b349b..86ed735134 100644
--- a/src/diffusers/modular_pipelines/__init__.py
+++ b/src/diffusers/modular_pipelines/__init__.py
@@ -46,12 +46,19 @@ else:
]
_import_structure["stable_diffusion_xl"] = ["StableDiffusionXLAutoBlocks", "StableDiffusionXLModularPipeline"]
_import_structure["wan"] = ["WanAutoBlocks", "WanModularPipeline"]
- _import_structure["flux"] = ["FluxAutoBlocks", "FluxModularPipeline"]
+ _import_structure["flux"] = [
+ "FluxAutoBlocks",
+ "FluxModularPipeline",
+ "FluxKontextAutoBlocks",
+ "FluxKontextModularPipeline",
+ ]
_import_structure["qwenimage"] = [
"QwenImageAutoBlocks",
"QwenImageModularPipeline",
"QwenImageEditModularPipeline",
"QwenImageEditAutoBlocks",
+ "QwenImageEditPlusModularPipeline",
+ "QwenImageEditPlusAutoBlocks",
]
_import_structure["components_manager"] = ["ComponentsManager"]
@@ -63,7 +70,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from ..utils.dummy_pt_objects import * # noqa F403
else:
from .components_manager import ComponentsManager
- from .flux import FluxAutoBlocks, FluxModularPipeline
+ from .flux import FluxAutoBlocks, FluxKontextAutoBlocks, FluxKontextModularPipeline, FluxModularPipeline
from .modular_pipeline import (
AutoPipelineBlocks,
BlockState,
@@ -78,6 +85,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
QwenImageAutoBlocks,
QwenImageEditAutoBlocks,
QwenImageEditModularPipeline,
+ QwenImageEditPlusAutoBlocks,
+ QwenImageEditPlusModularPipeline,
QwenImageModularPipeline,
)
from .stable_diffusion_xl import StableDiffusionXLAutoBlocks, StableDiffusionXLModularPipeline
diff --git a/src/diffusers/modular_pipelines/flux/__init__.py b/src/diffusers/modular_pipelines/flux/__init__.py
index 2891edf790..ec00986611 100644
--- a/src/diffusers/modular_pipelines/flux/__init__.py
+++ b/src/diffusers/modular_pipelines/flux/__init__.py
@@ -25,14 +25,18 @@ else:
_import_structure["modular_blocks"] = [
"ALL_BLOCKS",
"AUTO_BLOCKS",
+ "AUTO_BLOCKS_KONTEXT",
+ "FLUX_KONTEXT_BLOCKS",
"TEXT2IMAGE_BLOCKS",
"FluxAutoBeforeDenoiseStep",
"FluxAutoBlocks",
- "FluxAutoBlocks",
"FluxAutoDecodeStep",
"FluxAutoDenoiseStep",
+ "FluxKontextAutoBlocks",
+ "FluxKontextAutoDenoiseStep",
+ "FluxKontextBeforeDenoiseStep",
]
- _import_structure["modular_pipeline"] = ["FluxModularPipeline"]
+ _import_structure["modular_pipeline"] = ["FluxKontextModularPipeline", "FluxModularPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
@@ -45,13 +49,18 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .modular_blocks import (
ALL_BLOCKS,
AUTO_BLOCKS,
+ AUTO_BLOCKS_KONTEXT,
+ FLUX_KONTEXT_BLOCKS,
TEXT2IMAGE_BLOCKS,
FluxAutoBeforeDenoiseStep,
FluxAutoBlocks,
FluxAutoDecodeStep,
FluxAutoDenoiseStep,
+ FluxKontextAutoBlocks,
+ FluxKontextAutoDenoiseStep,
+ FluxKontextBeforeDenoiseStep,
)
- from .modular_pipeline import FluxModularPipeline
+ from .modular_pipeline import FluxKontextModularPipeline, FluxModularPipeline
else:
import sys
diff --git a/src/diffusers/modular_pipelines/flux/before_denoise.py b/src/diffusers/modular_pipelines/flux/before_denoise.py
index 1b6e16a940..dcb0ba0e56 100644
--- a/src/diffusers/modular_pipelines/flux/before_denoise.py
+++ b/src/diffusers/modular_pipelines/flux/before_denoise.py
@@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-from typing import Any, List, Optional, Tuple
+from typing import List, Optional
import numpy as np
import torch
-from ...models import AutoencoderKL
-from ...pipelines.pipeline_utils import calculate_shift, retrieve_latents, retrieve_timesteps
+from ...pipelines import FluxPipeline
+from ...pipelines.pipeline_utils import calculate_shift, retrieve_timesteps
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import logging
from ...utils.torch_utils import randn_tensor
@@ -30,85 +30,6 @@ from .modular_pipeline import FluxModularPipeline
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
-# Adapted from the original implementation.
-def prepare_latents_img2img(
- vae, scheduler, image, timestep, batch_size, num_channels_latents, height, width, dtype, device, generator
-):
- if isinstance(generator, list) and len(generator) != batch_size:
- raise ValueError(
- f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
- f" size of {batch_size}. Make sure the batch size matches the length of the generators."
- )
-
- vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
- latent_channels = vae.config.latent_channels
-
- # VAE applies 8x compression on images but we must also account for packing which requires
- # latent height and width to be divisible by 2.
- height = 2 * (int(height) // (vae_scale_factor * 2))
- width = 2 * (int(width) // (vae_scale_factor * 2))
- shape = (batch_size, num_channels_latents, height, width)
- latent_image_ids = _prepare_latent_image_ids(batch_size, height // 2, width // 2, device, dtype)
-
- image = image.to(device=device, dtype=dtype)
- if image.shape[1] != latent_channels:
- image_latents = _encode_vae_image(image=image, generator=generator)
- else:
- image_latents = image
- if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
- # expand init_latents for batch_size
- additional_image_per_prompt = batch_size // image_latents.shape[0]
- image_latents = torch.cat([image_latents] * additional_image_per_prompt, dim=0)
- elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
- raise ValueError(
- f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
- )
- else:
- image_latents = torch.cat([image_latents], dim=0)
-
- noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
- latents = scheduler.scale_noise(image_latents, timestep, noise)
- latents = _pack_latents(latents, batch_size, num_channels_latents, height, width)
- return latents, latent_image_ids
-
-
-def _pack_latents(latents, batch_size, num_channels_latents, height, width):
- latents = latents.view(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
- latents = latents.permute(0, 2, 4, 1, 3, 5)
- latents = latents.reshape(batch_size, (height // 2) * (width // 2), num_channels_latents * 4)
-
- return latents
-
-
-def _prepare_latent_image_ids(batch_size, height, width, device, dtype):
- latent_image_ids = torch.zeros(height, width, 3)
- latent_image_ids[..., 1] = latent_image_ids[..., 1] + torch.arange(height)[:, None]
- latent_image_ids[..., 2] = latent_image_ids[..., 2] + torch.arange(width)[None, :]
-
- latent_image_id_height, latent_image_id_width, latent_image_id_channels = latent_image_ids.shape
-
- latent_image_ids = latent_image_ids.reshape(
- latent_image_id_height * latent_image_id_width, latent_image_id_channels
- )
-
- return latent_image_ids.to(device=device, dtype=dtype)
-
-
-# Cannot use "# Copied from" because it introduces weird indentation errors.
-def _encode_vae_image(vae, image: torch.Tensor, generator: torch.Generator):
- if isinstance(generator, list):
- image_latents = [
- retrieve_latents(vae.encode(image[i : i + 1]), generator=generator[i]) for i in range(image.shape[0])
- ]
- image_latents = torch.cat(image_latents, dim=0)
- else:
- image_latents = retrieve_latents(vae.encode(image), generator=generator)
-
- image_latents = (image_latents - vae.config.shift_factor) * vae.config.scaling_factor
-
- return image_latents
-
-
def _get_initial_timesteps_and_optionals(
transformer,
scheduler,
@@ -143,92 +64,6 @@ def _get_initial_timesteps_and_optionals(
return timesteps, num_inference_steps, sigmas, guidance
-class FluxInputStep(ModularPipelineBlocks):
- model_name = "flux"
-
- @property
- def description(self) -> str:
- return (
- "Input processing step that:\n"
- " 1. Determines `batch_size` and `dtype` based on `prompt_embeds`\n"
- " 2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_images_per_prompt`\n\n"
- "All input tensors are expected to have either batch_size=1 or match the batch_size\n"
- "of prompt_embeds. The tensors will be duplicated across the batch dimension to\n"
- "have a final batch_size of batch_size * num_images_per_prompt."
- )
-
- @property
- def inputs(self) -> List[InputParam]:
- return [
- InputParam("num_images_per_prompt", default=1),
- InputParam(
- "prompt_embeds",
- required=True,
- type_hint=torch.Tensor,
- description="Pre-generated text embeddings. Can be generated from text_encoder step.",
- ),
- InputParam(
- "pooled_prompt_embeds",
- type_hint=torch.Tensor,
- description="Pre-generated pooled text embeddings. Can be generated from text_encoder step.",
- ),
- # TODO: support negative embeddings?
- ]
-
- @property
- def intermediate_outputs(self) -> List[str]:
- return [
- OutputParam(
- "batch_size",
- type_hint=int,
- description="Number of prompts, the final batch size of model inputs should be batch_size * num_images_per_prompt",
- ),
- OutputParam(
- "dtype",
- type_hint=torch.dtype,
- description="Data type of model tensor inputs (determined by `prompt_embeds`)",
- ),
- OutputParam(
- "prompt_embeds",
- type_hint=torch.Tensor,
- description="text embeddings used to guide the image generation",
- ),
- OutputParam(
- "pooled_prompt_embeds",
- type_hint=torch.Tensor,
- description="pooled text embeddings used to guide the image generation",
- ),
- # TODO: support negative embeddings?
- ]
-
- def check_inputs(self, components, block_state):
- if block_state.prompt_embeds is not None and block_state.pooled_prompt_embeds is not None:
- if block_state.prompt_embeds.shape[0] != block_state.pooled_prompt_embeds.shape[0]:
- raise ValueError(
- "`prompt_embeds` and `pooled_prompt_embeds` must have the same batch size when passed directly, but"
- f" got: `prompt_embeds` {block_state.prompt_embeds.shape} != `pooled_prompt_embeds`"
- f" {block_state.pooled_prompt_embeds.shape}."
- )
-
- @torch.no_grad()
- def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
- # TODO: consider adding negative embeddings?
- block_state = self.get_block_state(state)
- self.check_inputs(components, block_state)
-
- block_state.batch_size = block_state.prompt_embeds.shape[0]
- block_state.dtype = block_state.prompt_embeds.dtype
-
- _, seq_len, _ = block_state.prompt_embeds.shape
- block_state.prompt_embeds = block_state.prompt_embeds.repeat(1, block_state.num_images_per_prompt, 1)
- block_state.prompt_embeds = block_state.prompt_embeds.view(
- block_state.batch_size * block_state.num_images_per_prompt, seq_len, -1
- )
- self.set_block_state(state, block_state)
-
- return components, state
-
-
class FluxSetTimestepsStep(ModularPipelineBlocks):
model_name = "flux"
@@ -297,6 +132,10 @@ class FluxSetTimestepsStep(ModularPipelineBlocks):
block_state.sigmas = sigmas
block_state.guidance = guidance
+ # We set the index here to remove DtoH sync, helpful especially during compilation.
+ # Check out more details here: https://github.com/huggingface/diffusers/pull/11696
+ components.scheduler.set_begin_index(0)
+
self.set_block_state(state, block_state)
return components, state
@@ -340,11 +179,6 @@ class FluxImg2ImgSetTimestepsStep(ModularPipelineBlocks):
type_hint=int,
description="The number of denoising steps to perform at inference time",
),
- OutputParam(
- "latent_timestep",
- type_hint=torch.Tensor,
- description="The timestep that represents the initial noise level for image-to-image generation",
- ),
OutputParam("guidance", type_hint=torch.Tensor, description="Optional guidance to be used."),
]
@@ -392,8 +226,6 @@ class FluxImg2ImgSetTimestepsStep(ModularPipelineBlocks):
block_state.sigmas = sigmas
block_state.guidance = guidance
- block_state.latent_timestep = timesteps[:1].repeat(batch_size)
-
self.set_block_state(state, block_state)
return components, state
@@ -432,11 +264,6 @@ class FluxPrepareLatentsStep(ModularPipelineBlocks):
OutputParam(
"latents", type_hint=torch.Tensor, description="The initial latents to use for the denoising process"
),
- OutputParam(
- "latent_image_ids",
- type_hint=torch.Tensor,
- description="IDs computed from the image sequence needed for RoPE",
- ),
]
@staticmethod
@@ -460,20 +287,13 @@ class FluxPrepareLatentsStep(ModularPipelineBlocks):
generator,
latents=None,
):
- # Couldn't use the `prepare_latents` method directly from Flux because I decided to copy over
- # the packing methods here. So, for example, `comp._pack_latents()` won't work if we were
- # to go with the "# Copied from ..." approach. Or maybe there's a way?
-
- # VAE applies 8x compression on images but we must also account for packing which requires
- # latent height and width to be divisible by 2.
height = 2 * (int(height) // (comp.vae_scale_factor * 2))
width = 2 * (int(width) // (comp.vae_scale_factor * 2))
shape = (batch_size, num_channels_latents, height, width)
if latents is not None:
- latent_image_ids = _prepare_latent_image_ids(batch_size, height // 2, width // 2, device, dtype)
- return latents.to(device=device, dtype=dtype), latent_image_ids
+ return latents.to(device=device, dtype=dtype)
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
@@ -481,26 +301,23 @@ class FluxPrepareLatentsStep(ModularPipelineBlocks):
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
+ # TODO: move packing latents code to a patchifier similar to Qwen
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
- latents = _pack_latents(latents, batch_size, num_channels_latents, height, width)
+ latents = FluxPipeline._pack_latents(latents, batch_size, num_channels_latents, height, width)
- latent_image_ids = _prepare_latent_image_ids(batch_size, height // 2, width // 2, device, dtype)
-
- return latents, latent_image_ids
+ return latents
@torch.no_grad()
def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
-
block_state.height = block_state.height or components.default_height
block_state.width = block_state.width or components.default_width
block_state.device = components._execution_device
- block_state.dtype = torch.bfloat16 # TODO: okay to hardcode this?
block_state.num_channels_latents = components.num_channels_latents
self.check_inputs(components, block_state)
batch_size = block_state.batch_size * block_state.num_images_per_prompt
- block_state.latents, block_state.latent_image_ids = self.prepare_latents(
+ block_state.latents = self.prepare_latents(
components,
batch_size,
block_state.num_channels_latents,
@@ -520,82 +337,194 @@ class FluxPrepareLatentsStep(ModularPipelineBlocks):
class FluxImg2ImgPrepareLatentsStep(ModularPipelineBlocks):
model_name = "flux"
- @property
- def expected_components(self) -> List[ComponentSpec]:
- return [ComponentSpec("vae", AutoencoderKL), ComponentSpec("scheduler", FlowMatchEulerDiscreteScheduler)]
-
@property
def description(self) -> str:
- return "Step that prepares the latents for the image-to-image generation process"
+ return "Step that adds noise to image latents for image-to-image. Should be run after `set_timesteps`,"
+ " `prepare_latents`. Both noise and image latents should already be patchified."
@property
- def inputs(self) -> List[Tuple[str, Any]]:
+ def expected_components(self) -> List[ComponentSpec]:
+ return [ComponentSpec("scheduler", FlowMatchEulerDiscreteScheduler)]
+
+ @property
+ def inputs(self) -> List[InputParam]:
return [
- InputParam("height", type_hint=int),
- InputParam("width", type_hint=int),
- InputParam("latents", type_hint=Optional[torch.Tensor]),
- InputParam("num_images_per_prompt", type_hint=int, default=1),
- InputParam("generator"),
InputParam(
- "image_latents",
+ name="latents",
required=True,
type_hint=torch.Tensor,
- description="The latents representing the reference image for image-to-image/inpainting generation. Can be generated in vae_encode step.",
+ description="The initial random noised, can be generated in prepare latent step.",
),
InputParam(
- "latent_timestep",
+ name="image_latents",
required=True,
type_hint=torch.Tensor,
- description="The timestep that represents the initial noise level for image-to-image/inpainting generation. Can be generated in set_timesteps step.",
+ description="The image latents to use for the denoising process. Can be generated in vae encoder and packed in input step.",
),
InputParam(
- "batch_size",
+ name="timesteps",
required=True,
- type_hint=int,
- description="Number of prompts, the final batch size of model inputs should be batch_size * num_images_per_prompt. Can be generated in input step.",
+ type_hint=torch.Tensor,
+ description="The timesteps to use for the denoising process. Can be generated in set_timesteps step.",
),
- InputParam("dtype", required=True, type_hint=torch.dtype, description="The dtype of the model inputs"),
]
@property
def intermediate_outputs(self) -> List[OutputParam]:
return [
OutputParam(
- "latents", type_hint=torch.Tensor, description="The initial latents to use for the denoising process"
- ),
- OutputParam(
- "latent_image_ids",
+ name="initial_noise",
type_hint=torch.Tensor,
- description="IDs computed from the image sequence needed for RoPE",
+ description="The initial random noised used for inpainting denoising.",
),
]
+ @staticmethod
+ def check_inputs(image_latents, latents):
+ if image_latents.shape[0] != latents.shape[0]:
+ raise ValueError(
+ f"`image_latents` must have have same batch size as `latents`, but got {image_latents.shape[0]} and {latents.shape[0]}"
+ )
+
+ if image_latents.ndim != 3:
+ raise ValueError(f"`image_latents` must have 3 dimensions (patchified), but got {image_latents.ndim}")
+
@torch.no_grad()
def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
- block_state.device = components._execution_device
- block_state.dtype = torch.bfloat16 # TODO: okay to hardcode this?
- block_state.num_channels_latents = components.num_channels_latents
- block_state.dtype = block_state.dtype if block_state.dtype is not None else components.vae.dtype
- block_state.device = components._execution_device
+ self.check_inputs(image_latents=block_state.image_latents, latents=block_state.latents)
- # TODO: implement `check_inputs`
- batch_size = block_state.batch_size * block_state.num_images_per_prompt
- if block_state.latents is None:
- block_state.latents, block_state.latent_image_ids = prepare_latents_img2img(
- components.vae,
- components.scheduler,
- block_state.image_latents,
- block_state.latent_timestep,
- batch_size,
- block_state.num_channels_latents,
- block_state.height,
- block_state.width,
- block_state.dtype,
- block_state.device,
- block_state.generator,
- )
+ # prepare latent timestep
+ latent_timestep = block_state.timesteps[:1].repeat(block_state.latents.shape[0])
+
+        # keep a reference to the initial noise (used later, e.g. for inpainting)
+ block_state.initial_noise = block_state.latents
+
+ # scale noise
+ block_state.latents = components.scheduler.scale_noise(
+ block_state.image_latents, latent_timestep, block_state.latents
+ )
+
+ self.set_block_state(state, block_state)
+
+ return components, state
+
+
+class FluxRoPEInputsStep(ModularPipelineBlocks):
+ model_name = "flux"
+
+ @property
+ def description(self) -> str:
+ return "Step that prepares the RoPE inputs for the denoising process. Should be placed after text encoder and latent preparation steps."
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ return [
+ InputParam(name="height", required=True),
+ InputParam(name="width", required=True),
+ InputParam(name="prompt_embeds"),
+ ]
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return [
+ OutputParam(
+ name="txt_ids",
+ kwargs_type="denoiser_input_fields",
+                type_hint=torch.Tensor,
+                description="Position IDs for the text tokens (all zeros), used for RoPE calculation.",
+ ),
+ OutputParam(
+ name="img_ids",
+ kwargs_type="denoiser_input_fields",
+                type_hint=torch.Tensor,
+                description="Position IDs for the image latent tokens, used for RoPE calculation.",
+ ),
+ ]
+
+ def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
+ block_state = self.get_block_state(state)
+
+ prompt_embeds = block_state.prompt_embeds
+ device, dtype = prompt_embeds.device, prompt_embeds.dtype
+ block_state.txt_ids = torch.zeros(prompt_embeds.shape[1], 3).to(
+ device=prompt_embeds.device, dtype=prompt_embeds.dtype
+ )
+
+ height = 2 * (int(block_state.height) // (components.vae_scale_factor * 2))
+ width = 2 * (int(block_state.width) // (components.vae_scale_factor * 2))
+ block_state.img_ids = FluxPipeline._prepare_latent_image_ids(None, height // 2, width // 2, device, dtype)
+
+ self.set_block_state(state, block_state)
+
+ return components, state
+
+
+class FluxKontextRoPEInputsStep(ModularPipelineBlocks):
+ model_name = "flux-kontext"
+
+ @property
+ def description(self) -> str:
+ return "Step that prepares the RoPE inputs for the denoising process of Flux Kontext. Should be placed after text encoder and latent preparation steps."
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ return [
+ InputParam(name="image_height"),
+ InputParam(name="image_width"),
+ InputParam(name="height"),
+ InputParam(name="width"),
+ InputParam(name="prompt_embeds"),
+ ]
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return [
+ OutputParam(
+ name="txt_ids",
+ kwargs_type="denoiser_input_fields",
+                type_hint=torch.Tensor,
+                description="Position IDs for the text tokens (all zeros), used for RoPE calculation.",
+ ),
+ OutputParam(
+ name="img_ids",
+ kwargs_type="denoiser_input_fields",
+                type_hint=torch.Tensor,
+                description="Position IDs for the latent (and reference image) tokens, used for RoPE calculation.",
+ ),
+ ]
+
+ def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
+ block_state = self.get_block_state(state)
+
+ prompt_embeds = block_state.prompt_embeds
+ device, dtype = prompt_embeds.device, prompt_embeds.dtype
+ block_state.txt_ids = torch.zeros(prompt_embeds.shape[1], 3).to(
+ device=prompt_embeds.device, dtype=prompt_embeds.dtype
+ )
+
+ img_ids = None
+ if (
+ getattr(block_state, "image_height", None) is not None
+ and getattr(block_state, "image_width", None) is not None
+ ):
+ image_latent_height = 2 * (int(block_state.image_height) // (components.vae_scale_factor * 2))
+            image_latent_width = 2 * (int(block_state.image_width) // (components.vae_scale_factor * 2))
+ img_ids = FluxPipeline._prepare_latent_image_ids(
+ None, image_latent_height // 2, image_latent_width // 2, device, dtype
+ )
+ # image ids are the same as latent ids with the first dimension set to 1 instead of 0
+ img_ids[..., 0] = 1
+
+ height = 2 * (int(block_state.height) // (components.vae_scale_factor * 2))
+ width = 2 * (int(block_state.width) // (components.vae_scale_factor * 2))
+ latent_ids = FluxPipeline._prepare_latent_image_ids(None, height // 2, width // 2, device, dtype)
+
+ if img_ids is not None:
+ latent_ids = torch.cat([latent_ids, img_ids], dim=0)
+
+ block_state.img_ids = latent_ids
self.set_block_state(state, block_state)
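
For reference, the RoPE input steps above replace the old `text_ids`/`latent_image_ids` intermediates. The sketch below is a rough, self-contained illustration of the ids they produce, under the assumption that `FluxPipeline._prepare_latent_image_ids` builds a flat `(rows * cols, 3)` grid of (frame, row, col) indices as in the standard Flux pipeline.

```py
import torch

def make_latent_image_ids(rows: int, cols: int, device, dtype):
    # Illustrative re-implementation (assumption): column 0 is a frame index (0 for the
    # latents being denoised, 1 for the Kontext reference image), columns 1 and 2 are the
    # row/column indices of each packed 2x2 latent patch.
    ids = torch.zeros(rows, cols, 3)
    ids[..., 1] += torch.arange(rows)[:, None]
    ids[..., 2] += torch.arange(cols)[None, :]
    return ids.reshape(rows * cols, 3).to(device=device, dtype=dtype)

# Text tokens get all-zero position ids of shape (seq_len, 3).
txt_ids = torch.zeros(512, 3)

# A 1024x1024 output -> 128x128 latents -> 64x64 packed patches, so img_ids is (4096, 3).
img_ids = make_latent_image_ids(64, 64, device="cpu", dtype=torch.float32)
print(txt_ids.shape, img_ids.shape)  # torch.Size([512, 3]) torch.Size([4096, 3])
```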
diff --git a/src/diffusers/modular_pipelines/flux/denoise.py b/src/diffusers/modular_pipelines/flux/denoise.py
index ffa0a4456f..b1796bb63c 100644
--- a/src/diffusers/modular_pipelines/flux/denoise.py
+++ b/src/diffusers/modular_pipelines/flux/denoise.py
@@ -76,18 +76,17 @@ class FluxLoopDenoiser(ModularPipelineBlocks):
description="Pooled prompt embeddings",
),
InputParam(
- "text_ids",
+ "txt_ids",
required=True,
type_hint=torch.Tensor,
description="IDs computed from text sequence needed for RoPE",
),
InputParam(
- "latent_image_ids",
+ "img_ids",
required=True,
type_hint=torch.Tensor,
description="IDs computed from image sequence needed for RoPE",
),
- # TODO: guidance
]
@torch.no_grad()
@@ -101,8 +100,8 @@ class FluxLoopDenoiser(ModularPipelineBlocks):
encoder_hidden_states=block_state.prompt_embeds,
pooled_projections=block_state.pooled_prompt_embeds,
joint_attention_kwargs=block_state.joint_attention_kwargs,
- txt_ids=block_state.text_ids,
- img_ids=block_state.latent_image_ids,
+ txt_ids=block_state.txt_ids,
+ img_ids=block_state.img_ids,
return_dict=False,
)[0]
block_state.noise_pred = noise_pred
@@ -110,6 +109,96 @@ class FluxLoopDenoiser(ModularPipelineBlocks):
return components, block_state
+class FluxKontextLoopDenoiser(ModularPipelineBlocks):
+ model_name = "flux-kontext"
+
+ @property
+ def expected_components(self) -> List[ComponentSpec]:
+ return [ComponentSpec("transformer", FluxTransformer2DModel)]
+
+ @property
+ def description(self) -> str:
+ return (
+ "Step within the denoising loop that denoise the latents for Flux Kontext. "
+ "This block should be used to compose the `sub_blocks` attribute of a `LoopSequentialPipelineBlocks` "
+ "object (e.g. `FluxDenoiseLoopWrapper`)"
+ )
+
+ @property
+ def inputs(self) -> List[Tuple[str, Any]]:
+ return [
+ InputParam("joint_attention_kwargs"),
+ InputParam(
+ "latents",
+ required=True,
+ type_hint=torch.Tensor,
+ description="The initial latents to use for the denoising process. Can be generated in prepare_latent step.",
+ ),
+ InputParam(
+ "image_latents",
+ type_hint=torch.Tensor,
+ description="Image latents to use for the denoising process. Can be generated in prepare_latent step.",
+ ),
+ InputParam(
+ "guidance",
+ required=True,
+ type_hint=torch.Tensor,
+ description="Guidance scale as a tensor",
+ ),
+ InputParam(
+ "prompt_embeds",
+ required=True,
+ type_hint=torch.Tensor,
+ description="Prompt embeddings",
+ ),
+ InputParam(
+ "pooled_prompt_embeds",
+ required=True,
+ type_hint=torch.Tensor,
+ description="Pooled prompt embeddings",
+ ),
+ InputParam(
+ "txt_ids",
+ required=True,
+ type_hint=torch.Tensor,
+ description="IDs computed from text sequence needed for RoPE",
+ ),
+ InputParam(
+ "img_ids",
+ required=True,
+ type_hint=torch.Tensor,
+ description="IDs computed from latent sequence needed for RoPE",
+ ),
+ ]
+
+ @torch.no_grad()
+ def __call__(
+ self, components: FluxModularPipeline, block_state: BlockState, i: int, t: torch.Tensor
+ ) -> PipelineState:
+ latents = block_state.latents
+ latent_model_input = latents
+ image_latents = block_state.image_latents
+ if image_latents is not None:
+ latent_model_input = torch.cat([latent_model_input, image_latents], dim=1)
+
+ timestep = t.expand(latents.shape[0]).to(latents.dtype)
+ noise_pred = components.transformer(
+ hidden_states=latent_model_input,
+ timestep=timestep / 1000,
+ guidance=block_state.guidance,
+ encoder_hidden_states=block_state.prompt_embeds,
+ pooled_projections=block_state.pooled_prompt_embeds,
+ joint_attention_kwargs=block_state.joint_attention_kwargs,
+ txt_ids=block_state.txt_ids,
+ img_ids=block_state.img_ids,
+ return_dict=False,
+ )[0]
+ noise_pred = noise_pred[:, : latents.size(1)]
+ block_state.noise_pred = noise_pred
+
+ return components, block_state
+
+
class FluxLoopAfterDenoiser(ModularPipelineBlocks):
model_name = "flux"
@@ -195,9 +284,6 @@ class FluxDenoiseLoopWrapper(LoopSequentialPipelineBlocks):
block_state.num_warmup_steps = max(
len(block_state.timesteps) - block_state.num_inference_steps * components.scheduler.order, 0
)
- # We set the index here to remove DtoH sync, helpful especially during compilation.
- # Check out more details here: https://github.com/huggingface/diffusers/pull/11696
- components.scheduler.set_begin_index(0)
with self.progress_bar(total=block_state.num_inference_steps) as progress_bar:
for i, t in enumerate(block_state.timesteps):
components, block_state = self.loop_step(components, block_state, i=i, t=t)
@@ -225,3 +311,20 @@ class FluxDenoiseStep(FluxDenoiseLoopWrapper):
" - `FluxLoopAfterDenoiser`\n"
"This block supports both text2image and img2img tasks."
)
+
+
+class FluxKontextDenoiseStep(FluxDenoiseLoopWrapper):
+ model_name = "flux-kontext"
+ block_classes = [FluxKontextLoopDenoiser, FluxLoopAfterDenoiser]
+ block_names = ["denoiser", "after_denoiser"]
+
+ @property
+ def description(self) -> str:
+ return (
+ "Denoise step that iteratively denoise the latents. \n"
+ "Its loop logic is defined in `FluxDenoiseLoopWrapper.__call__` method \n"
+ "At each iteration, it runs blocks defined in `sub_blocks` sequentially:\n"
+ " - `FluxKontextLoopDenoiser`\n"
+ " - `FluxLoopAfterDenoiser`\n"
+ "This block supports both text2image and img2img tasks."
+ )
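
The Kontext denoiser differs from the plain Flux loop only in how it feeds the reference image: the packed image latents are appended along the sequence dimension and the extra predictions are dropped afterwards. A minimal shape sketch, with random tensors standing in for the transformer and hypothetical token/channel counts:

```py
import torch

# Hypothetical sizes: 4096 noise tokens and 4096 reference-image tokens (a 1024x1024 pair),
# 64 packed latent channels. Real values come from the transformer/VAE configs.
latents = torch.randn(1, 4096, 64)        # tokens being denoised
image_latents = torch.randn(1, 4096, 64)  # packed Kontext reference-image latents

# Conditioning tokens are concatenated along the sequence dimension before the transformer call.
latent_model_input = torch.cat([latents, image_latents], dim=1)  # (1, 8192, 64)

# The model predicts one vector per token; only the first latents.size(1) tokens are kept.
noise_pred = torch.randn_like(latent_model_input)  # stand-in for the model output
noise_pred = noise_pred[:, : latents.size(1)]      # (1, 4096, 64)
print(latent_model_input.shape, noise_pred.shape)
```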
diff --git a/src/diffusers/modular_pipelines/flux/encoders.py b/src/diffusers/modular_pipelines/flux/encoders.py
index 43fd419108..9b9f53fc00 100644
--- a/src/diffusers/modular_pipelines/flux/encoders.py
+++ b/src/diffusers/modular_pipelines/flux/encoders.py
@@ -20,13 +20,13 @@ import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast
from ...configuration_utils import FrozenDict
-from ...image_processor import VaeImageProcessor
+from ...image_processor import VaeImageProcessor, is_valid_image, is_valid_image_imagelist
from ...loaders import FluxLoraLoaderMixin, TextualInversionLoaderMixin
from ...models import AutoencoderKL
from ...pipelines.pipeline_utils import retrieve_latents
from ...utils import USE_PEFT_BACKEND, is_ftfy_available, logging, scale_lora_layers, unscale_lora_layers
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
-from ..modular_pipeline_utils import ComponentSpec, ConfigSpec, InputParam, OutputParam
+from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
from .modular_pipeline import FluxModularPipeline
@@ -54,89 +54,219 @@ def prompt_clean(text):
return text
-class FluxVaeEncoderStep(ModularPipelineBlocks):
+def encode_vae_image(vae: AutoencoderKL, image: torch.Tensor, generator: torch.Generator, sample_mode="sample"):
+ if isinstance(generator, list):
+ image_latents = [
+ retrieve_latents(vae.encode(image[i : i + 1]), generator=generator[i], sample_mode=sample_mode)
+ for i in range(image.shape[0])
+ ]
+ image_latents = torch.cat(image_latents, dim=0)
+ else:
+ image_latents = retrieve_latents(vae.encode(image), generator=generator, sample_mode=sample_mode)
+
+ image_latents = (image_latents - vae.config.shift_factor) * vae.config.scaling_factor
+
+ return image_latents
+
+
+class FluxProcessImagesInputStep(ModularPipelineBlocks):
model_name = "flux"
@property
def description(self) -> str:
- return "Vae Encoder step that encode the input image into a latent representation"
+ return "Image Preprocess step."
@property
def expected_components(self) -> List[ComponentSpec]:
return [
- ComponentSpec("vae", AutoencoderKL),
ComponentSpec(
"image_processor",
VaeImageProcessor,
- config=FrozenDict({"vae_scale_factor": 16, "vae_latent_channels": 16}),
+ config=FrozenDict({"vae_scale_factor": 16}),
default_creation_method="from_config",
),
]
@property
def inputs(self) -> List[InputParam]:
+ return [InputParam("resized_image"), InputParam("image"), InputParam("height"), InputParam("width")]
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return [OutputParam(name="processed_image")]
+
+ @staticmethod
+ def check_inputs(height, width, vae_scale_factor):
+ if height is not None and height % (vae_scale_factor * 2) != 0:
+ raise ValueError(f"Height must be divisible by {vae_scale_factor * 2} but is {height}")
+
+ if width is not None and width % (vae_scale_factor * 2) != 0:
+ raise ValueError(f"Width must be divisible by {vae_scale_factor * 2} but is {width}")
+
+ @torch.no_grad()
+ def __call__(self, components: FluxModularPipeline, state: PipelineState):
+ block_state = self.get_block_state(state)
+
+ if block_state.resized_image is None and block_state.image is None:
+ raise ValueError("`resized_image` and `image` cannot be None at the same time")
+
+ if block_state.resized_image is None:
+ image = block_state.image
+ self.check_inputs(
+ height=block_state.height, width=block_state.width, vae_scale_factor=components.vae_scale_factor
+ )
+ height = block_state.height or components.default_height
+ width = block_state.width or components.default_width
+ else:
+ width, height = block_state.resized_image[0].size
+ image = block_state.resized_image
+
+ block_state.processed_image = components.image_processor.preprocess(image=image, height=height, width=width)
+
+ self.set_block_state(state, block_state)
+ return components, state
+
+
+class FluxKontextProcessImagesInputStep(ModularPipelineBlocks):
+ model_name = "flux-kontext"
+
+ def __init__(self, _auto_resize=True):
+ self._auto_resize = _auto_resize
+ super().__init__()
+
+ @property
+ def description(self) -> str:
+ return (
+ "Image preprocess step for Flux Kontext. The preprocessed image goes to the VAE.\n"
+ "Kontext works as a T2I model, too, in case no input image is provided."
+ )
+
+ @property
+ def expected_components(self) -> List[ComponentSpec]:
return [
- InputParam("image", required=True),
- InputParam("height"),
- InputParam("width"),
- InputParam("generator"),
- InputParam("dtype", type_hint=torch.dtype, description="Data type of model tensor inputs"),
- InputParam(
- "preprocess_kwargs",
- type_hint=Optional[dict],
- description="A kwargs dictionary that if specified is passed along to the `ImageProcessor` as defined under `self.image_processor` in [diffusers.image_processor.VaeImageProcessor]",
+ ComponentSpec(
+ "image_processor",
+ VaeImageProcessor,
+ config=FrozenDict({"vae_scale_factor": 16}),
+ default_creation_method="from_config",
),
]
+ @property
+ def inputs(self) -> List[InputParam]:
+ return [InputParam("image")]
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return [OutputParam(name="processed_image")]
+
+ @torch.no_grad()
+ def __call__(self, components: FluxModularPipeline, state: PipelineState):
+ from ...pipelines.flux.pipeline_flux_kontext import PREFERRED_KONTEXT_RESOLUTIONS
+
+ block_state = self.get_block_state(state)
+ images = block_state.image
+
+ if images is None:
+ block_state.processed_image = None
+
+ else:
+ multiple_of = components.image_processor.config.vae_scale_factor
+
+ if not is_valid_image_imagelist(images):
+ raise ValueError(f"Images must be image or list of images but are {type(images)}")
+
+ if is_valid_image(images):
+ images = [images]
+
+ img = images[0]
+ image_height, image_width = components.image_processor.get_default_height_width(img)
+ aspect_ratio = image_width / image_height
+ if self._auto_resize:
+ # Kontext is trained on specific resolutions, using one of them is recommended
+ _, image_width, image_height = min(
+ (abs(aspect_ratio - w / h), w, h) for w, h in PREFERRED_KONTEXT_RESOLUTIONS
+ )
+ image_width = image_width // multiple_of * multiple_of
+ image_height = image_height // multiple_of * multiple_of
+ images = components.image_processor.resize(images, image_height, image_width)
+ block_state.processed_image = components.image_processor.preprocess(images, image_height, image_width)
+
+ self.set_block_state(state, block_state)
+ return components, state
+
+
+class FluxVaeEncoderDynamicStep(ModularPipelineBlocks):
+ model_name = "flux"
+
+ def __init__(
+ self, input_name: str = "processed_image", output_name: str = "image_latents", sample_mode: str = "sample"
+ ):
+ """Initialize a VAE encoder step for converting images to latent representations.
+
+        Both the input and output names are configurable so this block can be configured to process different image
+ inputs (e.g., "processed_image" -> "image_latents", "processed_control_image" -> "control_image_latents").
+
+ Args:
+ input_name (str, optional): Name of the input image tensor. Defaults to "processed_image".
+ Examples: "processed_image" or "processed_control_image"
+ output_name (str, optional): Name of the output latent tensor. Defaults to "image_latents".
+ Examples: "image_latents" or "control_image_latents"
+            sample_mode (str, optional): Sampling mode to use when encoding with the VAE. Defaults to "sample".
+
+ Examples:
+            # Basic usage with default settings:
+            FluxVaeEncoderDynamicStep()
+
+            # Custom input/output names for a control image:
+            FluxVaeEncoderDynamicStep(
+                input_name="processed_control_image", output_name="control_image_latents"
+            )
+ """
+ self._image_input_name = input_name
+ self._image_latents_output_name = output_name
+ self.sample_mode = sample_mode
+ super().__init__()
+
+ @property
+ def description(self) -> str:
+ return f"Dynamic VAE Encoder step that converts {self._image_input_name} into latent representations {self._image_latents_output_name}.\n"
+
+ @property
+ def expected_components(self) -> List[ComponentSpec]:
+ components = [ComponentSpec("vae", AutoencoderKL)]
+ return components
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ inputs = [InputParam(self._image_input_name), InputParam("generator")]
+ return inputs
+
@property
def intermediate_outputs(self) -> List[OutputParam]:
return [
OutputParam(
- "image_latents",
+ self._image_latents_output_name,
type_hint=torch.Tensor,
- description="The latents representing the reference image for image-to-image/inpainting generation",
+ description="The latents representing the reference image",
)
]
- @staticmethod
- # Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3_inpaint.StableDiffusion3InpaintPipeline._encode_vae_image with self.vae->vae
- def _encode_vae_image(vae, image: torch.Tensor, generator: torch.Generator):
- if isinstance(generator, list):
- image_latents = [
- retrieve_latents(vae.encode(image[i : i + 1]), generator=generator[i]) for i in range(image.shape[0])
- ]
- image_latents = torch.cat(image_latents, dim=0)
- else:
- image_latents = retrieve_latents(vae.encode(image), generator=generator)
-
- image_latents = (image_latents - vae.config.shift_factor) * vae.config.scaling_factor
-
- return image_latents
-
@torch.no_grad()
def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
- block_state.preprocess_kwargs = block_state.preprocess_kwargs or {}
- block_state.device = components._execution_device
- block_state.dtype = block_state.dtype if block_state.dtype is not None else components.vae.dtype
+ image = getattr(block_state, self._image_input_name)
- block_state.image = components.image_processor.preprocess(
- block_state.image, height=block_state.height, width=block_state.width, **block_state.preprocess_kwargs
- )
- block_state.image = block_state.image.to(device=block_state.device, dtype=block_state.dtype)
+ if image is None:
+ setattr(block_state, self._image_latents_output_name, None)
+ else:
+ device = components._execution_device
+ dtype = components.vae.dtype
+ image = image.to(device=device, dtype=dtype)
- block_state.batch_size = block_state.image.shape[0]
-
- # if generator is a list, make sure the length of it matches the length of images (both should be batch_size)
- if isinstance(block_state.generator, list) and len(block_state.generator) != block_state.batch_size:
- raise ValueError(
- f"You have passed a list of generators of length {len(block_state.generator)}, but requested an effective batch"
- f" size of {block_state.batch_size}. Make sure the batch size matches the length of the generators."
+ # Encode image into latents
+ image_latents = encode_vae_image(
+ image=image, vae=components.vae, generator=block_state.generator, sample_mode=self.sample_mode
)
-
- block_state.image_latents = self._encode_vae_image(
- components.vae, image=block_state.image, generator=block_state.generator
- )
+ setattr(block_state, self._image_latents_output_name, image_latents)
self.set_block_state(state, block_state)
@@ -148,7 +278,7 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
@property
def description(self) -> str:
- return "Text Encoder step that generate text_embeddings to guide the video generation"
+ return "Text Encoder step that generate text_embeddings to guide the image generation"
@property
def expected_components(self) -> List[ComponentSpec]:
@@ -159,15 +289,12 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
ComponentSpec("tokenizer_2", T5TokenizerFast),
]
- @property
- def expected_configs(self) -> List[ConfigSpec]:
- return []
-
@property
def inputs(self) -> List[InputParam]:
return [
InputParam("prompt"),
InputParam("prompt_2"),
+ InputParam("max_sequence_length", type_hint=int, default=512, required=False),
InputParam("joint_attention_kwargs"),
]
@@ -176,19 +303,16 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
return [
OutputParam(
"prompt_embeds",
+ kwargs_type="denoiser_input_fields",
type_hint=torch.Tensor,
description="text embeddings used to guide the image generation",
),
OutputParam(
"pooled_prompt_embeds",
+ kwargs_type="denoiser_input_fields",
type_hint=torch.Tensor,
description="pooled text embeddings used to guide the image generation",
),
- OutputParam(
- "text_ids",
- type_hint=torch.Tensor,
- description="ids from the text sequence for RoPE",
- ),
]
@staticmethod
@@ -199,16 +323,10 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
@staticmethod
def _get_t5_prompt_embeds(
- components,
- prompt: Union[str, List[str]],
- num_images_per_prompt: int,
- max_sequence_length: int,
- device: torch.device,
+ components, prompt: Union[str, List[str]], max_sequence_length: int, device: torch.device
):
dtype = components.text_encoder_2.dtype
-
prompt = [prompt] if isinstance(prompt, str) else prompt
- batch_size = len(prompt)
if isinstance(components, TextualInversionLoaderMixin):
prompt = components.maybe_convert_prompt(prompt, components.tokenizer_2)
@@ -234,23 +352,11 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
prompt_embeds = components.text_encoder_2(text_input_ids.to(device), output_hidden_states=False)[0]
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
- _, seq_len, _ = prompt_embeds.shape
-
- # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
- prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
- prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
-
return prompt_embeds
@staticmethod
- def _get_clip_prompt_embeds(
- components,
- prompt: Union[str, List[str]],
- num_images_per_prompt: int,
- device: torch.device,
- ):
+ def _get_clip_prompt_embeds(components, prompt: Union[str, List[str]], device: torch.device):
prompt = [prompt] if isinstance(prompt, str) else prompt
- batch_size = len(prompt)
if isinstance(components, TextualInversionLoaderMixin):
prompt = components.maybe_convert_prompt(prompt, components.tokenizer)
@@ -280,10 +386,6 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
prompt_embeds = prompt_embeds.pooler_output
prompt_embeds = prompt_embeds.to(dtype=components.text_encoder.dtype, device=device)
- # duplicate text embeddings for each generation per prompt, using mps friendly method
- prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt)
- prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, -1)
-
return prompt_embeds
@staticmethod
@@ -292,34 +394,11 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
prompt: Union[str, List[str]],
prompt_2: Union[str, List[str]],
device: Optional[torch.device] = None,
- num_images_per_prompt: int = 1,
prompt_embeds: Optional[torch.FloatTensor] = None,
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
max_sequence_length: int = 512,
lora_scale: Optional[float] = None,
):
- r"""
- Encodes the prompt into text encoder hidden states.
-
- Args:
- prompt (`str` or `List[str]`, *optional*):
- prompt to be encoded
- prompt_2 (`str` or `List[str]`, *optional*):
- The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
- used in all text-encoders
- device: (`torch.device`):
- torch device
- num_images_per_prompt (`int`):
- number of images that should be generated per prompt
- prompt_embeds (`torch.FloatTensor`, *optional*):
- Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
- provided, text embeddings will be generated from `prompt` input argument.
- pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
- Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
- If not provided, pooled text embeddings will be generated from `prompt` input argument.
- lora_scale (`float`, *optional*):
- A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
- """
device = device or components._execution_device
# set lora scale so that monkey patched LoRA
@@ -344,12 +423,10 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
components,
prompt=prompt,
device=device,
- num_images_per_prompt=num_images_per_prompt,
)
prompt_embeds = FluxTextEncoderStep._get_t5_prompt_embeds(
components,
prompt=prompt_2,
- num_images_per_prompt=num_images_per_prompt,
max_sequence_length=max_sequence_length,
device=device,
)
@@ -364,10 +441,7 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
# Retrieve the original scale by scaling back the LoRA layers
unscale_lora_layers(components.text_encoder_2, lora_scale)
- dtype = components.text_encoder.dtype if components.text_encoder is not None else torch.bfloat16
- text_ids = torch.zeros(prompt_embeds.shape[1], 3).to(device=device, dtype=dtype)
-
- return prompt_embeds, pooled_prompt_embeds, text_ids
+ return prompt_embeds, pooled_prompt_embeds
@torch.no_grad()
def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
@@ -383,14 +457,14 @@ class FluxTextEncoderStep(ModularPipelineBlocks):
if block_state.joint_attention_kwargs is not None
else None
)
- (block_state.prompt_embeds, block_state.pooled_prompt_embeds, block_state.text_ids) = self.encode_prompt(
+ block_state.prompt_embeds, block_state.pooled_prompt_embeds = self.encode_prompt(
components,
prompt=block_state.prompt,
prompt_2=None,
prompt_embeds=None,
pooled_prompt_embeds=None,
device=block_state.device,
- num_images_per_prompt=1, # TODO: hardcoded for now.
+ max_sequence_length=block_state.max_sequence_length,
lora_scale=block_state.text_encoder_lora_scale,
)
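
With `num_images_per_prompt` removed from the encoder, the per-prompt duplication now happens in `FluxTextInputStep` (added in `inputs.py` below). A small sketch of that expansion with made-up sizes:

```py
import torch

# Hypothetical sizes: 2 prompts, T5 sequence length 512, hidden size 4096, 3 images per prompt.
batch_size, seq_len, dim = 2, 512, 4096
num_images_per_prompt = 3
prompt_embeds = torch.randn(batch_size, seq_len, dim)

# Same mps-friendly expansion as FluxTextInputStep: tile along the sequence axis, then fold
# the copies back into the batch dimension so each prompt appears num_images_per_prompt times.
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
print(prompt_embeds.shape)  # torch.Size([6, 512, 4096])
```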
diff --git a/src/diffusers/modular_pipelines/flux/inputs.py b/src/diffusers/modular_pipelines/flux/inputs.py
new file mode 100644
index 0000000000..e1bc17f5ff
--- /dev/null
+++ b/src/diffusers/modular_pipelines/flux/inputs.py
@@ -0,0 +1,359 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import List
+
+import torch
+
+from ...pipelines import FluxPipeline
+from ...utils import logging
+from ..modular_pipeline import ModularPipelineBlocks, PipelineState
+from ..modular_pipeline_utils import InputParam, OutputParam
+
+# TODO: consider making these common utilities for modular if they are not pipeline-specific.
+from ..qwenimage.inputs import calculate_dimension_from_latents, repeat_tensor_to_batch_size
+from .modular_pipeline import FluxModularPipeline
+
+
+logger = logging.get_logger(__name__)
+
+
+class FluxTextInputStep(ModularPipelineBlocks):
+ model_name = "flux"
+
+ @property
+ def description(self) -> str:
+ return (
+ "Text input processing step that standardizes text embeddings for the pipeline.\n"
+ "This step:\n"
+ " 1. Determines `batch_size` and `dtype` based on `prompt_embeds`\n"
+ " 2. Ensures all text embeddings have consistent batch sizes (batch_size * num_images_per_prompt)"
+ )
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ return [
+ InputParam("num_images_per_prompt", default=1),
+ InputParam(
+ "prompt_embeds",
+ required=True,
+ kwargs_type="denoiser_input_fields",
+ type_hint=torch.Tensor,
+ description="Pre-generated text embeddings. Can be generated from text_encoder step.",
+ ),
+ InputParam(
+ "pooled_prompt_embeds",
+ kwargs_type="denoiser_input_fields",
+ type_hint=torch.Tensor,
+ description="Pre-generated pooled text embeddings. Can be generated from text_encoder step.",
+ ),
+ # TODO: support negative embeddings?
+ ]
+
+ @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+ return [
+ OutputParam(
+ "batch_size",
+ type_hint=int,
+ description="Number of prompts, the final batch size of model inputs should be batch_size * num_images_per_prompt",
+ ),
+ OutputParam(
+ "dtype",
+ type_hint=torch.dtype,
+ description="Data type of model tensor inputs (determined by `prompt_embeds`)",
+ ),
+ OutputParam(
+ "prompt_embeds",
+ type_hint=torch.Tensor,
+ kwargs_type="denoiser_input_fields",
+ description="text embeddings used to guide the image generation",
+ ),
+ OutputParam(
+ "pooled_prompt_embeds",
+ type_hint=torch.Tensor,
+ kwargs_type="denoiser_input_fields",
+ description="pooled text embeddings used to guide the image generation",
+ ),
+ # TODO: support negative embeddings?
+ ]
+
+ def check_inputs(self, components, block_state):
+ if block_state.prompt_embeds is not None and block_state.pooled_prompt_embeds is not None:
+ if block_state.prompt_embeds.shape[0] != block_state.pooled_prompt_embeds.shape[0]:
+ raise ValueError(
+ "`prompt_embeds` and `pooled_prompt_embeds` must have the same batch size when passed directly, but"
+ f" got: `prompt_embeds` {block_state.prompt_embeds.shape} != `pooled_prompt_embeds`"
+ f" {block_state.pooled_prompt_embeds.shape}."
+ )
+
+ @torch.no_grad()
+ def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
+ # TODO: consider adding negative embeddings?
+ block_state = self.get_block_state(state)
+ self.check_inputs(components, block_state)
+
+ block_state.batch_size = block_state.prompt_embeds.shape[0]
+ block_state.dtype = block_state.prompt_embeds.dtype
+
+ _, seq_len, _ = block_state.prompt_embeds.shape
+ block_state.prompt_embeds = block_state.prompt_embeds.repeat(1, block_state.num_images_per_prompt, 1)
+ block_state.prompt_embeds = block_state.prompt_embeds.view(
+ block_state.batch_size * block_state.num_images_per_prompt, seq_len, -1
+ )
+ self.set_block_state(state, block_state)
+
+ return components, state
+
+
+# Adapted from `QwenImageInputsDynamicStep`
+class FluxInputsDynamicStep(ModularPipelineBlocks):
+ model_name = "flux"
+
+ def __init__(
+ self,
+ image_latent_inputs: List[str] = ["image_latents"],
+ additional_batch_inputs: List[str] = [],
+ ):
+ if not isinstance(image_latent_inputs, list):
+ image_latent_inputs = [image_latent_inputs]
+ if not isinstance(additional_batch_inputs, list):
+ additional_batch_inputs = [additional_batch_inputs]
+
+ self._image_latent_inputs = image_latent_inputs
+ self._additional_batch_inputs = additional_batch_inputs
+ super().__init__()
+
+ @property
+ def description(self) -> str:
+ # Functionality section
+ summary_section = (
+ "Input processing step that:\n"
+ " 1. For image latent inputs: Updates height/width if None, patchifies latents, and expands batch size\n"
+ " 2. For additional batch inputs: Expands batch dimensions to match final batch size"
+ )
+
+ # Inputs info
+ inputs_info = ""
+ if self._image_latent_inputs or self._additional_batch_inputs:
+ inputs_info = "\n\nConfigured inputs:"
+ if self._image_latent_inputs:
+ inputs_info += f"\n - Image latent inputs: {self._image_latent_inputs}"
+ if self._additional_batch_inputs:
+ inputs_info += f"\n - Additional batch inputs: {self._additional_batch_inputs}"
+
+ # Placement guidance
+ placement_section = "\n\nThis block should be placed after the encoder steps and the text input step."
+
+ return summary_section + inputs_info + placement_section
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ inputs = [
+ InputParam(name="num_images_per_prompt", default=1),
+ InputParam(name="batch_size", required=True),
+ InputParam(name="height"),
+ InputParam(name="width"),
+ ]
+
+ # Add image latent inputs
+ for image_latent_input_name in self._image_latent_inputs:
+ inputs.append(InputParam(name=image_latent_input_name))
+
+ # Add additional batch inputs
+ for input_name in self._additional_batch_inputs:
+ inputs.append(InputParam(name=input_name))
+
+ return inputs
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return [
+ OutputParam(name="image_height", type_hint=int, description="The height of the image latents"),
+ OutputParam(name="image_width", type_hint=int, description="The width of the image latents"),
+ ]
+
+ def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
+ block_state = self.get_block_state(state)
+
+ # Process image latent inputs (height/width calculation, patchify, and batch expansion)
+ for image_latent_input_name in self._image_latent_inputs:
+ image_latent_tensor = getattr(block_state, image_latent_input_name)
+ if image_latent_tensor is None:
+ continue
+
+ # 1. Calculate height/width from latents
+ height, width = calculate_dimension_from_latents(image_latent_tensor, components.vae_scale_factor)
+ block_state.height = block_state.height or height
+ block_state.width = block_state.width or width
+
+ if not hasattr(block_state, "image_height"):
+ block_state.image_height = height
+ if not hasattr(block_state, "image_width"):
+ block_state.image_width = width
+
+ # 2. Patchify the image latent tensor
+ # TODO: Implement patchifier for Flux.
+ latent_height, latent_width = image_latent_tensor.shape[2:]
+ image_latent_tensor = FluxPipeline._pack_latents(
+ image_latent_tensor, block_state.batch_size, image_latent_tensor.shape[1], latent_height, latent_width
+ )
+
+ # 3. Expand batch size
+ image_latent_tensor = repeat_tensor_to_batch_size(
+ input_name=image_latent_input_name,
+ input_tensor=image_latent_tensor,
+ num_images_per_prompt=block_state.num_images_per_prompt,
+ batch_size=block_state.batch_size,
+ )
+
+ setattr(block_state, image_latent_input_name, image_latent_tensor)
+
+ # Process additional batch inputs (only batch expansion)
+ for input_name in self._additional_batch_inputs:
+ input_tensor = getattr(block_state, input_name)
+ if input_tensor is None:
+ continue
+
+ # Only expand batch size
+ input_tensor = repeat_tensor_to_batch_size(
+ input_name=input_name,
+ input_tensor=input_tensor,
+ num_images_per_prompt=block_state.num_images_per_prompt,
+ batch_size=block_state.batch_size,
+ )
+
+ setattr(block_state, input_name, input_tensor)
+
+ self.set_block_state(state, block_state)
+ return components, state
+
+
+class FluxKontextInputsDynamicStep(FluxInputsDynamicStep):
+ model_name = "flux-kontext"
+
+ def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
+ block_state = self.get_block_state(state)
+
+ # Process image latent inputs (height/width calculation, patchify, and batch expansion)
+ for image_latent_input_name in self._image_latent_inputs:
+ image_latent_tensor = getattr(block_state, image_latent_input_name)
+ if image_latent_tensor is None:
+ continue
+
+ # 1. Calculate height/width from latents
+            # Unlike `FluxInputsDynamicStep`, we don't overwrite `block_state.height` and `block_state.width`
+ height, width = calculate_dimension_from_latents(image_latent_tensor, components.vae_scale_factor)
+ if not hasattr(block_state, "image_height"):
+ block_state.image_height = height
+ if not hasattr(block_state, "image_width"):
+ block_state.image_width = width
+
+ # 2. Patchify the image latent tensor
+ # TODO: Implement patchifier for Flux.
+ latent_height, latent_width = image_latent_tensor.shape[2:]
+ image_latent_tensor = FluxPipeline._pack_latents(
+ image_latent_tensor, block_state.batch_size, image_latent_tensor.shape[1], latent_height, latent_width
+ )
+
+ # 3. Expand batch size
+ image_latent_tensor = repeat_tensor_to_batch_size(
+ input_name=image_latent_input_name,
+ input_tensor=image_latent_tensor,
+ num_images_per_prompt=block_state.num_images_per_prompt,
+ batch_size=block_state.batch_size,
+ )
+
+ setattr(block_state, image_latent_input_name, image_latent_tensor)
+
+ # Process additional batch inputs (only batch expansion)
+ for input_name in self._additional_batch_inputs:
+ input_tensor = getattr(block_state, input_name)
+ if input_tensor is None:
+ continue
+
+ # Only expand batch size
+ input_tensor = repeat_tensor_to_batch_size(
+ input_name=input_name,
+ input_tensor=input_tensor,
+ num_images_per_prompt=block_state.num_images_per_prompt,
+ batch_size=block_state.batch_size,
+ )
+
+ setattr(block_state, input_name, input_tensor)
+
+ self.set_block_state(state, block_state)
+ return components, state
+
+
+class FluxKontextSetResolutionStep(ModularPipelineBlocks):
+ model_name = "flux-kontext"
+
+    @property
+    def description(self):
+ return (
+ "Determines the height and width to be used during the subsequent computations.\n"
+ "It should always be placed _before_ the latent preparation step."
+ )
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ inputs = [
+ InputParam(name="height"),
+ InputParam(name="width"),
+ InputParam(name="max_area", type_hint=int, default=1024**2),
+ ]
+ return inputs
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return [
+ OutputParam(name="height", type_hint=int, description="The height of the initial noisy latents"),
+ OutputParam(name="width", type_hint=int, description="The width of the initial noisy latents"),
+ ]
+
+ @staticmethod
+ def check_inputs(height, width, vae_scale_factor):
+ if height is not None and height % (vae_scale_factor * 2) != 0:
+ raise ValueError(f"Height must be divisible by {vae_scale_factor * 2} but is {height}")
+
+ if width is not None and width % (vae_scale_factor * 2) != 0:
+ raise ValueError(f"Width must be divisible by {vae_scale_factor * 2} but is {width}")
+
+ def __call__(self, components: FluxModularPipeline, state: PipelineState) -> PipelineState:
+ block_state = self.get_block_state(state)
+
+ height = block_state.height or components.default_height
+ width = block_state.width or components.default_width
+ self.check_inputs(height, width, components.vae_scale_factor)
+
+ original_height, original_width = height, width
+ max_area = block_state.max_area
+ aspect_ratio = width / height
+ width = round((max_area * aspect_ratio) ** 0.5)
+ height = round((max_area / aspect_ratio) ** 0.5)
+
+ multiple_of = components.vae_scale_factor * 2
+ width = width // multiple_of * multiple_of
+ height = height // multiple_of * multiple_of
+
+ if height != original_height or width != original_width:
+ logger.warning(
+ f"Generation `height` and `width` have been adjusted to {height} and {width} to fit the model requirements."
+ )
+
+ block_state.height = height
+ block_state.width = width
+
+ self.set_block_state(state, block_state)
+ return components, state
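
To see what `FluxKontextSetResolutionStep` does to a requested resolution, here is a standalone re-implementation of its arithmetic (illustration only; `vae_scale_factor=8` is assumed, the usual value for the Flux VAE):

```py
def set_resolution(height: int, width: int, max_area: int = 1024**2, vae_scale_factor: int = 8):
    # Rescale so the pixel area is close to max_area while keeping the aspect ratio.
    aspect_ratio = width / height
    width = round((max_area * aspect_ratio) ** 0.5)
    height = round((max_area / aspect_ratio) ** 0.5)
    # Snap both sides down to a multiple of vae_scale_factor * 2 so latents can be packed 2x2.
    multiple_of = vae_scale_factor * 2
    width = width // multiple_of * multiple_of
    height = height // multiple_of * multiple_of
    return height, width

print(set_resolution(1000, 1500))  # (832, 1248): ~1.04 MP, 3:2 aspect ratio kept, both divisible by 16
```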
diff --git a/src/diffusers/modular_pipelines/flux/modular_blocks.py b/src/diffusers/modular_pipelines/flux/modular_blocks.py
index 37895bddbf..a80bc2a5f7 100644
--- a/src/diffusers/modular_pipelines/flux/modular_blocks.py
+++ b/src/diffusers/modular_pipelines/flux/modular_blocks.py
@@ -18,21 +18,49 @@ from ..modular_pipeline_utils import InsertableDict
from .before_denoise import (
FluxImg2ImgPrepareLatentsStep,
FluxImg2ImgSetTimestepsStep,
- FluxInputStep,
+ FluxKontextRoPEInputsStep,
FluxPrepareLatentsStep,
+ FluxRoPEInputsStep,
FluxSetTimestepsStep,
)
from .decoders import FluxDecodeStep
-from .denoise import FluxDenoiseStep
-from .encoders import FluxTextEncoderStep, FluxVaeEncoderStep
+from .denoise import FluxDenoiseStep, FluxKontextDenoiseStep
+from .encoders import (
+ FluxKontextProcessImagesInputStep,
+ FluxProcessImagesInputStep,
+ FluxTextEncoderStep,
+ FluxVaeEncoderDynamicStep,
+)
+from .inputs import (
+ FluxInputsDynamicStep,
+ FluxKontextInputsDynamicStep,
+ FluxKontextSetResolutionStep,
+ FluxTextInputStep,
+)
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# vae encoder (run before before_denoise)
+FluxImg2ImgVaeEncoderBlocks = InsertableDict(
+ [("preprocess", FluxProcessImagesInputStep()), ("encode", FluxVaeEncoderDynamicStep())]
+)
+
+
+class FluxImg2ImgVaeEncoderStep(SequentialPipelineBlocks):
+ model_name = "flux"
+
+ block_classes = FluxImg2ImgVaeEncoderBlocks.values()
+ block_names = FluxImg2ImgVaeEncoderBlocks.keys()
+
+ @property
+ def description(self) -> str:
+ return "Vae encoder step that preprocess andencode the image inputs into their latent representations."
+
+
class FluxAutoVaeEncoderStep(AutoPipelineBlocks):
- block_classes = [FluxVaeEncoderStep]
+ block_classes = [FluxImg2ImgVaeEncoderStep]
block_names = ["img2img"]
block_trigger_inputs = ["image"]
@@ -41,52 +69,89 @@ class FluxAutoVaeEncoderStep(AutoPipelineBlocks):
return (
"Vae encoder step that encode the image inputs into their latent representations.\n"
+ "This is an auto pipeline block that works for img2img tasks.\n"
- + " - `FluxVaeEncoderStep` (img2img) is used when only `image` is provided."
- + " - if `image` is provided, step will be skipped."
+ + " - `FluxImg2ImgVaeEncoderStep` (img2img) is used when only `image` is provided."
+ + " - if `image` is not provided, step will be skipped."
)
-# before_denoise: text2img, img2img
-class FluxBeforeDenoiseStep(SequentialPipelineBlocks):
- block_classes = [
- FluxInputStep,
- FluxPrepareLatentsStep,
- FluxSetTimestepsStep,
- ]
- block_names = ["input", "prepare_latents", "set_timesteps"]
+# Flux Kontext vae encoder (run before before_denoise)
+
+FluxKontextVaeEncoderBlocks = InsertableDict(
+ [("preprocess", FluxKontextProcessImagesInputStep()), ("encode", FluxVaeEncoderDynamicStep(sample_mode="argmax"))]
+)
+
+
+class FluxKontextVaeEncoderStep(SequentialPipelineBlocks):
+ model_name = "flux-kontext"
+
+ block_classes = FluxKontextVaeEncoderBlocks.values()
+ block_names = FluxKontextVaeEncoderBlocks.keys()
+
+ @property
+ def description(self) -> str:
+ return "Vae encoder step that preprocess andencode the image inputs into their latent representations."
+
+
+class FluxKontextAutoVaeEncoderStep(AutoPipelineBlocks):
+ block_classes = [FluxKontextVaeEncoderStep]
+ block_names = ["img2img"]
+ block_trigger_inputs = ["image"]
@property
def description(self):
return (
- "Before denoise step that prepare the inputs for the denoise step.\n"
- + "This is a sequential pipeline blocks:\n"
- + " - `FluxInputStep` is used to adjust the batch size of the model inputs\n"
- + " - `FluxPrepareLatentsStep` is used to prepare the latents\n"
- + " - `FluxSetTimestepsStep` is used to set the timesteps\n"
+ "Vae encoder step that encode the image inputs into their latent representations.\n"
+ + "This is an auto pipeline block that works for img2img tasks.\n"
+ + " - `FluxKontextVaeEncoderStep` (img2img) is used when only `image` is provided."
+ + " - if `image` is not provided, step will be skipped."
)
+# before_denoise: text2img
+FluxBeforeDenoiseBlocks = InsertableDict(
+ [
+ ("prepare_latents", FluxPrepareLatentsStep()),
+ ("set_timesteps", FluxSetTimestepsStep()),
+ ("prepare_rope_inputs", FluxRoPEInputsStep()),
+ ]
+)
+
+
+class FluxBeforeDenoiseStep(SequentialPipelineBlocks):
+ block_classes = FluxBeforeDenoiseBlocks.values()
+ block_names = FluxBeforeDenoiseBlocks.keys()
+
+ @property
+ def description(self):
+ return "Before denoise step that prepares the inputs for the denoise step in text-to-image generation."
+
+
# before_denoise: img2img
+FluxImg2ImgBeforeDenoiseBlocks = InsertableDict(
+ [
+ ("prepare_latents", FluxPrepareLatentsStep()),
+ ("set_timesteps", FluxImg2ImgSetTimestepsStep()),
+ ("prepare_img2img_latents", FluxImg2ImgPrepareLatentsStep()),
+ ("prepare_rope_inputs", FluxRoPEInputsStep()),
+ ]
+)
+
+
class FluxImg2ImgBeforeDenoiseStep(SequentialPipelineBlocks):
- block_classes = [FluxInputStep, FluxImg2ImgSetTimestepsStep, FluxImg2ImgPrepareLatentsStep]
- block_names = ["input", "set_timesteps", "prepare_latents"]
+ block_classes = FluxImg2ImgBeforeDenoiseBlocks.values()
+ block_names = FluxImg2ImgBeforeDenoiseBlocks.keys()
@property
def description(self):
- return (
- "Before denoise step that prepare the inputs for the denoise step for img2img task.\n"
- + "This is a sequential pipeline blocks:\n"
- + " - `FluxInputStep` is used to adjust the batch size of the model inputs\n"
- + " - `FluxImg2ImgSetTimestepsStep` is used to set the timesteps\n"
- + " - `FluxImg2ImgPrepareLatentsStep` is used to prepare the latents\n"
- )
+ return "Before denoise step that prepare the inputs for the denoise step for img2img task."
# before_denoise: all task (text2img, img2img)
class FluxAutoBeforeDenoiseStep(AutoPipelineBlocks):
- block_classes = [FluxBeforeDenoiseStep, FluxImg2ImgBeforeDenoiseStep]
- block_names = ["text2image", "img2img"]
- block_trigger_inputs = [None, "image_latents"]
+ model_name = "flux-kontext"
+ block_classes = [FluxImg2ImgBeforeDenoiseStep, FluxBeforeDenoiseStep]
+ block_names = ["img2img", "text2image"]
+ block_trigger_inputs = ["image_latents", None]
@property
def description(self):
@@ -98,6 +163,44 @@ class FluxAutoBeforeDenoiseStep(AutoPipelineBlocks):
)
+# before_denoise: FluxKontext
+
+FluxKontextBeforeDenoiseBlocks = InsertableDict(
+ [
+ ("prepare_latents", FluxPrepareLatentsStep()),
+ ("set_timesteps", FluxSetTimestepsStep()),
+ ("prepare_rope_inputs", FluxKontextRoPEInputsStep()),
+ ]
+)
+
+
+class FluxKontextBeforeDenoiseStep(SequentialPipelineBlocks):
+ block_classes = FluxKontextBeforeDenoiseBlocks.values()
+ block_names = FluxKontextBeforeDenoiseBlocks.keys()
+
+ @property
+ def description(self):
+ return (
+ "Before denoise step that prepare the inputs for the denoise step\n"
+ "for img2img/text2img task for Flux Kontext."
+ )
+
+
+class FluxKontextAutoBeforeDenoiseStep(AutoPipelineBlocks):
+ block_classes = [FluxKontextBeforeDenoiseStep, FluxBeforeDenoiseStep]
+ block_names = ["img2img", "text2image"]
+ block_trigger_inputs = ["image_latents", None]
+
+ @property
+ def description(self):
+ return (
+ "Before denoise step that prepare the inputs for the denoise step.\n"
+ + "This is an auto pipeline block that works for text2image.\n"
+ + " - `FluxBeforeDenoiseStep` (text2image) is used.\n"
+ + " - `FluxKontextBeforeDenoiseStep` (img2img) is used when only `image_latents` is provided.\n"
+ )
+
+
# denoise: text2image
class FluxAutoDenoiseStep(AutoPipelineBlocks):
block_classes = [FluxDenoiseStep]
@@ -113,7 +216,24 @@ class FluxAutoDenoiseStep(AutoPipelineBlocks):
)
-# decode: all task (text2img, img2img, inpainting)
+# denoise: Flux Kontext
+
+
+class FluxKontextAutoDenoiseStep(AutoPipelineBlocks):
+ block_classes = [FluxKontextDenoiseStep]
+ block_names = ["denoise"]
+ block_trigger_inputs = [None]
+
+ @property
+ def description(self) -> str:
+ return (
+ "Denoise step that iteratively denoise the latents for Flux Kontext. "
+ "This is a auto pipeline block that works for text2image and img2img tasks."
+ " - `FluxDenoiseStep` (denoise) for text2image and img2img tasks."
+ )
+
+
+# decode: all task (text2img, img2img)
class FluxAutoDecodeStep(AutoPipelineBlocks):
block_classes = [FluxDecodeStep]
block_names = ["non-inpaint"]
@@ -124,16 +244,143 @@ class FluxAutoDecodeStep(AutoPipelineBlocks):
return "Decode step that decode the denoised latents into image outputs.\n - `FluxDecodeStep`"
-# text2image
-class FluxAutoBlocks(SequentialPipelineBlocks):
- block_classes = [
- FluxTextEncoderStep,
- FluxAutoVaeEncoderStep,
- FluxAutoBeforeDenoiseStep,
- FluxAutoDenoiseStep,
- FluxAutoDecodeStep,
+# inputs: text2image/img2img
+FluxImg2ImgBlocks = InsertableDict(
+ [("text_inputs", FluxTextInputStep()), ("additional_inputs", FluxInputsDynamicStep())]
+)
+
+
+class FluxImg2ImgInputStep(SequentialPipelineBlocks):
+ model_name = "flux"
+ block_classes = FluxImg2ImgBlocks.values()
+ block_names = FluxImg2ImgBlocks.keys()
+
+ @property
+ def description(self):
+ return "Input step that prepares the inputs for the img2img denoising step. It:\n"
+ " - make sure the text embeddings have consistent batch size as well as the additional inputs (`image_latents`).\n"
+ " - update height/width based `image_latents`, patchify `image_latents`."
+
+
+class FluxAutoInputStep(AutoPipelineBlocks):
+ block_classes = [FluxImg2ImgInputStep, FluxTextInputStep]
+ block_names = ["img2img", "text2image"]
+ block_trigger_inputs = ["image_latents", None]
+
+ @property
+ def description(self):
+ return (
+ "Input step that standardize the inputs for the denoising step, e.g. make sure inputs have consistent batch size, and patchified. \n"
+ " This is an auto pipeline block that works for text2image/img2img tasks.\n"
+ + " - `FluxImg2ImgInputStep` (img2img) is used when `image_latents` is provided.\n"
+ + " - `FluxTextInputStep` (text2image) is used when `image_latents` are not provided.\n"
+ )
+
+
+# inputs: Flux Kontext
+
+FluxKontextBlocks = InsertableDict(
+ [
+ ("set_resolution", FluxKontextSetResolutionStep()),
+ ("text_inputs", FluxTextInputStep()),
+ ("additional_inputs", FluxKontextInputsDynamicStep()),
]
- block_names = ["text_encoder", "image_encoder", "before_denoise", "denoise", "decoder"]
+)
+
+
+class FluxKontextInputStep(SequentialPipelineBlocks):
+ model_name = "flux-kontext"
+ block_classes = FluxKontextBlocks.values()
+ block_names = FluxKontextBlocks.keys()
+
+ @property
+ def description(self):
+ return (
+            "Input step that prepares the inputs for both the text2img and img2img denoising steps. It:\n"
+            " - makes sure the text embeddings and the additional inputs (`image_latents`) have a consistent batch size.\n"
+            " - updates height/width based on `image_latents` and patchifies `image_latents`."
+ )
+
+
+class FluxKontextAutoInputStep(AutoPipelineBlocks):
+    block_classes = [FluxKontextInputStep, FluxTextInputStep]
+    block_names = ["img2img", "text2img"]
+    block_trigger_inputs = ["image_latents", None]
+
+ @property
+ def description(self):
+ return (
+            "Input step that standardizes the inputs for the denoising step, e.g. makes sure inputs have a consistent batch size and are patchified.\n"
+            " This is an auto pipeline block that works for text2image/img2img tasks.\n"
+            + " - `FluxKontextInputStep` (img2img) is used when `image_latents` is provided.\n"
+            + " - `FluxTextInputStep` (text2img) is used when `image_latents` is not provided."
+ )
+
+
+class FluxCoreDenoiseStep(SequentialPipelineBlocks):
+ model_name = "flux"
+ block_classes = [FluxAutoInputStep, FluxAutoBeforeDenoiseStep, FluxAutoDenoiseStep]
+ block_names = ["input", "before_denoise", "denoise"]
+
+ @property
+ def description(self):
+ return (
+ "Core step that performs the denoising process. \n"
+ + " - `FluxAutoInputStep` (input) standardizes the inputs for the denoising step.\n"
+ + " - `FluxAutoBeforeDenoiseStep` (before_denoise) prepares the inputs for the denoising step.\n"
+ + " - `FluxAutoDenoiseStep` (denoise) iteratively denoises the latents.\n"
+ + "This step supports text-to-image and image-to-image tasks for Flux:\n"
+ + " - for image-to-image generation, you need to provide `image_latents`\n"
+ + " - for text-to-image generation, all you need to provide is prompt embeddings."
+ )
+
+
+class FluxKontextCoreDenoiseStep(SequentialPipelineBlocks):
+ model_name = "flux-kontext"
+ block_classes = [FluxKontextAutoInputStep, FluxKontextAutoBeforeDenoiseStep, FluxKontextAutoDenoiseStep]
+ block_names = ["input", "before_denoise", "denoise"]
+
+ @property
+ def description(self):
+ return (
+ "Core step that performs the denoising process. \n"
+ + " - `FluxKontextAutoInputStep` (input) standardizes the inputs for the denoising step.\n"
+ + " - `FluxKontextAutoBeforeDenoiseStep` (before_denoise) prepares the inputs for the denoising step.\n"
+ + " - `FluxKontextAutoDenoiseStep` (denoise) iteratively denoises the latents.\n"
+            + "This step supports text-to-image and image-to-image tasks for Flux Kontext:\n"
+ + " - for image-to-image generation, you need to provide `image_latents`\n"
+ + " - for text-to-image generation, all you need to provide is prompt embeddings."
+ )
+
+
+# Auto blocks (text2image and img2img)
+AUTO_BLOCKS = InsertableDict(
+ [
+ ("text_encoder", FluxTextEncoderStep()),
+ ("image_encoder", FluxAutoVaeEncoderStep()),
+ ("denoise", FluxCoreDenoiseStep()),
+ ("decode", FluxDecodeStep()),
+ ]
+)
+
+AUTO_BLOCKS_KONTEXT = InsertableDict(
+ [
+ ("text_encoder", FluxTextEncoderStep()),
+ ("image_encoder", FluxKontextAutoVaeEncoderStep()),
+ ("denoise", FluxKontextCoreDenoiseStep()),
+ ("decode", FluxDecodeStep()),
+ ]
+)
+
+
+class FluxAutoBlocks(SequentialPipelineBlocks):
+ model_name = "flux"
+
+ block_classes = AUTO_BLOCKS.values()
+ block_names = AUTO_BLOCKS.keys()
@property
def description(self):
@@ -144,38 +391,56 @@ class FluxAutoBlocks(SequentialPipelineBlocks):
)
+class FluxKontextAutoBlocks(FluxAutoBlocks):
+ model_name = "flux-kontext"
+
+ block_classes = AUTO_BLOCKS_KONTEXT.values()
+ block_names = AUTO_BLOCKS_KONTEXT.keys()
+
+
TEXT2IMAGE_BLOCKS = InsertableDict(
[
- ("text_encoder", FluxTextEncoderStep),
- ("input", FluxInputStep),
- ("prepare_latents", FluxPrepareLatentsStep),
- ("set_timesteps", FluxSetTimestepsStep),
- ("denoise", FluxDenoiseStep),
- ("decode", FluxDecodeStep),
+ ("text_encoder", FluxTextEncoderStep()),
+ ("input", FluxTextInputStep()),
+ ("prepare_latents", FluxPrepareLatentsStep()),
+ ("set_timesteps", FluxSetTimestepsStep()),
+ ("prepare_rope_inputs", FluxRoPEInputsStep()),
+ ("denoise", FluxDenoiseStep()),
+ ("decode", FluxDecodeStep()),
]
)
IMAGE2IMAGE_BLOCKS = InsertableDict(
[
- ("text_encoder", FluxTextEncoderStep),
- ("image_encoder", FluxVaeEncoderStep),
- ("input", FluxInputStep),
- ("set_timesteps", FluxImg2ImgSetTimestepsStep),
- ("prepare_latents", FluxImg2ImgPrepareLatentsStep),
- ("denoise", FluxDenoiseStep),
- ("decode", FluxDecodeStep),
+ ("text_encoder", FluxTextEncoderStep()),
+ ("vae_encoder", FluxVaeEncoderDynamicStep()),
+ ("input", FluxImg2ImgInputStep()),
+ ("prepare_latents", FluxPrepareLatentsStep()),
+ ("set_timesteps", FluxImg2ImgSetTimestepsStep()),
+ ("prepare_img2img_latents", FluxImg2ImgPrepareLatentsStep()),
+ ("prepare_rope_inputs", FluxRoPEInputsStep()),
+ ("denoise", FluxDenoiseStep()),
+ ("decode", FluxDecodeStep()),
]
)
-AUTO_BLOCKS = InsertableDict(
+FLUX_KONTEXT_BLOCKS = InsertableDict(
[
- ("text_encoder", FluxTextEncoderStep),
- ("image_encoder", FluxAutoVaeEncoderStep),
- ("before_denoise", FluxAutoBeforeDenoiseStep),
- ("denoise", FluxAutoDenoiseStep),
- ("decode", FluxAutoDecodeStep),
+ ("text_encoder", FluxTextEncoderStep()),
+ ("vae_encoder", FluxVaeEncoderDynamicStep(sample_mode="argmax")),
+ ("input", FluxKontextInputStep()),
+ ("prepare_latents", FluxPrepareLatentsStep()),
+ ("set_timesteps", FluxSetTimestepsStep()),
+ ("prepare_rope_inputs", FluxKontextRoPEInputsStep()),
+ ("denoise", FluxKontextDenoiseStep()),
+ ("decode", FluxDecodeStep()),
]
)
-
-ALL_BLOCKS = {"text2image": TEXT2IMAGE_BLOCKS, "img2img": IMAGE2IMAGE_BLOCKS, "auto": AUTO_BLOCKS}
+ALL_BLOCKS = {
+ "text2image": TEXT2IMAGE_BLOCKS,
+ "img2img": IMAGE2IMAGE_BLOCKS,
+ "auto": AUTO_BLOCKS,
+ "auto_kontext": AUTO_BLOCKS_KONTEXT,
+ "kontext": FLUX_KONTEXT_BLOCKS,
+}
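
For orientation, here is a minimal sketch, outside the patch, of how the new Flux presets could be inspected once this lands. It only relies on names introduced in the diff above (`ALL_BLOCKS`, `FluxKontextAutoBlocks`), assumes `InsertableDict` exposes the usual dict interface, and does not load any model components.

```py
from diffusers.modular_pipelines.flux.modular_blocks import ALL_BLOCKS, FluxKontextAutoBlocks

# The new "kontext" and "auto_kontext" presets register alongside the existing ones.
for task, preset in ALL_BLOCKS.items():
    print(task, "->", list(preset.keys()))

# The auto blocks bundle text2image and img2img behind trigger inputs.
kontext_blocks = FluxKontextAutoBlocks()
print(kontext_blocks.block_names)  # text_encoder, image_encoder, denoise, decode
print(kontext_blocks.description)  # inherited from FluxAutoBlocks
```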
diff --git a/src/diffusers/modular_pipelines/flux/modular_pipeline.py b/src/diffusers/modular_pipelines/flux/modular_pipeline.py
index 563b033343..d8158f5d4f 100644
--- a/src/diffusers/modular_pipelines/flux/modular_pipeline.py
+++ b/src/diffusers/modular_pipelines/flux/modular_pipeline.py
@@ -55,3 +55,13 @@ class FluxModularPipeline(ModularPipeline, FluxLoraLoaderMixin, TextualInversion
if getattr(self, "transformer", None):
num_channels_latents = self.transformer.config.in_channels // 4
return num_channels_latents
+
+
+class FluxKontextModularPipeline(FluxModularPipeline):
+ """
+ A ModularPipeline for Flux Kontext.
+
+ > [!WARNING] > This is an experimental feature and is likely to change in the future.
+ """
+
+ default_blocks_name = "FluxKontextAutoBlocks"
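
The new pipeline class itself is thin; the wiring happens through the `model_name`-to-class-name mapping extended in the next hunk. Below is a simplified, hypothetical illustration of that lookup (the helper function is not the library's actual resolution code; only the mapping entries come from the diff).

```py
from collections import OrderedDict

# Subset of MODULAR_PIPELINE_MAPPING after this change.
MODULAR_PIPELINE_MAPPING = OrderedDict(
    [
        ("flux", "FluxModularPipeline"),
        ("flux-kontext", "FluxKontextModularPipeline"),
        ("qwenimage-edit", "QwenImageEditModularPipeline"),
        ("qwenimage-edit-plus", "QwenImageEditPlusModularPipeline"),
    ]
)


def resolve_pipeline_class_name(model_name: str) -> str:
    # Hypothetical helper: fall back to the generic class when the key is unknown.
    return MODULAR_PIPELINE_MAPPING.get(model_name, "ModularPipeline")


print(resolve_pipeline_class_name("flux-kontext"))  # FluxKontextModularPipeline
```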
diff --git a/src/diffusers/modular_pipelines/modular_pipeline.py b/src/diffusers/modular_pipelines/modular_pipeline.py
index 037c9e323c..cfbca48a98 100644
--- a/src/diffusers/modular_pipelines/modular_pipeline.py
+++ b/src/diffusers/modular_pipelines/modular_pipeline.py
@@ -57,8 +57,10 @@ MODULAR_PIPELINE_MAPPING = OrderedDict(
("stable-diffusion-xl", "StableDiffusionXLModularPipeline"),
("wan", "WanModularPipeline"),
("flux", "FluxModularPipeline"),
+ ("flux-kontext", "FluxKontextModularPipeline"),
("qwenimage", "QwenImageModularPipeline"),
("qwenimage-edit", "QwenImageEditModularPipeline"),
+ ("qwenimage-edit-plus", "QwenImageEditPlusModularPipeline"),
]
)
@@ -1628,7 +1630,8 @@ class ModularPipeline(ConfigMixin, PushToHubMixin):
blocks = ModularPipelineBlocks.from_pretrained(
pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
)
- except EnvironmentError:
+ except EnvironmentError as e:
+ logger.debug(f"EnvironmentError: {e}")
blocks = None
cache_dir = kwargs.pop("cache_dir", None)
diff --git a/src/diffusers/modular_pipelines/qwenimage/__init__.py b/src/diffusers/modular_pipelines/qwenimage/__init__.py
index 81cf515730..ae4ec4799f 100644
--- a/src/diffusers/modular_pipelines/qwenimage/__init__.py
+++ b/src/diffusers/modular_pipelines/qwenimage/__init__.py
@@ -29,13 +29,20 @@ else:
"EDIT_AUTO_BLOCKS",
"EDIT_BLOCKS",
"EDIT_INPAINT_BLOCKS",
+ "EDIT_PLUS_AUTO_BLOCKS",
+ "EDIT_PLUS_BLOCKS",
"IMAGE2IMAGE_BLOCKS",
"INPAINT_BLOCKS",
"TEXT2IMAGE_BLOCKS",
"QwenImageAutoBlocks",
"QwenImageEditAutoBlocks",
+ "QwenImageEditPlusAutoBlocks",
+ ]
+ _import_structure["modular_pipeline"] = [
+ "QwenImageEditModularPipeline",
+ "QwenImageEditPlusModularPipeline",
+ "QwenImageModularPipeline",
]
- _import_structure["modular_pipeline"] = ["QwenImageEditModularPipeline", "QwenImageModularPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
@@ -54,13 +61,20 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
EDIT_AUTO_BLOCKS,
EDIT_BLOCKS,
EDIT_INPAINT_BLOCKS,
+ EDIT_PLUS_AUTO_BLOCKS,
+ EDIT_PLUS_BLOCKS,
IMAGE2IMAGE_BLOCKS,
INPAINT_BLOCKS,
TEXT2IMAGE_BLOCKS,
QwenImageAutoBlocks,
QwenImageEditAutoBlocks,
+ QwenImageEditPlusAutoBlocks,
+ )
+ from .modular_pipeline import (
+ QwenImageEditModularPipeline,
+ QwenImageEditPlusModularPipeline,
+ QwenImageModularPipeline,
)
- from .modular_pipeline import QwenImageEditModularPipeline, QwenImageModularPipeline
else:
import sys
diff --git a/src/diffusers/modular_pipelines/qwenimage/before_denoise.py b/src/diffusers/modular_pipelines/qwenimage/before_denoise.py
index c370157d9c..298aeea6f2 100644
--- a/src/diffusers/modular_pipelines/qwenimage/before_denoise.py
+++ b/src/diffusers/modular_pipelines/qwenimage/before_denoise.py
@@ -129,7 +129,6 @@ class QwenImagePrepareLatentsStep(ModularPipelineBlocks):
block_state.latents = components.pachifier.pack_latents(block_state.latents)
self.set_block_state(state, block_state)
-
return components, state
@@ -497,7 +496,7 @@ class QwenImageEditRoPEInputsStep(ModularPipelineBlocks):
@property
def description(self) -> str:
- return "Step that prepares the RoPE inputs for denoising process. This is used in QwenImage Edit. Should be place after prepare_latents step"
+ return "Step that prepares the RoPE inputs for denoising process. This is used in QwenImage Edit. Should be placed after prepare_latents step"
@property
def inputs(self) -> List[InputParam]:
diff --git a/src/diffusers/modular_pipelines/qwenimage/encoders.py b/src/diffusers/modular_pipelines/qwenimage/encoders.py
index a1382e84aa..527e58da2f 100644
--- a/src/diffusers/modular_pipelines/qwenimage/encoders.py
+++ b/src/diffusers/modular_pipelines/qwenimage/encoders.py
@@ -129,6 +129,61 @@ def get_qwen_prompt_embeds_edit(
return prompt_embeds, encoder_attention_mask
+def get_qwen_prompt_embeds_edit_plus(
+ text_encoder,
+ processor,
+ prompt: Union[str, List[str]] = None,
+ image: Optional[Union[torch.Tensor, List[PIL.Image.Image], PIL.Image.Image]] = None,
+ prompt_template_encode: str = "<|im_start|>system\nDescribe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
+ img_template_encode: str = "Picture {}: <|vision_start|><|image_pad|><|vision_end|>",
+ prompt_template_encode_start_idx: int = 64,
+ device: Optional[torch.device] = None,
+):
+ prompt = [prompt] if isinstance(prompt, str) else prompt
+ if isinstance(image, list):
+ base_img_prompt = ""
+ for i, img in enumerate(image):
+ base_img_prompt += img_template_encode.format(i + 1)
+ elif image is not None:
+ base_img_prompt = img_template_encode.format(1)
+ else:
+ base_img_prompt = ""
+
+ template = prompt_template_encode
+
+ drop_idx = prompt_template_encode_start_idx
+ txt = [template.format(base_img_prompt + e) for e in prompt]
+
+ model_inputs = processor(
+ text=txt,
+ images=image,
+ padding=True,
+ return_tensors="pt",
+ ).to(device)
+ outputs = text_encoder(
+ input_ids=model_inputs.input_ids,
+ attention_mask=model_inputs.attention_mask,
+ pixel_values=model_inputs.pixel_values,
+ image_grid_thw=model_inputs.image_grid_thw,
+ output_hidden_states=True,
+ )
+
+ hidden_states = outputs.hidden_states[-1]
+ split_hidden_states = _extract_masked_hidden(hidden_states, model_inputs.attention_mask)
+ split_hidden_states = [e[drop_idx:] for e in split_hidden_states]
+ attn_mask_list = [torch.ones(e.size(0), dtype=torch.long, device=e.device) for e in split_hidden_states]
+ max_seq_len = max([e.size(0) for e in split_hidden_states])
+ prompt_embeds = torch.stack(
+ [torch.cat([u, u.new_zeros(max_seq_len - u.size(0), u.size(1))]) for u in split_hidden_states]
+ )
+ encoder_attention_mask = torch.stack(
+ [torch.cat([u, u.new_zeros(max_seq_len - u.size(0))]) for u in attn_mask_list]
+ )
+
+ prompt_embeds = prompt_embeds.to(device=device)
+ return prompt_embeds, encoder_attention_mask
+
+
# Modified from diffusers.pipelines.qwenimage.pipeline_qwenimage.QwenImagePipeline._encode_vae_image
def encode_vae_image(
image: torch.Tensor,
@@ -253,6 +308,83 @@ class QwenImageEditResizeDynamicStep(ModularPipelineBlocks):
return components, state
+class QwenImageEditPlusResizeDynamicStep(QwenImageEditResizeDynamicStep):
+ model_name = "qwenimage"
+
+ def __init__(
+ self,
+ input_name: str = "image",
+ output_name: str = "resized_image",
+ vae_image_output_name: str = "vae_image",
+ ):
+ """Create a configurable step for resizing images to the target area (1024 * 1024) while maintaining the aspect ratio.
+
+ This block resizes an input image or a list input images and exposes the resized result under configurable
+ input and output names. Use this when you need to wire the resize step to different image fields (e.g.,
+ "image", "control_image")
+
+ Args:
+ input_name (str, optional): Name of the image field to read from the
+ pipeline state. Defaults to "image".
+ output_name (str, optional): Name of the resized image field to write
+ back to the pipeline state. Defaults to "resized_image".
+ vae_image_output_name (str, optional): Name of the image field
+ to write back to the pipeline state. This is used by the VAE encoder step later on. QwenImage Edit Plus
+ processes the input image(s) differently for the VL and the VAE.
+ """
+ if not isinstance(input_name, str) or not isinstance(output_name, str):
+ raise ValueError(
+ f"input_name and output_name must be strings but are {type(input_name)} and {type(output_name)}"
+ )
+ self.condition_image_size = 384 * 384
+ self._image_input_name = input_name
+ self._resized_image_output_name = output_name
+ self._vae_image_output_name = vae_image_output_name
+ super().__init__()
+
+ @property
+ def intermediate_outputs(self) -> List[OutputParam]:
+ return super().intermediate_outputs + [
+ OutputParam(
+ name=self._vae_image_output_name,
+ type_hint=List[PIL.Image.Image],
+ description="The images to be processed which will be further used by the VAE encoder.",
+ ),
+ ]
+
+ @torch.no_grad()
+ def __call__(self, components: QwenImageModularPipeline, state: PipelineState):
+ block_state = self.get_block_state(state)
+
+ images = getattr(block_state, self._image_input_name)
+
+ if not is_valid_image_imagelist(images):
+ raise ValueError(f"Images must be image or list of images but are {type(images)}")
+
+ if (
+ not isinstance(images, torch.Tensor)
+ and isinstance(images, PIL.Image.Image)
+ and not isinstance(images, list)
+ ):
+ images = [images]
+
+ # TODO (sayakpaul): revisit this when the inputs are `torch.Tensor`s
+ condition_images = []
+ vae_images = []
+ for img in images:
+ image_width, image_height = img.size
+ condition_width, condition_height, _ = calculate_dimensions(
+ self.condition_image_size, image_width / image_height
+ )
+ condition_images.append(components.image_resize_processor.resize(img, condition_height, condition_width))
+ vae_images.append(img)
+
+ setattr(block_state, self._resized_image_output_name, condition_images)
+ setattr(block_state, self._vae_image_output_name, vae_images)
+ self.set_block_state(state, block_state)
+ return components, state
+
+
class QwenImageTextEncoderStep(ModularPipelineBlocks):
model_name = "qwenimage"
@@ -498,6 +630,61 @@ class QwenImageEditTextEncoderStep(ModularPipelineBlocks):
return components, state
+class QwenImageEditPlusTextEncoderStep(QwenImageEditTextEncoderStep):
+ model_name = "qwenimage"
+
+ @property
+ def expected_configs(self) -> List[ConfigSpec]:
+ return [
+ ConfigSpec(
+ name="prompt_template_encode",
+ default="<|im_start|>system\nDescribe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
+ ),
+ ConfigSpec(
+ name="img_template_encode",
+ default="Picture {}: <|vision_start|><|image_pad|><|vision_end|>",
+ ),
+ ConfigSpec(name="prompt_template_encode_start_idx", default=64),
+ ]
+
+ @torch.no_grad()
+ def __call__(self, components: QwenImageModularPipeline, state: PipelineState):
+ block_state = self.get_block_state(state)
+
+ self.check_inputs(block_state.prompt, block_state.negative_prompt)
+
+ device = components._execution_device
+
+ block_state.prompt_embeds, block_state.prompt_embeds_mask = get_qwen_prompt_embeds_edit_plus(
+ components.text_encoder,
+ components.processor,
+ prompt=block_state.prompt,
+ image=block_state.resized_image,
+ prompt_template_encode=components.config.prompt_template_encode,
+ img_template_encode=components.config.img_template_encode,
+ prompt_template_encode_start_idx=components.config.prompt_template_encode_start_idx,
+ device=device,
+ )
+
+ if components.requires_unconditional_embeds:
+ negative_prompt = block_state.negative_prompt or " "
+ block_state.negative_prompt_embeds, block_state.negative_prompt_embeds_mask = (
+ get_qwen_prompt_embeds_edit_plus(
+ components.text_encoder,
+ components.processor,
+ prompt=negative_prompt,
+ image=block_state.resized_image,
+ prompt_template_encode=components.config.prompt_template_encode,
+ img_template_encode=components.config.img_template_encode,
+ prompt_template_encode_start_idx=components.config.prompt_template_encode_start_idx,
+ device=device,
+ )
+ )
+
+ self.set_block_state(state, block_state)
+ return components, state
+
+
class QwenImageInpaintProcessImagesInputStep(ModularPipelineBlocks):
model_name = "qwenimage"
@@ -599,12 +786,7 @@ class QwenImageProcessImagesInputStep(ModularPipelineBlocks):
@property
def inputs(self) -> List[InputParam]:
- return [
- InputParam("resized_image"),
- InputParam("image"),
- InputParam("height"),
- InputParam("width"),
- ]
+ return [InputParam("resized_image"), InputParam("image"), InputParam("height"), InputParam("width")]
@property
def intermediate_outputs(self) -> List[OutputParam]:
@@ -648,6 +830,47 @@ class QwenImageProcessImagesInputStep(ModularPipelineBlocks):
return components, state
+class QwenImageEditPlusProcessImagesInputStep(QwenImageProcessImagesInputStep):
+ model_name = "qwenimage-edit-plus"
+ vae_image_size = 1024 * 1024
+
+ @property
+ def description(self) -> str:
+ return "Image Preprocess step for QwenImage Edit Plus. Unlike QwenImage Edit, QwenImage Edit Plus doesn't use the same resized image for further preprocessing."
+
+ @property
+ def inputs(self) -> List[InputParam]:
+ return [InputParam("vae_image"), InputParam("image"), InputParam("height"), InputParam("width")]
+
+ @torch.no_grad()
+ def __call__(self, components: QwenImageModularPipeline, state: PipelineState):
+ block_state = self.get_block_state(state)
+
+ if block_state.vae_image is None and block_state.image is None:
+ raise ValueError("`vae_image` and `image` cannot be None at the same time")
+
+ if block_state.vae_image is None:
+ image = block_state.image
+ self.check_inputs(
+ height=block_state.height, width=block_state.width, vae_scale_factor=components.vae_scale_factor
+ )
+ height = block_state.height or components.default_height
+ width = block_state.width or components.default_width
+ else:
+ width, height = block_state.vae_image[0].size
+ image = block_state.vae_image
+
+ block_state.processed_image = components.image_processor.preprocess(
+ image=image, height=height, width=width
+ )
+
+ self.set_block_state(state, block_state)
+ return components, state
+
+
class QwenImageVaeEncoderDynamicStep(ModularPipelineBlocks):
model_name = "qwenimage"
@@ -725,7 +948,6 @@ class QwenImageVaeEncoderDynamicStep(ModularPipelineBlocks):
dtype=dtype,
latent_channels=components.num_channels_latents,
)
-
setattr(block_state, self._image_latents_output_name, image_latents)
self.set_block_state(state, block_state)
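
Before moving on to the block presets, here is a small self-contained sketch of the text-template logic that `get_qwen_prompt_embeds_edit_plus` applies above: each input image contributes a numbered `Picture {i}` vision placeholder that is prepended to the user instruction before the chat template is filled in. The templates below are truncated stand-ins for readability; the real defaults are the ones shown in the diff.

```py
img_template = "Picture {}: <|vision_start|><|image_pad|><|vision_end|>"
# Truncated stand-in for the full prompt_template_encode default used above.
prompt_template = "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"


def build_edit_plus_text(prompt: str, num_images: int) -> str:
    # One numbered vision placeholder per input image, prepended to the instruction.
    base_img_prompt = "".join(img_template.format(i + 1) for i in range(num_images))
    return prompt_template.format(base_img_prompt + prompt)


print(build_edit_plus_text("Blend the two scenes into one image.", num_images=2))
# -> "Picture 1: <|vision_start|>...<|vision_end|>Picture 2: ..." followed by the instruction
```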
diff --git a/src/diffusers/modular_pipelines/qwenimage/modular_blocks.py b/src/diffusers/modular_pipelines/qwenimage/modular_blocks.py
index 9126766cc2..83bfcb3da4 100644
--- a/src/diffusers/modular_pipelines/qwenimage/modular_blocks.py
+++ b/src/diffusers/modular_pipelines/qwenimage/modular_blocks.py
@@ -37,6 +37,9 @@ from .denoise import (
)
from .encoders import (
QwenImageControlNetVaeEncoderStep,
+ QwenImageEditPlusProcessImagesInputStep,
+ QwenImageEditPlusResizeDynamicStep,
+ QwenImageEditPlusTextEncoderStep,
QwenImageEditResizeDynamicStep,
QwenImageEditTextEncoderStep,
QwenImageInpaintProcessImagesInputStep,
@@ -872,7 +875,151 @@ class QwenImageEditAutoBlocks(SequentialPipelineBlocks):
)
-# 3. all block presets supported in QwenImage & QwenImage-Edit
+#################### QwenImage Edit Plus #####################
+
+# 3. QwenImage-Edit Plus
+
+## 3.1 QwenImage-Edit Plus / edit
+
+#### QwenImage-Edit Plus vl encoder: take both image and text prompts
+QwenImageEditPlusVLEncoderBlocks = InsertableDict(
+ [
+ ("resize", QwenImageEditPlusResizeDynamicStep()),
+ ("encode", QwenImageEditPlusTextEncoderStep()),
+ ]
+)
+
+
+class QwenImageEditPlusVLEncoderStep(SequentialPipelineBlocks):
+ model_name = "qwenimage"
+ block_classes = QwenImageEditPlusVLEncoderBlocks.values()
+ block_names = QwenImageEditPlusVLEncoderBlocks.keys()
+
+ @property
+ def description(self) -> str:
+ return "QwenImage-Edit Plus VL encoder step that encode the image an text prompts together."
+
+
+#### QwenImage-Edit Plus vae encoder
+QwenImageEditPlusVaeEncoderBlocks = InsertableDict(
+ [
+ ("resize", QwenImageEditPlusResizeDynamicStep()), # edit plus has a different resize step
+ ("preprocess", QwenImageEditPlusProcessImagesInputStep()), # vae_image -> processed_image
+ ("encode", QwenImageVaeEncoderDynamicStep()), # processed_image -> image_latents
+ ]
+)
+
+
+class QwenImageEditPlusVaeEncoderStep(SequentialPipelineBlocks):
+ model_name = "qwenimage"
+ block_classes = QwenImageEditPlusVaeEncoderBlocks.values()
+ block_names = QwenImageEditPlusVaeEncoderBlocks.keys()
+
+ @property
+ def description(self) -> str:
+ return "Vae encoder step that encode the image inputs into their latent representations."
+
+
+#### QwenImage Edit Plus presets
+EDIT_PLUS_BLOCKS = InsertableDict(
+ [
+ ("text_encoder", QwenImageEditPlusVLEncoderStep()),
+ ("vae_encoder", QwenImageEditPlusVaeEncoderStep()),
+ ("input", QwenImageEditInputStep()),
+ ("prepare_latents", QwenImagePrepareLatentsStep()),
+ ("set_timesteps", QwenImageSetTimestepsStep()),
+ ("prepare_rope_inputs", QwenImageEditRoPEInputsStep()),
+ ("denoise", QwenImageEditDenoiseStep()),
+ ("decode", QwenImageDecodeStep()),
+ ]
+)
+
+
+# auto before_denoise step for edit tasks
+class QwenImageEditPlusAutoBeforeDenoiseStep(AutoPipelineBlocks):
+ model_name = "qwenimage-edit-plus"
+ block_classes = [QwenImageEditBeforeDenoiseStep]
+ block_names = ["edit"]
+ block_trigger_inputs = ["image_latents"]
+
+ @property
+ def description(self):
+ return (
+            "Before denoise step that prepares the inputs (timesteps, latents, rope inputs etc.) for the denoise step.\n"
+            + "This is an auto pipeline block that works for the edit (img2img) task.\n"
+            + " - `QwenImageEditBeforeDenoiseStep` (edit) is used when `image_latents` is provided.\n"
+            + " - if `image_latents` is not provided, the step will be skipped."
+ )
+
+
+## 3.2 QwenImage-Edit Plus/auto encoders
+
+
+class QwenImageEditPlusAutoVaeEncoderStep(AutoPipelineBlocks):
+ block_classes = [
+ QwenImageEditPlusVaeEncoderStep,
+ ]
+ block_names = ["edit"]
+ block_trigger_inputs = ["image"]
+
+ @property
+ def description(self):
+ return (
+            "Vae encoder step that encodes the image inputs into their latent representations.\n"
+            "This is an auto pipeline block that works for the edit task.\n"
+            + " - `QwenImageEditPlusVaeEncoderStep` (edit) is used when `image` is provided.\n"
+            + " - if `image` is not provided, the step will be skipped."
+ )
+
+
+## 3.3 QwenImage-Edit Plus/auto blocks & presets
+
+
+class QwenImageEditPlusCoreDenoiseStep(SequentialPipelineBlocks):
+ model_name = "qwenimage-edit-plus"
+ block_classes = [
+ QwenImageEditAutoInputStep,
+ QwenImageEditPlusAutoBeforeDenoiseStep,
+ QwenImageEditAutoDenoiseStep,
+ ]
+ block_names = ["input", "before_denoise", "denoise"]
+
+ @property
+ def description(self):
+ return (
+ "Core step that performs the denoising process. \n"
+ + " - `QwenImageEditAutoInputStep` (input) standardizes the inputs for the denoising step.\n"
+ + " - `QwenImageEditPlusAutoBeforeDenoiseStep` (before_denoise) prepares the inputs for the denoising step.\n"
+ + " - `QwenImageEditAutoDenoiseStep` (denoise) iteratively denoises the latents.\n\n"
+            + "This step supports the edit (img2img) workflow for QwenImage Edit Plus:\n"
+            + " - When `image_latents` is provided, it will be used for the edit (img2img) task.\n"
+ )
+
+
+EDIT_PLUS_AUTO_BLOCKS = InsertableDict(
+ [
+ ("text_encoder", QwenImageEditPlusVLEncoderStep()),
+ ("vae_encoder", QwenImageEditPlusAutoVaeEncoderStep()),
+ ("denoise", QwenImageEditPlusCoreDenoiseStep()),
+ ("decode", QwenImageAutoDecodeStep()),
+ ]
+)
+
+
+class QwenImageEditPlusAutoBlocks(SequentialPipelineBlocks):
+ model_name = "qwenimage-edit-plus"
+ block_classes = EDIT_PLUS_AUTO_BLOCKS.values()
+ block_names = EDIT_PLUS_AUTO_BLOCKS.keys()
+
+ @property
+ def description(self):
+ return (
+            "Auto Modular pipeline for the edit (img2img) task using QwenImage-Edit Plus.\n"
+            + "- for edit (img2img) generation, you need to provide `image`.\n"
+ )
+
+
+# 4. all block presets supported in QwenImage, QwenImage-Edit, QwenImage-Edit Plus
ALL_BLOCKS = {
@@ -880,8 +1027,10 @@ ALL_BLOCKS = {
"img2img": IMAGE2IMAGE_BLOCKS,
"edit": EDIT_BLOCKS,
"edit_inpaint": EDIT_INPAINT_BLOCKS,
+ "edit_plus": EDIT_PLUS_BLOCKS,
"inpaint": INPAINT_BLOCKS,
"controlnet": CONTROLNET_BLOCKS,
"auto": AUTO_BLOCKS,
"edit_auto": EDIT_AUTO_BLOCKS,
+ "edit_plus_auto": EDIT_PLUS_AUTO_BLOCKS,
}
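
The `AutoPipelineBlocks` presets above all follow the dispatch rule their descriptions spell out: candidate sub-blocks are tried in order, the first one whose trigger input is present wins, and a `None` trigger acts as the fallback (if there is no fallback and nothing matches, the step is skipped). The sketch below is a simplified, hypothetical restatement of that rule, not the library's implementation.

```py
from typing import Optional


def select_block(block_names, block_trigger_inputs, provided_inputs) -> Optional[str]:
    # First block whose trigger input is present wins; None acts as the fallback.
    for name, trigger in zip(block_names, block_trigger_inputs):
        if trigger is None or provided_inputs.get(trigger) is not None:
            return name
    return None  # nothing matched and no fallback -> the step is skipped


# QwenImageEditPlusAutoVaeEncoderStep: runs only when `image` is provided.
print(select_block(["edit"], ["image"], {"image": "<PIL image>"}))  # edit
print(select_block(["edit"], ["image"], {}))                        # None (skipped)

# FluxKontextAutoInputStep: falls back to the text2img branch without `image_latents`.
print(select_block(["img2img", "text2img"], ["image_latents", None], {}))  # text2img
```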
diff --git a/src/diffusers/modular_pipelines/qwenimage/modular_pipeline.py b/src/diffusers/modular_pipelines/qwenimage/modular_pipeline.py
index 7200169923..d9e30864f6 100644
--- a/src/diffusers/modular_pipelines/qwenimage/modular_pipeline.py
+++ b/src/diffusers/modular_pipelines/qwenimage/modular_pipeline.py
@@ -196,3 +196,13 @@ class QwenImageEditModularPipeline(ModularPipeline, QwenImageLoraLoaderMixin):
requires_unconditional_embeds = self.guider._enabled and self.guider.num_conditions > 1
return requires_unconditional_embeds
+
+
+class QwenImageEditPlusModularPipeline(QwenImageEditModularPipeline):
+ """
+ A ModularPipeline for QwenImage-Edit Plus.
+
+ > [!WARNING] > This is an experimental feature and is likely to change in the future.
+ """
+
+ default_blocks_name = "QwenImageEditPlusAutoBlocks"
diff --git a/src/diffusers/pipelines/audioldm2/modeling_audioldm2.py b/src/diffusers/pipelines/audioldm2/modeling_audioldm2.py
index 546ae9239a..b6b40cd6e6 100644
--- a/src/diffusers/pipelines/audioldm2/modeling_audioldm2.py
+++ b/src/diffusers/pipelines/audioldm2/modeling_audioldm2.py
@@ -17,7 +17,6 @@ from typing import Any, Dict, List, Optional, Tuple, Union
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import UNet2DConditionLoadersMixin
diff --git a/src/diffusers/pipelines/auto_pipeline.py b/src/diffusers/pipelines/auto_pipeline.py
index d265bfdcaf..8a32d4c367 100644
--- a/src/diffusers/pipelines/auto_pipeline.py
+++ b/src/diffusers/pipelines/auto_pipeline.py
@@ -95,6 +95,7 @@ from .qwenimage import (
QwenImageControlNetPipeline,
QwenImageEditInpaintPipeline,
QwenImageEditPipeline,
+ QwenImageEditPlusPipeline,
QwenImageImg2ImgPipeline,
QwenImageInpaintPipeline,
QwenImagePipeline,
@@ -186,6 +187,7 @@ AUTO_IMAGE2IMAGE_PIPELINES_MAPPING = OrderedDict(
("flux-kontext", FluxKontextPipeline),
("qwenimage", QwenImageImg2ImgPipeline),
("qwenimage-edit", QwenImageEditPipeline),
+ ("qwenimage-edit-plus", QwenImageEditPlusPipeline),
]
)
diff --git a/src/diffusers/pipelines/blip_diffusion/modeling_blip2.py b/src/diffusers/pipelines/blip_diffusion/modeling_blip2.py
index 928698e442..b061ac2636 100644
--- a/src/diffusers/pipelines/blip_diffusion/modeling_blip2.py
+++ b/src/diffusers/pipelines/blip_diffusion/modeling_blip2.py
@@ -14,7 +14,6 @@
from typing import Optional, Tuple, Union
import torch
-import torch.utils.checkpoint
from torch import nn
from transformers import BertTokenizer
from transformers.activations import QuickGELUActivation as QuickGELU
diff --git a/src/diffusers/pipelines/cogview4/pipeline_cogview4.py b/src/diffusers/pipelines/cogview4/pipeline_cogview4.py
index 22510f5d9d..546718d57d 100644
--- a/src/diffusers/pipelines/cogview4/pipeline_cogview4.py
+++ b/src/diffusers/pipelines/cogview4/pipeline_cogview4.py
@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-import inspect
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import numpy as np
@@ -28,6 +27,7 @@ from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import is_torch_xla_available, logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
+from ..pipeline_utils import retrieve_timesteps
from .pipeline_output import CogView4PipelineOutput
@@ -67,73 +67,6 @@ def calculate_shift(
return mu
-def retrieve_timesteps(
- scheduler,
- num_inference_steps: Optional[int] = None,
- device: Optional[Union[str, torch.device]] = None,
- timesteps: Optional[List[int]] = None,
- sigmas: Optional[List[float]] = None,
- **kwargs,
-):
- r"""
- Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
- custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
- Args:
- scheduler (`SchedulerMixin`):
- The scheduler to get timesteps from.
- num_inference_steps (`int`):
- The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
- must be `None`.
- device (`str` or `torch.device`, *optional*):
- The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
- timesteps (`List[int]`, *optional*):
- Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
- `num_inference_steps` and `sigmas` must be `None`.
- sigmas (`List[float]`, *optional*):
- Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
- `num_inference_steps` and `timesteps` must be `None`.
-
- Returns:
- `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
- second element is the number of inference steps.
- """
- accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
- accepts_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
-
- if timesteps is not None and sigmas is not None:
- if not accepts_timesteps and not accepts_sigmas:
- raise ValueError(
- f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
- f" timestep or sigma schedules. Please check whether you are using the correct scheduler."
- )
- scheduler.set_timesteps(timesteps=timesteps, sigmas=sigmas, device=device, **kwargs)
- timesteps = scheduler.timesteps
- num_inference_steps = len(timesteps)
- elif timesteps is not None and sigmas is None:
- if not accepts_timesteps:
- raise ValueError(
- f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
- f" timestep schedules. Please check whether you are using the correct scheduler."
- )
- scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
- timesteps = scheduler.timesteps
- num_inference_steps = len(timesteps)
- elif timesteps is None and sigmas is not None:
- if not accepts_sigmas:
- raise ValueError(
- f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
- f" sigmas schedules. Please check whether you are using the correct scheduler."
- )
- scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
- timesteps = scheduler.timesteps
- num_inference_steps = len(timesteps)
- else:
- scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
- timesteps = scheduler.timesteps
- return timesteps, num_inference_steps
-
-
class CogView4Pipeline(DiffusionPipeline, CogView4LoraLoaderMixin):
r"""
Pipeline for text-to-image generation using CogView4.
diff --git a/src/diffusers/pipelines/cogview4/pipeline_cogview4_control.py b/src/diffusers/pipelines/cogview4/pipeline_cogview4_control.py
index e26b7ba415..33617b3d17 100644
--- a/src/diffusers/pipelines/cogview4/pipeline_cogview4_control.py
+++ b/src/diffusers/pipelines/cogview4/pipeline_cogview4_control.py
@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-import inspect
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import numpy as np
@@ -27,6 +26,7 @@ from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import is_torch_xla_available, logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
+from ..pipeline_utils import retrieve_timesteps
from .pipeline_output import CogView4PipelineOutput
@@ -68,74 +68,6 @@ def calculate_shift(
return mu
-# Copied from diffusers.pipelines.cogview4.pipeline_cogview4.retrieve_timesteps
-def retrieve_timesteps(
- scheduler,
- num_inference_steps: Optional[int] = None,
- device: Optional[Union[str, torch.device]] = None,
- timesteps: Optional[List[int]] = None,
- sigmas: Optional[List[float]] = None,
- **kwargs,
-):
- r"""
- Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
- custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
- Args:
- scheduler (`SchedulerMixin`):
- The scheduler to get timesteps from.
- num_inference_steps (`int`):
- The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
- must be `None`.
- device (`str` or `torch.device`, *optional*):
- The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
- timesteps (`List[int]`, *optional*):
- Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
- `num_inference_steps` and `sigmas` must be `None`.
- sigmas (`List[float]`, *optional*):
- Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
- `num_inference_steps` and `timesteps` must be `None`.
-
- Returns:
- `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
- second element is the number of inference steps.
- """
- accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
- accepts_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
-
- if timesteps is not None and sigmas is not None:
- if not accepts_timesteps and not accepts_sigmas:
- raise ValueError(
- f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
- f" timestep or sigma schedules. Please check whether you are using the correct scheduler."
- )
- scheduler.set_timesteps(timesteps=timesteps, sigmas=sigmas, device=device, **kwargs)
- timesteps = scheduler.timesteps
- num_inference_steps = len(timesteps)
- elif timesteps is not None and sigmas is None:
- if not accepts_timesteps:
- raise ValueError(
- f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
- f" timestep schedules. Please check whether you are using the correct scheduler."
- )
- scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
- timesteps = scheduler.timesteps
- num_inference_steps = len(timesteps)
- elif timesteps is None and sigmas is not None:
- if not accepts_sigmas:
- raise ValueError(
- f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
- f" sigmas schedules. Please check whether you are using the correct scheduler."
- )
- scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
- timesteps = scheduler.timesteps
- num_inference_steps = len(timesteps)
- else:
- scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
- timesteps = scheduler.timesteps
- return timesteps, num_inference_steps
-
-
class CogView4ControlPipeline(DiffusionPipeline):
r"""
Pipeline for text-to-image generation using CogView4.
diff --git a/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py b/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py
index 2beb0be57b..034a022641 100644
--- a/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py
+++ b/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py
@@ -18,7 +18,6 @@ from typing import Callable, List, Optional, Union
import numpy as np
import PIL.Image
import torch
-import torch.utils.checkpoint
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
from ....image_processor import VaeImageProcessor
diff --git a/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py b/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py
index adfd899e76..2f54f4fc98 100644
--- a/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py
+++ b/src/diffusers/pipelines/deprecated/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py
@@ -16,7 +16,6 @@ import inspect
from typing import Callable, List, Optional, Union
import torch
-import torch.utils.checkpoint
from transformers import CLIPImageProcessor, CLIPTextModelWithProjection, CLIPTokenizer
from ....image_processor import VaeImageProcessor
diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py
index bc50835d19..f1bf4701e3 100644
--- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py
+++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py
@@ -17,7 +17,6 @@ from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from transformers import PretrainedConfig, PreTrainedModel, PreTrainedTokenizer
from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutput
diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py
index 273e97f1ec..631539e5c6 100644
--- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py
+++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py
@@ -4,7 +4,6 @@ from typing import List, Optional, Tuple, Union
import numpy as np
import PIL.Image
import torch
-import torch.utils.checkpoint
from ...models import UNet2DModel, VQModel
from ...schedulers import (
diff --git a/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py b/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py
index b76afb4ec9..a4768093a2 100644
--- a/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py
+++ b/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py
@@ -107,6 +107,38 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
result = torch.lerp(latents, result, factor)
return result
+ def tone_map_latents(self, latents: torch.Tensor, compression: float) -> torch.Tensor:
+ """
+ Applies a non-linear tone-mapping function to latent values to reduce their dynamic range in a perceptually
+ smooth way using a sigmoid-based compression.
+
+ This is useful for regularizing high-variance latents or for conditioning outputs during generation, especially
+ when controlling dynamic behavior with a `compression` factor.
+
+ Args:
+ latents : torch.Tensor
+ Input latent tensor with arbitrary shape. Expected to be roughly in [-1, 1] or [0, 1] range.
+ compression : float
+ Compression strength in the range [0, 1].
+ - 0.0: No tone-mapping (identity transform)
+ - 1.0: Full compression effect
+
+ Returns:
+ torch.Tensor
+ The tone-mapped latent tensor of the same shape as input.
+ """
+ # Remap [0-1] to [0-0.75] and apply sigmoid compression in one shot
+ scale_factor = compression * 0.75
+ abs_latents = torch.abs(latents)
+
+ # Sigmoid compression: sigmoid shifts large values toward 0.2, small values stay ~1.0
+ # When scale_factor=0, sigmoid term vanishes, when scale_factor=0.75, full effect
+ sigmoid_term = torch.sigmoid(4.0 * scale_factor * (abs_latents - 1.0))
+ scales = 1.0 - 0.8 * scale_factor * sigmoid_term
+
+ filtered = latents * scales
+ return filtered
+
@staticmethod
# Copied from diffusers.pipelines.ltx.pipeline_ltx.LTXPipeline._normalize_latents
def _normalize_latents(
@@ -182,7 +214,7 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
)
self.vae.disable_tiling()
- def check_inputs(self, video, height, width, latents):
+ def check_inputs(self, video, height, width, latents, tone_map_compression_ratio):
if height % self.vae_spatial_compression_ratio != 0 or width % self.vae_spatial_compression_ratio != 0:
raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")
@@ -191,6 +223,9 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
if video is None and latents is None:
raise ValueError("One of `video` or `latents` has to be provided.")
+ if not (0 <= tone_map_compression_ratio <= 1):
+ raise ValueError("`tone_map_compression_ratio` must be in the range [0, 1]")
+
@torch.no_grad()
def __call__(
self,
@@ -201,6 +236,7 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
decode_timestep: Union[float, List[float]] = 0.0,
decode_noise_scale: Optional[Union[float, List[float]]] = None,
adain_factor: float = 0.0,
+ tone_map_compression_ratio: float = 0.0,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
@@ -210,6 +246,7 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
height=height,
width=width,
latents=latents,
+ tone_map_compression_ratio=tone_map_compression_ratio,
)
if video is not None:
@@ -252,6 +289,9 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
else:
latents = latents_upsampled
+ if tone_map_compression_ratio > 0.0:
+ latents = self.tone_map_latents(latents, tone_map_compression_ratio)
+
if output_type == "latent":
latents = self._normalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
diff --git a/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py b/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py
index da991aefbd..92ec16fd45 100644
--- a/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py
+++ b/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py
@@ -86,15 +86,14 @@ class MarigoldDepthOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
- Predicted depth maps with values in the range [0, 1]. The shape is $numimages \times 1 \times height \times
- width$ for `torch.Tensor` or $numimages \times height \times width \times 1$ for `np.ndarray`.
+ Predicted depth maps with values in the range [0, 1]. The shape is `numimages × 1 × height × width` for
+ `torch.Tensor` or `numimages × height × width × 1` for `np.ndarray`.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
- Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages
- \times 1 \times height \times width$ for `torch.Tensor` or $numimages \times height \times width \times 1$
- for `np.ndarray`.
+ Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `numimages × 1 ×
+ height × width` for `torch.Tensor` or `numimages × height × width × 1` for `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
- The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
+ The shape is `numimages * numensemble × 4 × latentheight × latentwidth`.
"""
prediction: Union[np.ndarray, torch.Tensor]
diff --git a/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py b/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py
index c809de18f4..bef9ca77c7 100644
--- a/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py
+++ b/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py
@@ -99,17 +99,17 @@ class MarigoldIntrinsicsOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
- Predicted image intrinsics with values in the range [0, 1]. The shape is $(numimages * numtargets) \times 3
- \times height \times width$ for `torch.Tensor` or $(numimages * numtargets) \times height \times width
- \times 3$ for `np.ndarray`, where `numtargets` corresponds to the number of predicted target modalities of
- the intrinsic image decomposition.
+ Predicted image intrinsics with values in the range [0, 1]. The shape is `(numimages * numtargets) × 3 ×
+ height × width` for `torch.Tensor` or `(numimages * numtargets) × height × width × 3` for `np.ndarray`,
+ where `numtargets` corresponds to the number of predicted target modalities of the intrinsic image
+ decomposition.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
- Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $(numimages *
- numtargets) \times 3 \times height \times width$ for `torch.Tensor` or $(numimages * numtargets) \times
- height \times width \times 3$ for `np.ndarray`.
+ Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `(numimages *
+ numtargets) × 3 × height × width` for `torch.Tensor` or `(numimages * numtargets) × height × width × 3` for
+ `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
- The shape is $(numimages * numensemble) \times (numtargets * 4) \times latentheight \times latentwidth$.
+ The shape is `(numimages * numensemble) × (numtargets * 4) × latentheight × latentwidth`.
"""
prediction: Union[np.ndarray, torch.Tensor]
diff --git a/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py b/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py
index 192ed590a4..485a39c995 100644
--- a/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py
+++ b/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py
@@ -81,15 +81,14 @@ class MarigoldNormalsOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
- Predicted normals with values in the range [-1, 1]. The shape is $numimages \times 3 \times height \times
- width$ for `torch.Tensor` or $numimages \times height \times width \times 3$ for `np.ndarray`.
+ Predicted normals with values in the range [-1, 1]. The shape is `numimages × 3 × height × width` for
+ `torch.Tensor` or `numimages × height × width × 3` for `np.ndarray`.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
- Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages
- \times 1 \times height \times width$ for `torch.Tensor` or $numimages \times height \times width \times 1$
- for `np.ndarray`.
+ Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `numimages × 1 ×
+ height × width` for `torch.Tensor` or `numimages × height × width × 1` for `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
- The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
+ The shape is `numimages * numensemble × 4 × latentheight × latentwidth`.
"""
prediction: Union[np.ndarray, torch.Tensor]
diff --git a/src/diffusers/pipelines/pipeline_loading_utils.py b/src/diffusers/pipelines/pipeline_loading_utils.py
index b7a3e08105..dd542145d3 100644
--- a/src/diffusers/pipelines/pipeline_loading_utils.py
+++ b/src/diffusers/pipelines/pipeline_loading_utils.py
@@ -838,6 +838,9 @@ def load_sub_model(
else:
loading_kwargs["low_cpu_mem_usage"] = False
+ if is_transformers_model and is_transformers_version(">=", "4.57.0"):
+ loading_kwargs.pop("offload_state_dict")
+
if (
quantization_config is not None
and isinstance(quantization_config, PipelineQuantizationConfig)
diff --git a/src/diffusers/pipelines/stable_audio/modeling_stable_audio.py b/src/diffusers/pipelines/stable_audio/modeling_stable_audio.py
index 89d4d2dca5..07b382dfc4 100644
--- a/src/diffusers/pipelines/stable_audio/modeling_stable_audio.py
+++ b/src/diffusers/pipelines/stable_audio/modeling_stable_audio.py
@@ -18,7 +18,6 @@ from typing import Optional
import torch
import torch.nn as nn
-import torch.utils.checkpoint
from ...configuration_utils import ConfigMixin, register_to_config
from ...models.modeling_utils import ModelMixin
diff --git a/src/diffusers/pipelines/wan/pipeline_wan_video2video.py b/src/diffusers/pipelines/wan/pipeline_wan_video2video.py
index 4adbd71dac..19350734a7 100644
--- a/src/diffusers/pipelines/wan/pipeline_wan_video2video.py
+++ b/src/diffusers/pipelines/wan/pipeline_wan_video2video.py
@@ -49,7 +49,7 @@ EXAMPLE_DOC_STRING = """
Examples:
```python
>>> import torch
- >>> from diffusers.utils import export_to_video
+ >>> from diffusers.utils import export_to_video, load_video
>>> from diffusers import AutoencoderKLWan, WanVideoToVideoPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
diff --git a/src/diffusers/utils/dummy_torch_and_transformers_objects.py b/src/diffusers/utils/dummy_torch_and_transformers_objects.py
index bb8fea8c8a..9ed6250452 100644
--- a/src/diffusers/utils/dummy_torch_and_transformers_objects.py
+++ b/src/diffusers/utils/dummy_torch_and_transformers_objects.py
@@ -17,6 +17,36 @@ class FluxAutoBlocks(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
+class FluxKontextAutoBlocks(metaclass=DummyObject):
+ _backends = ["torch", "transformers"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch", "transformers"])
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+
+class FluxKontextModularPipeline(metaclass=DummyObject):
+ _backends = ["torch", "transformers"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch", "transformers"])
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+
class FluxModularPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
@@ -77,6 +107,36 @@ class QwenImageEditModularPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
+class QwenImageEditPlusAutoBlocks(metaclass=DummyObject):
+ _backends = ["torch", "transformers"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch", "transformers"])
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+
+class QwenImageEditPlusModularPipeline(metaclass=DummyObject):
+ _backends = ["torch", "transformers"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch", "transformers"])
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+
class QwenImageModularPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
diff --git a/src/diffusers/utils/import_utils.py b/src/diffusers/utils/import_utils.py
index 9399ccd2a7..97065267b0 100644
--- a/src/diffusers/utils/import_utils.py
+++ b/src/diffusers/utils/import_utils.py
@@ -21,6 +21,7 @@ import operator as op
import os
import sys
from collections import OrderedDict, defaultdict
+from functools import lru_cache as cache
from itertools import chain
from types import ModuleType
from typing import Any, Tuple, Union
@@ -673,6 +674,7 @@ def compare_versions(library_or_version: Union[str, Version], operation: str, re
# This function was copied from: https://github.com/huggingface/accelerate/blob/874c4967d94badd24f893064cc3bef45f57cadf7/src/accelerate/utils/versions.py#L338
+@cache
def is_torch_version(operation: str, version: str):
"""
Compares the current PyTorch version to a given reference with an operation.
@@ -686,6 +688,7 @@ def is_torch_version(operation: str, version: str):
return compare_versions(parse(_torch_version), operation, version)
+@cache
def is_torch_xla_version(operation: str, version: str):
"""
Compares the current torch_xla version to a given reference with an operation.
@@ -701,6 +704,7 @@ def is_torch_xla_version(operation: str, version: str):
return compare_versions(parse(_torch_xla_version), operation, version)
+@cache
def is_transformers_version(operation: str, version: str):
"""
Compares the current Transformers version to a given reference with an operation.
@@ -716,6 +720,7 @@ def is_transformers_version(operation: str, version: str):
return compare_versions(parse(_transformers_version), operation, version)
+@cache
def is_hf_hub_version(operation: str, version: str):
"""
Compares the current Hugging Face Hub version to a given reference with an operation.
@@ -731,6 +736,7 @@ def is_hf_hub_version(operation: str, version: str):
return compare_versions(parse(_hf_hub_version), operation, version)
+@cache
def is_accelerate_version(operation: str, version: str):
"""
Compares the current Accelerate version to a given reference with an operation.
@@ -746,6 +752,7 @@ def is_accelerate_version(operation: str, version: str):
return compare_versions(parse(_accelerate_version), operation, version)
+@cache
def is_peft_version(operation: str, version: str):
"""
Compares the current PEFT version to a given reference with an operation.
@@ -761,6 +768,7 @@ def is_peft_version(operation: str, version: str):
return compare_versions(parse(_peft_version), operation, version)
+@cache
def is_bitsandbytes_version(operation: str, version: str):
"""
Args:
@@ -775,6 +783,7 @@ def is_bitsandbytes_version(operation: str, version: str):
return compare_versions(parse(_bitsandbytes_version), operation, version)
+@cache
def is_gguf_version(operation: str, version: str):
"""
Compares the current Accelerate version to a given reference with an operation.
@@ -790,6 +799,7 @@ def is_gguf_version(operation: str, version: str):
return compare_versions(parse(_gguf_version), operation, version)
+@cache
def is_torchao_version(operation: str, version: str):
"""
Compares the current torchao version to a given reference with an operation.
@@ -805,6 +815,7 @@ def is_torchao_version(operation: str, version: str):
return compare_versions(parse(_torchao_version), operation, version)
+@cache
def is_k_diffusion_version(operation: str, version: str):
"""
Compares the current k-diffusion version to a given reference with an operation.
@@ -820,6 +831,7 @@ def is_k_diffusion_version(operation: str, version: str):
return compare_versions(parse(_k_diffusion_version), operation, version)
+@cache
def is_optimum_quanto_version(operation: str, version: str):
"""
Compares the current Accelerate version to a given reference with an operation.
@@ -835,6 +847,7 @@ def is_optimum_quanto_version(operation: str, version: str):
return compare_versions(parse(_optimum_quanto_version), operation, version)
+@cache
def is_nvidia_modelopt_version(operation: str, version: str):
"""
Compares the current Nvidia ModelOpt version to a given reference with an operation.
@@ -850,6 +863,7 @@ def is_nvidia_modelopt_version(operation: str, version: str):
return compare_versions(parse(_nvidia_modelopt_version), operation, version)
+@cache
def is_xformers_version(operation: str, version: str):
"""
Compares the current xformers version to a given reference with an operation.
@@ -865,6 +879,7 @@ def is_xformers_version(operation: str, version: str):
return compare_versions(parse(_xformers_version), operation, version)
+@cache
def is_sageattention_version(operation: str, version: str):
"""
Compares the current sageattention version to a given reference with an operation.
@@ -880,6 +895,7 @@ def is_sageattention_version(operation: str, version: str):
return compare_versions(parse(_sageattention_version), operation, version)
+@cache
def is_flash_attn_version(operation: str, version: str):
"""
Compares the current flash-attention version to a given reference with an operation.
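
The `@cache` decorators added above work because each `is_*_version` helper is a pure function of `(operation, version)` for a fixed installed package, so repeated calls on hot paths can reuse the parsed comparison instead of re-parsing version strings. A minimal sketch of the same pattern with a stand-in version string (the helper name and version below are illustrative, not from the library):

```py
import operator
from functools import lru_cache

from packaging.version import parse

_fake_installed_version = "4.57.0"  # stand-in for e.g. the installed transformers version


@lru_cache
def is_fake_lib_version(operation: str, version: str) -> bool:
    # Parse and compare once per (operation, version) pair; later calls hit the cache.
    ops = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "==": operator.eq}
    return ops[operation](parse(_fake_installed_version), parse(version))


print(is_fake_lib_version(">=", "4.57.0"))  # True; a second identical call is served from the cache
```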
diff --git a/tests/pipelines/audioldm2/test_audioldm2.py b/tests/pipelines/audioldm2/test_audioldm2.py
index e4bc5cc110..14ff1272a2 100644
--- a/tests/pipelines/audioldm2/test_audioldm2.py
+++ b/tests/pipelines/audioldm2/test_audioldm2.py
@@ -138,10 +138,8 @@ class AudioLDM2PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
patch_stride=2,
patch_embed_input_channels=4,
)
- text_encoder_config = ClapConfig.from_text_audio_configs(
- text_config=text_branch_config,
- audio_config=audio_branch_config,
- projection_dim=16,
+ text_encoder_config = ClapConfig(
+ text_config=text_branch_config, audio_config=audio_branch_config, projection_dim=16
)
text_encoder = ClapModel(text_encoder_config)
tokenizer = RobertaTokenizer.from_pretrained("hf-internal-testing/tiny-random-roberta", model_max_length=77)