mirror of https://github.com/huggingface/diffusers.git synced 2026-01-29 07:22:12 +03:00

Files

Kashif Rasul dad5cb55e6 Fix QwenImage txt_seq_lens handling (#12702 )

* Fix QwenImage txt_seq_lens handling

* formatting

* formatting

* remove txt_seq_lens and use bool  mask

* use compute_text_seq_len_from_mask

* add seq_lens to dispatch_attention_fn

* use joint_seq_lens

* remove unused index_block

* WIP: Remove seq_lens parameter and use mask-based approach

- Remove seq_lens parameter from dispatch_attention_fn
- Update varlen backends to extract seqlens from masks
- Update QwenImage to pass 2D joint_attention_mask
- Fix native backend to handle 2D boolean masks
- Fix sage_varlen seqlens_q to match seqlens_k for self-attention

Note: sage_varlen still producing black images, needs further investigation

* fix formatting

* undo sage changes

* xformers support

* hub fix

* fix torch compile issues

* fix tests

* use _prepare_attn_mask_native

* proper deprecation notice

* add deprecate to txt_seq_lens

* Update src/diffusers/models/transformers/transformer_qwenimage.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/models/transformers/transformer_qwenimage.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Only create the mask if there's actual padding

* fix order of docstrings

* Adds performance benchmarks and optimization details for QwenImage

Enhances documentation with comprehensive performance insights for QwenImage pipeline:

* rope_text_seq_len = text_seq_len

* rename to max_txt_seq_len

* removed deprecated args

* undo unrelated change

* Updates QwenImage performance documentation

Removes detailed attention backend benchmarks and simplifies torch.compile performance description

Focuses on key performance improvement with torch.compile, highlighting the specific speedup from 4.70s to 1.93s on an A100 GPU

Streamlines the documentation to provide more concise and actionable performance insights

* Updates deprecation warnings for txt_seq_lens parameter

Extends deprecation timeline for txt_seq_lens from version 0.37.0 to 0.39.0 across multiple Qwen image-related models

Adds a new unit test to verify the deprecation warning behavior for the txt_seq_lens parameter

* fix compile

* formatting

* fix compile tests

* rename helper

* remove duplicate

* smaller values

* removed

* use torch.cond for torch compile

* Construct joint attention mask once

* test different backends

* construct joint attention mask once to avoid reconstructing in every block

* Update src/diffusers/models/attention_dispatch.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* formatting

* raising an error from the EditPlus pipeline when batch_size > 1

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: cdutr <dutra_carlos@hotmail.com>

2026-01-12 13:45:09 +05:30

6.2 KiB

Raw Permalink Blame History

QwenImage

Qwen-Image from the Qwen team is an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. Experiments show strong general capabilities in both image generation and editing, with exceptional performance in text rendering, especially for Chinese.

Qwen-Image comes in the following variants:

model type	model id
Qwen-Image	`Qwen/Qwen-Image`
Qwen-Image-Edit	`Qwen/Qwen-Image-Edit`
Qwen-Image-Edit Plus	Qwen/Qwen-Image-Edit-2509

Tip

Caching may also speed up inference by storing and reusing intermediate outputs.

LoRA for faster inference

Use a LoRA from lightx2v/Qwen-Image-Lightning to speed up inference by reducing the number of steps. Refer to the code snippet below:

Code

from diffusers import DiffusionPipeline, FlowMatchEulerDiscreteScheduler
import torch 
import math

ckpt_id = "Qwen/Qwen-Image"

# From
# https://github.com/ModelTC/Qwen-Image-Lightning/blob/342260e8f5468d2f24d084ce04f55e101007118b/generate_with_diffusers.py#L82C9-L97C10
scheduler_config = {
    "base_image_seq_len": 256,
    "base_shift": math.log(3),  # We use shift=3 in distillation
    "invert_sigmas": False,
    "max_image_seq_len": 8192,
    "max_shift": math.log(3),  # We use shift=3 in distillation
    "num_train_timesteps": 1000,
    "shift": 1.0,
    "shift_terminal": None,  # set shift_terminal to None
    "stochastic_sampling": False,
    "time_shift_type": "exponential",
    "use_beta_sigmas": False,
    "use_dynamic_shifting": True,
    "use_exponential_sigmas": False,
    "use_karras_sigmas": False,
}
scheduler = FlowMatchEulerDiscreteScheduler.from_config(scheduler_config)
pipe = DiffusionPipeline.from_pretrained(
    ckpt_id, scheduler=scheduler, torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-8steps-V1.0.safetensors"
)

prompt = "a tiny astronaut hatching from an egg on the moon, Ultra HD, 4K, cinematic composition."
negative_prompt = " "
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=8,
    true_cfg_scale=1.0,
    generator=torch.manual_seed(0),
).images[0]
image.save("qwen_fewsteps.png")

Tip

The guidance_scale parameter in the pipeline is there to support future guidance-distilled models when they come up. Note that passing guidance_scale to the pipeline is ineffective. To enable classifier-free guidance, please pass true_cfg_scale and negative_prompt (even an empty negative prompt like " ") should enable classifier-free guidance computations.

Multi-image reference with QwenImageEditPlusPipeline

With [QwenImageEditPlusPipeline], one can provide multiple images as input reference.

import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline
from diffusers.utils import load_image

pipe = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")

image_1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/grumpy.jpg")
image_2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peng.png")
image = pipe(
    image=[image_1, image_2],
    prompt='''put the penguin and the cat at a game show called "Qwen Edit Plus Games"''',
    num_inference_steps=50
).images[0]

Performance

torch.compile

Using torch.compile on the transformer provides ~2.4x speedup (A100 80GB: 4.70s → 1.93s):

import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

# First call triggers compilation (~7s overhead)
# Subsequent calls run at ~2.4x faster
image = pipe("a cat", num_inference_steps=50).images[0]

Batched Inference with Variable-Length Prompts

When using classifier-free guidance (CFG) with prompts of different lengths, the pipeline properly handles padding through attention masking. This ensures padding tokens do not influence the generated output.

# CFG with different prompt lengths works correctly
image = pipe(
    prompt="A cat",
    negative_prompt="blurry, low quality, distorted",
    true_cfg_scale=3.5,
    num_inference_steps=50,
).images[0]

For detailed benchmark scripts and results, see this gist.

QwenImagePipeline

autodoc QwenImagePipeline

all
call

QwenImageImg2ImgPipeline

autodoc QwenImageImg2ImgPipeline

all
call

QwenImageInpaintPipeline

autodoc QwenImageInpaintPipeline

all
call

QwenImageEditPipeline

autodoc QwenImageEditPipeline

all
call

QwenImageEditInpaintPipeline

autodoc QwenImageEditInpaintPipeline

all
call

QwenImageControlNetPipeline

autodoc QwenImageControlNetPipeline

all
call

QwenImageEditPlusPipeline

autodoc QwenImageEditPlusPipeline

all
call

QwenImagePipelineOutput

autodoc pipelines.qwenimage.pipeline_output.QwenImagePipelineOutput

6.2 KiB Raw Permalink Blame History