From abc47dece65f9e1db040549f6f02516ac4718dea Mon Sep 17 00:00:00 2001
From: Patrick von Platen
Date: Fri, 15 Sep 2023 12:51:36 +0200
Subject: [PATCH] [SDXL, Docs] Textual inversion (#5039)

* [SDXL, Docs] Textual inversion

* Update docs/source/en/using-diffusers/sdxl.md

* finish

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/using-diffusers/sdxl.md        |  4 +-
 .../textual_inversion_inference.md            | 49 +++++++++++++++++++
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md
index 557ef628a4..ebfee0b8e0 100644
--- a/docs/source/en/using-diffusers/sdxl.md
+++ b/docs/source/en/using-diffusers/sdxl.md
@@ -397,6 +397,8 @@ image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
 
 generated image of an astronaut in a jungle in the style of a van gogh painting
 
+The dual text encoders also support textual inversion embeddings, which need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section.
+
 ## Optimizations
 
 SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference.
@@ -426,4 +428,4 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y
 
 ## Other resources
 
-If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers.
\ No newline at end of file
+If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers.
diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md
index 6771343fc5..0ca4ecc58d 100644
--- a/docs/source/en/using-diffusers/textual_inversion_inference.md
+++ b/docs/source/en/using-diffusers/textual_inversion_inference.md
@@ -28,6 +28,8 @@ from diffusers.utils import make_image_grid
 from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
 ```
 
+## Stable Diffusion 1 and 2
+
 Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
 
 ```py
@@ -69,3 +71,50 @@ grid
+
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders, so you'll need two textual inversion embeddings, one for each text encoder model.
+
+Let's download the SDXL textual inversion embeddings and take a closer look at their structure:
+
+```py
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+
+file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
+state_dict = load_file(file)
+state_dict
+```
+
+```
+{'clip_g': tensor([[ 0.0077, -0.0112,  0.0065,  ...,  0.0195,  0.0159,  0.0275],
+        ...,
+        [-0.0170,  0.0213,  0.0143,  ..., -0.0302, -0.0240, -0.0362]]),
+ 'clip_l': tensor([[ 0.0023,  0.0192,  0.0213,  ..., -0.0385,  0.0048, -0.0011],
+        ...,
+        [ 0.0475, -0.0508, -0.0145,  ...,  0.0070, -0.0089, -0.0163]])}
+```
+
+There are two tensors, `"clip_g"` and `"clip_l"`.
+`"clip_g"` corresponds to the larger text encoder in SDXL, `pipe.text_encoder_2`,
+while `"clip_l"` corresponds to `pipe.text_encoder`.
+
+Now you can load each tensor separately by passing it along with the correct text encoder and tokenizer
+to [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`]:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.to("cuda")
+
+# "clip_g" belongs to the larger text encoder and its tokenizer
+pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
+# "clip_l" belongs to the original CLIP text encoder and its tokenizer
+pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
+
+# the embedding is meant to be used as a negative embedding, so pass its token as the negative prompt
+generator = torch.Generator().manual_seed(33)
+image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
+```
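+
+To see the effect of the negative embedding, you can rerun the same prompt and seed without it and compare the two images. The snippet below is a minimal sketch (the variable name `image_without` is just for illustration) that reuses the `make_image_grid` helper imported at the top of this guide:
+
+```py
+from diffusers.utils import make_image_grid
+
+# same prompt and seed as above, but without the negative prompt
+generator = torch.Generator().manual_seed(33)
+image_without = pipe("a woman standing in front of a mountain", generator=generator).images[0]
+
+# left: without the negative embedding, right: with it
+make_image_grid([image_without, image], rows=1, cols=2)
+```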