* up
* convert dual unet
* revert dual attn
* adapt for vd-official
* test the full pipeline
* mixed inference
* mixed inference for text2img
* add image prompting
* fix clip norm
* split text2img and img2img
* fix format
* refactor text2img
* mega pipeline
* add optimus
* refactor image var
* wip text_unet
* text unet end to end
* update tests
* reshape
* fix image to text
* add some first docs
* dual guided pipeline
* fix token ratio
* propose change
* dual transformer as a native module
* DualTransformer(nn.Module)
* DualTransformer(nn.Module)
* correct unconditional image
* save-load with mega pipeline
* remove image to text
* up
* uP
* fix
* up
* final fix
* remove_unused_weights
* test updates
* save progress
* uP
* fix dual prompts
* some fixes
* finish
* style
* finish renaming
* up
* fix
* fix
* fix
* finish
Co-authored-by: anton-l <anton@huggingface.co>
* add conversion script for vae
* up
* up
* some fixes
* add text model
* use the correct config
* add docs
* move model into its own file
* pass attention mask to text encoder
* pass attn mask to uncond inputs
* quality
* fix image2image
* add image2image in init
* fix import
* fix one more import
* fix import, dummy objects
* fix copied from
* up
* finish
Co-authored-by: patil-suraj <surajp815@gmail.com>
* Changes for VQ-diffusion VQVAE
Add the option to specify the embedding dimension in VQModel:
`VQModel` by default sets the embedding dimension to the number
of latent channels. The VQ-diffusion VQVAE has a smaller
embedding dimension, 128, than its number of latent channels, 256.
Add AttnDownEncoderBlock2D and AttnUpDecoderBlock2D to the down and up
UNet block helpers. VQ-diffusion's VQVAE uses those two block types.
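A minimal sketch of the embedding-dimension change described above, in plain Python rather than the actual `VQModel` code. The helper names (`resolve_embed_dim`, `project`, `nearest_code`) and the tiny sizes are hypothetical and chosen only for illustration; the real model works on image tensors with 256 latent channels and 128-dimensional embeddings.

```python
from typing import List, Optional, Sequence


def resolve_embed_dim(latent_channels: int, embed_dim: Optional[int] = None) -> int:
    """Fall back to the number of latent channels when no embedding dimension is given."""
    return latent_channels if embed_dim is None else embed_dim


def project(vector: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Linear map from latent channels to the embedding dimension.

    Stand-in for a 1x1 convolution applied at a single latent pixel.
    """
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""

    def sq_dist(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, vector))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


# Toy sizes: 4 latent channels projected down to a 2-dimensional embedding.
latent_pixel = [1.0, 0.0, 2.0, 0.0]
weight = [
    [1.0, 0.0, 0.0, 0.0],  # embedding dim 0 reads latent channel 0
    [0.0, 0.0, 1.0, 0.0],  # embedding dim 1 reads latent channel 2
]
codebook = [[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]]

embedded = project(latent_pixel, weight)
code_index = nearest_code(embedded, codebook)
```

With the default (`embed_dim=None`) the codebook vectors have as many entries as there are latent channels; passing a smaller value means the latents are projected down before the codebook lookup.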
* Changes for VQ-diffusion transformer
Modify attention.py so SpatialTransformer can be used for
VQ-diffusion's transformer.
SpatialTransformer:
- Can now operate over discrete inputs (classes of vector embeddings) as well as continuous.
- `in_channels` was made optional in the constructor, so the two call sites that passed it as a positional arg now pass it as a kwarg
- modified forward pass to take optional timestep embeddings
ImagePositionalEmbeddings:
- added to provide positional embeddings to discrete inputs for latent pixels
BasicTransformerBlock:
- norm layers were made configurable so that VQ-diffusion can use AdaLayerNorm with timestep embeddings
- modified forward pass to take optional timestep embeddings
CrossAttention:
- now may optionally take a bias parameter for its query, key, and value linear layers
FeedForward:
- Internal layers are now configurable
ApproximateGELU:
- Activation function in VQ-diffusion's feedforward layer
AdaLayerNorm:
- Norm layer modified to incorporate timestep embeddings
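A rough sketch of the two smaller building blocks listed above, in plain Python on lists of floats. It assumes the common sigmoid approximation of GELU, `x * sigmoid(1.702 * x)`, and a timestep-conditioned norm of the form `norm(x) * (1 + scale) + shift`; both forms are assumptions for illustration and not a copy of the code in attention.py. How `scale` and `shift` are computed from the timestep embedding is left out, so they are passed in directly, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def approximate_gelu(x: float) -> float:
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return x * sigmoid(1.702 * x)


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize to zero mean and unit variance, with no learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ada_layer_norm(
    x: Sequence[float],
    scale: Sequence[float],
    shift: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """Layer norm whose scale and shift come from a timestep embedding.

    `scale` and `shift` are assumed to be produced elsewhere from the timestep.
    """
    normed = layer_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


hidden = [1.0, 2.0, 3.0]
activated = [approximate_gelu(v) for v in hidden]
modulated = ada_layer_norm(hidden, scale=[1.0, 1.0, 1.0], shift=[0.5, 0.5, 0.5])
```

With zero `scale` and `shift`, `ada_layer_norm` reduces to a plain layer norm, which makes the timestep conditioning easy to check in isolation.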
* Add VQ-diffusion scheduler
* Add VQ-diffusion pipeline
* Add VQ-diffusion convert script to diffusers
* Add VQ-diffusion dummy objects
* Add VQ-diffusion markdown docs
* Add VQ-diffusion tests
* some renaming
* some fixes
* more renaming
* correct
* fix typo
* correct weights
* finalize
* fix tests
* Apply suggestions from code review
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
* Apply suggestions from code review
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* finish
* finish
* up
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>