* find & replace all FloatTensors with Tensor
* apply formatting
* Update torch.FloatTensor to torch.Tensor in the remaining files
* formatting
* Fix the remaining places where FloatTensor is used, including in the documentation
* formatting
* Update new file from FloatTensor to Tensor
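A minimal sketch of the annotation change these commits roll out (the function name and constant here are illustrative, not taken from the diff):

```python
import torch

# Before: the signature was tied to float32 tensors specifically
# def scale_latents(sample: torch.FloatTensor) -> torch.FloatTensor: ...

# After: the general torch.Tensor annotation, valid for any dtype/device
def scale_latents(sample: torch.Tensor) -> torch.Tensor:
    return sample * 0.18215  # SD VAE scaling factor, used here only as an example
```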
* Add Ascend NPU support for SDXL fine-tuning and fix the model saving bug when using DeepSpeed.
* fix code quality check
* Decouple the NPU flash attention and make it an independent module.
* add doc and unit tests for npu flash attention.
---------
Co-authored-by: mhh001 <mahonghao1@huawei.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
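A hedged usage sketch of the decoupled NPU flash-attention module; `AttnProcessorNPU` is the processor name this change appears to introduce, so verify it against your diffusers version. Running it requires `torch_npu` and Ascend hardware:

```python
from diffusers import UNet2DConditionModel
from diffusers.models.attention_processor import AttnProcessorNPU  # assumed name

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
# Route all attention layers through the NPU fused-attention kernel.
unet.set_attn_processor(AttnProcessorNPU())
```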
* chore: reducing model sizes
* chore: shrinks further
* chore: shrinks further
* chore: shrinking model for img2img pipeline
* chore: reducing size of model for inpaint pipeline
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* give it a shot.
* print.
* correct assertion.
* gather results from the rest of the tests.
* change the assertion values where needed.
* remove print statements.
* get device <-> component mapping when using multiple gpus.
* condition the device_map bits.
* relax condition
* device_map progress.
* device_map enhancement
* some cleaning up and debugging
* Apply suggestions from code review
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* incorporate suggestions from PR.
* remove multi-gpu condition for now.
* guard check the component -> device mapping
* fix: device_memory variable
* dispatching transformers model to have force_hooks=True
* better guarding for transformers device_map
* introduce support for balanced_low_memory and balanced_ultra_low_memory.
* remove device_map patch.
* fix: intermediate variable scoping.
* fix: condition in cpu offload.
* fix: flax class restrictions.
* remove modifications from cpu_offload and model_offload
* incorporate changes.
* add a simple forward pass test
* add: torch_device in get_inputs()
* add: tests
* remove print
* safe-guard to(), model offloading and cpu offloading when balanced is used as a device_map.
* style
* remove .
* safeguard device_map with more checks and remove invalid device_mapping strategies.
* make a class attribute and adjust tests accordingly.
* fix device_map check
* fix test
* adjust comment
* fix: device_map attribute
* fix: dispatching.
* max_memory test for pipeline
* version guard the tests
* fix guard.
* address review feedback.
* reset_device_map method.
* add: test for reset_hf_device_map
* fix a couple things.
* add reset_device_map() in the error message.
* add tests for checking reset_device_map doesn't have unintended consequences.
* fix reset_device_map and offloading tests.
* create _get_final_device_map utility.
* hf_device_map -> _hf_device_map
* add documentation
* add notes suggested by Marc.
* styling.
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* move updates within gpu condition.
* other docs related things
* note on ignoring a device not specified in max_memory.
* provide a suggestion if device mapping errors out.
* fix: typo.
* _hf_device_map -> hf_device_map
* Empty-Commit
* add: example hf_device_map.
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
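The commits above add device-mapped pipeline loading. A sketch of the resulting usage, assuming a multi-GPU host (the checkpoint is just an example):

```python
import torch
from diffusers import DiffusionPipeline

# "balanced" splits pipeline components evenly across the available GPUs.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    device_map="balanced",
)
print(pipe.hf_device_map)  # component -> device, e.g. {"unet": 0, "vae": 1, ...}

# .to() and the offloading helpers are guarded while a device map is active;
# reset the mapping first to move the pipeline manually.
pipe.reset_device_map()
pipe.to("cuda:0")
```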
* reduce block sizes for unet1d.
* reduce blocks for unet_2d.
* reduce block size for unet_motion
* increase channels.
* correctly increase channels.
* reduce number of layers in unet2dconditionmodel tests.
* reduce block sizes for unet2dconditionmodel tests
* reduce block sizes for unet3dconditionmodel.
* fix: test_feed_forward_chunking
* fix: test_forward_with_norm_groups
* skip spatiotemporal tests on MPS.
* reduce block size in AutoencoderKL.
* reduce block sizes for vqmodel.
* further reduce block size.
* make style.
* Empty-Commit
* reduce sizes for ConsistencyDecoderVAETests
* further reduction.
* further block reductions in AutoencoderKL and AsymmetricAutoencoderKL.
* massively reduce the block size in unet2dconditionmodel.
* reduce sizes for unet3d
* fix tests in unet3d.
* reduce blocks further in motion unet.
* fix: output shape
* add attention_head_dim to the test configuration.
* remove unexpected keyword arg
* up a bit.
* groups.
* up again
* fix
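The shrinking above converges on configs like the following hedged example: a `UNet2DConditionModel` small enough for fast tests while still exercising every block type (the values are illustrative, not the exact test numbers):

```python
from diffusers import UNet2DConditionModel

tiny_unet = UNet2DConditionModel(
    sample_size=16,
    in_channels=4,
    out_channels=4,
    block_out_channels=(4, 8),  # tiny channel counts
    layers_per_block=1,
    norm_num_groups=2,          # must divide every channel count above
    attention_head_dim=2,
    cross_attention_dim=8,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
)
print(sum(p.numel() for p in tiny_unet.parameters()))  # small parameter count
```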
* Skip `test_freeu_enabled` on MPS
* Small fixes
- import skip_mps correctly
- disable all instances of test_freeu_enabled
* Empty commit to trigger tests
* Empty commit to trigger CI
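For reference, the skip pattern looks roughly like this (the class and test body are stubs):

```python
import unittest

from diffusers.utils.testing_utils import skip_mps


class FreeUTests(unittest.TestCase):
    @skip_mps  # FreeU numerics diverge on Apple MPS, so skip there
    def test_freeu_enabled(self):
        ...
```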
* UniPC Multistep add `rescale_betas_zero_snr`
Same patch as DPM and Euler with the patched final alpha cumprod
BF16 doesn't seem to break down, I think because UniPC already upcasts during
some phases? We could still force an upcast since it only loses ≈0.005 it/s
for me, but the difference in output is very small. A better endeavor might be
upcasting in step() and removing all the other upcasts elsewhere?
* UniPC ZSNR UT
* Re-add `rescale_betas_zero_snr` doc (oops)
* UniPC UTs iterate solvers on FP16
It wasn't catching errors on order==3. Might be excessive?
* UniPC Multistep fix tensor dtype/device on order=3
* UniPC UTs Add v_pred to fp16 test iter
For completeness' sake. Probably overkill?
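Usage sketch of the new flag, enabled the same way as on DPM and Euler; the checkpoint here stands in for a ZSNR-trained one:

```python
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config,
    rescale_betas_zero_snr=True,     # the option these commits add
    prediction_type="v_prediction",  # ZSNR checkpoints are usually v-prediction
)
```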
* start printing the tensors.
* print full throttle
* set static slices for 7 tests.
* remove printing.
* flatten
* disable test for controlnet
* what happens when things are seeded properly?
* set the right value
* style.
* make pia test fail to check things
* print.
* fix pia.
* checking for animatediff.
* fix: animatediff.
* video synthesis
* final piece.
* style.
* print guess.
* fix: assertion for control guess.
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
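The pattern these debugging commits converge on, sketched self-contained with placeholder data (the real tests record the expected slice from a seeded pipeline run):

```python
import numpy as np
import torch

# Stand-in for a seeded pipeline output; the real tests build it via a
# get_inputs(torch_device) helper and a full pipeline call.
torch.manual_seed(0)
image = torch.rand(1, 64, 64, 3).numpy()

# Flatten a corner slice and pin it against hard-coded expected values.
image_slice = image[0, -3:, -3:, -1].flatten()
expected_slice = image_slice.copy()  # placeholder: real tests hard-code recorded numbers
assert np.abs(image_slice - expected_slice).max() < 1e-3
```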
* Add `final_sigma_zero` to UniPCMultistep
Effectively the same trick as DDIM's `set_alpha_to_one` and
DPM's `final_sigmas_type='zero'`.
Currently False by default but maybe this should be True?
* `final_sigma_zero: bool` -> `final_sigmas_type: str`
Should 1:1 match DPM Multistep now.
* Set `final_sigmas_type='sigma_min'` in UniPC UTs
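A quick sketch of the option: `"zero"` denoises the last step all the way to sigma = 0, while `"sigma_min"` preserves the previous behavior:

```python
from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler(final_sigmas_type="zero")
scheduler.set_timesteps(num_inference_steps=20)
print(scheduler.sigmas[-1])  # tensor(0.) with final_sigmas_type="zero"
```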
* Initial commit
* Implemented block lora
- implemented block lora
- updated docs
- added tests
* Finishing up
* Reverted unrelated changes made by make style
* Fixed typo
* Fixed bug + Made text_encoder_2 scalable
* Integrated some review feedback
* Incorporated review feedback
* Fix tests
* Made every module configurable
* Adapt to new lora test structure
* Final cleanup
* Some more final fixes
- Included examples in `using_peft_for_inference.md`
- Added hint that only attns are scaled
- Removed NoneTypes
- Added test to check mismatching lens of adapter names / weights raise error
* Update using_peft_for_inference.md
* Update using_peft_for_inference.md
* Make style, quality, fix-copies
* Updated tutorial; warn if scale/adapter mismatch
* floats are forwarded as-is; changed tutorial scale
* make style, quality, fix-copies
* Fixed typo in tutorial
* Moved some warnings into `lora_loader_utils.py`
* Moved scale/lora mismatch warnings back
* Integrated final review suggestions
* Empty commit to trigger CI
* Reverted empty commit to trigger CI
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
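The block-LoRA scaling this PR lands, per `using_peft_for_inference.md`, sketched with a placeholder LoRA path: floats are forwarded as-is, nested dicts target individual unet blocks, and only attention layers are scaled:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.load_lora_weights("path/to/lora", adapter_name="style")  # placeholder path

scales = {
    "text_encoder": 0.5,  # plain float: applied to all attn layers as-is
    "unet": {
        "down": 0.6,      # all down blocks
        "mid": 1.0,
        "up": {"block_0": 0.3, "block_1": [0.8, 0.9, 1.0]},  # per-block scales
    },
}
pipe.set_adapters("style", scales)
```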