From e417d028115e72b953a73e39d9687aa70ba3e37e Mon Sep 17 00:00:00 2001
From: Aryan
Date: Fri, 30 Aug 2024 13:53:25 +0530
Subject: [PATCH] [docs] Add a note on torchao/quanto benchmarks for CogVideoX
 and memory-efficient inference (#9296)

* add a note on torchao/quanto benchmarks and memory-efficient inference

* apply suggestions from review

* update

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* add note on enable sequential cpu offload

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/api/pipelines/cogvideox.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/source/en/api/pipelines/cogvideox.md b/docs/source/en/api/pipelines/cogvideox.md
index c7340eff40..4254246fee 100644
--- a/docs/source/en/api/pipelines/cogvideox.md
+++ b/docs/source/en/api/pipelines/cogvideox.md
@@ -77,10 +77,21 @@ CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds o
 - `pipe.enable_model_cpu_offload()`:
   - Without enabling cpu offloading, memory usage is `33 GB`
   - With enabling cpu offloading, memory usage is `19 GB`
+- `pipe.enable_sequential_cpu_offload()`:
+  - Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slower inference
+  - When enabled, memory usage is under `4 GB`
 - `pipe.vae.enable_tiling()`:
   - With enabling cpu offloading and tiling, memory usage is `11 GB`
 - `pipe.vae.enable_slicing()`
 
+### Quantized inference
+
+[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with lower VRAM!
+
+It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below.
+- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+
 ## CogVideoXPipeline
 
 [[autodoc]] CogVideoXPipeline
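
As a reference for the memory options listed in the patched section, below is a minimal sketch (not part of the patch itself) showing how the offloading and VAE optimizations can be combined. The model id, prompt, and generation arguments are illustrative, not the benchmarked configuration.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Pick one offloading strategy:
# - model offloading keeps whole submodels on CPU until needed (~19 GB)
# - sequential offloading is slower but keeps usage under ~4 GB
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

# Reduce the memory needed to decode the 49 generated frames.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",  # illustrative prompt
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```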
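
For the quantized-inference note, a minimal sketch of torchao int8 weight-only quantization applied to the text encoder, transformer, and VAE, loosely in the spirit of the linked gists; the model id, the `torch.compile` settings, and the choice of int8 weight-only quantization are assumptions here, and the gists remain the source for the actual benchmarked scripts.

```python
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from transformers import T5EncoderModel
from torchao.quantization import int8_weight_only, quantize_

model_id = "THUDM/CogVideoX-5b"

# Load each module and quantize its weights in place.
text_encoder = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, int8_weight_only())

transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, int8_weight_only())

vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, int8_weight_only())

pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Optional: torchao quantization composes with torch.compile for faster inference.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",  # illustrative prompt
    num_frames=49,
    guidance_scale=6.0,
).frames[0]
```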