From cf1ca728eabb8354ce5be57cf4d97d503a01dbb9 Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Mon, 25 Aug 2025 21:12:06 +0530
Subject: [PATCH] fix title for compile + offload quantized models (#12233)

* up

* up

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/_toctree.yml                        | 2 +-
 docs/source/en/optimization/speed-memory-optims.md | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 42558b636c..fccec0a080 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -77,7 +77,7 @@
     - local: optimization/memory
       title: Reduce memory usage
     - local: optimization/speed-memory-optims
-      title: Compile and offloading quantized models
+      title: Compiling and offloading quantized models
   - title: Community optimizations
     sections:
       - local: optimization/pruna
diff --git a/docs/source/en/optimization/speed-memory-optims.md b/docs/source/en/optimization/speed-memory-optims.md
index f43e60bc74..80c6c79a3c 100644
--- a/docs/source/en/optimization/speed-memory-optims.md
+++ b/docs/source/en/optimization/speed-memory-optims.md
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
 
-# Compile and offloading quantized models
+# Compiling and offloading quantized models
 
 Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading).
 
@@ -28,7 +28,8 @@ The table below provides a comparison of optimization strategy combinations and
 | quantization | 32.602 | 14.9453 |
 | quantization, torch.compile | 25.847 | 14.9448 |
 | quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
-These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) if you're interested in evaluating your own model.
+
+These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the benchmarking script if you're interested in evaluating your own model.
 
 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.
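
The guide retitled by the second hunk combines bitsandbytes quantization, torch.compile, and CPU offloading on Flux. The following sketch, which is not part of the patch, illustrates that workflow; it assumes the diffusers `PipelineQuantizationConfig` API, the `black-forest-labs/FLUX.1-dev` checkpoint, and 4-bit settings chosen for illustration, so the exact components and values may differ from the guide's own example.

```py
# Illustrative sketch only: quantize, offload, and compile a Flux pipeline.
# The checkpoint, quantized components, and settings below are assumptions.
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Raise the dynamo cache limit so recompilations triggered by offloading
# (weights moving between CPU and GPU) don't hit the default cap.
torch._dynamo.config.cache_size_limit = 1000

# Quantize the memory-heavy components with bitsandbytes 4-bit.
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Keep only the active component on the GPU; the rest wait on the CPU.
pipeline.enable_model_cpu_offload()

# Compile the denoiser for faster inference.
pipeline.transformer.compile()

image = pipeline(
    "a photo of a cat holding a sign that says hello",
    num_inference_steps=28,
).images[0]
image.save("flux.png")
```

The ordering mirrors the strategy the guide describes: quantization is applied at load time, offloading is enabled on the assembled pipeline, and compilation comes last so it traces the already-quantized, offload-managed transformer.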