From edef2da4e4cc74eb205c71896666ff83aef38e44 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Tue, 17 Jun 2025 16:30:24 +0530
Subject: [PATCH] add a tip for compile + offload

---
 docs/source/en/optimization/memory.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md
index 6b853a7a08..2d34961b8e 100644
--- a/docs/source/en/optimization/memory.md
+++ b/docs/source/en/optimization/memory.md
@@ -302,6 +302,13 @@ compute-bound, [group-offloading](#group-offloading) tends to be better. Group o
 
+<Tip>
+
+When using offloading, users can additionally compile the diffusion transformer/unet to get a
+good speed-memory trade-off. First set `torch._dynamo.config.cache_size_limit=1000`, and then before calling the pipeline, add `pipeline.transformer.compile()`.
+
+</Tip>
+
 ## Layerwise casting
 
 Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
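
For context, here is a minimal sketch of how the tip added by this patch could be applied in practice. The checkpoint, the prompt, and the choice of model offloading via `enable_model_cpu_offload()` are illustrative assumptions and not part of the patch; group offloading or a different pipeline with a `transformer` component would work the same way.

```py
import torch
from diffusers import DiffusionPipeline

# Raise the dynamo recompilation cache limit as suggested in the tip,
# since offloading hooks can trigger many recompilations.
torch._dynamo.config.cache_size_limit = 1000

# Illustrative checkpoint with a diffusion transformer.
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Keep components on the CPU and move them to the GPU only when needed.
pipeline.enable_model_cpu_offload()

# Compile the diffusion transformer before the first pipeline call.
pipeline.transformer.compile()

image = pipeline("a photo of an astronaut riding a horse", num_inference_steps=28).images[0]
image.save("astronaut.png")
```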