<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# GGUF

The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints pre-quantized and saved in the GGUF format via `from_single_file` loading with model classes. Loading GGUF checkpoints via pipelines is currently not supported.

The following example loads the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.

Before you begin, install gguf in your environment:

```shell
pip install -U gguf
```

Since GGUF is a single-file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].

When using GGUF checkpoints, the quantized weights remain in a low-memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.

The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).

```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Load the Q2_K-quantized transformer directly from the GGUF file
ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Build the full pipeline around the quantized transformer
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```
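
To confirm the storage behavior described above, you can inspect the loaded transformer's parameters. The snippet below is a minimal sketch, assuming (as noted above) that quantized tensors are exposed with their `torch.uint8` storage dtype; it simply reports the first quantized parameter it finds.

```python
# Minimal sketch: quantized GGUF weights keep their packed low-memory storage
# dtype (assumed here to be torch.uint8) and are only dequantized to
# `compute_dtype` during each module's forward pass.
for name, param in transformer.named_parameters():
    if param.dtype == torch.uint8:
        print(f"{name}: stored as {param.dtype} ({param.numel()} packed bytes)")
        break
```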
## Supported Quantization Types

- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
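
Any of the variants listed above can be loaded the same way by pointing [`~FromSingleFileMixin.from_single_file`] at the corresponding `.gguf` file. The sketch below swaps in a Q8_0 checkpoint; the exact filename is an assumption based on the naming pattern of the repository used earlier, so check the repository's file listing before running it.

```python
import torch

from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Assumed filename following the repository's naming pattern; verify it exists.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```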