<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Video Diffusion
[[open-in-colab]]
[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high resolution (576x1024) videos conditioned on an input image.
This guide will show you how to use SVD to generate short videos from images.
Before you begin, make sure you have the following libraries installed:
```py
!pip install -q -U diffusers transformers accelerate
```
์ด ๋ชจ๋ธ์—๋Š” [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid)์™€ [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) ๋‘ ๊ฐ€์ง€ ์ข…๋ฅ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. SVD ์ฒดํฌํฌ์ธํŠธ๋Š” 14๊ฐœ์˜ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ๊ณ , SVD-XT ์ฒดํฌํฌ์ธํŠธ๋Š” 25๊ฐœ์˜ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•˜๋„๋ก ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” SVD-XT ์ฒดํฌํฌ์ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
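If you prefer a GIF, like the preview shown below, you can use the `export_to_gif` utility from `diffusers.utils` in place of `export_to_video`:
```python
from diffusers.utils import export_to_gif

# Save the generated frames as an animated GIF instead of an MP4
export_to_gif(frames, "generated.gif")
```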
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"source image of a rocket"</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"generated video from source image"</figcaption>
</div>
</div>
## torch.compile
UNet์„ [์ปดํŒŒ์ผ](../optimization/torch2.0#torchcompile)ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ์‚ด์ง ์ฆ๊ฐ€ํ•˜์ง€๋งŒ, 20~25%์˜ ์†๋„ ํ–ฅ์ƒ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
```diff
- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
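Applied to the earlier example, the full setup could look like the sketch below. Note that the first call triggers compilation and is slow, so the speedup only shows up on subsequent calls:
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
# Keep the whole pipeline on the GPU; CPU offloading would move the compiled UNet between devices
pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# The first call compiles the UNet and is slow; later calls benefit from the speedup
frames = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(42)).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```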
## Reduce memory usage
Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for a lower memory requirement:
- enable model offloading: each component of the pipeline is offloaded to the CPU once it's not needed anymore.
- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (adjust this value based on your GPU memory), but the video might have some flickering.
```diff
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```
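Put together, a runnable low-memory version of the earlier example (with `decode_chunk_size` lowered to 1 for the smallest footprint) might look like:
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
# Offload each component to the CPU as soon as it's no longer needed
pipe.enable_model_cpu_offload()
# Run feed-forward layers in a loop over smaller chunks instead of one big batch
pipe.unet.enable_forward_chunking()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# decode_chunk_size=1 decodes one frame at a time for minimal memory, at the cost of some flickering
frames = pipe(image, decode_chunk_size=1, generator=torch.manual_seed(42), num_frames=25).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```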
Using all these tricks together should lower the memory requirement to less than 8GB VRAM.
## Micro-conditioning
Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:
- `fps`: the frames per second of the generated video.
- `motion_bucket_id`: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video.
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.
For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket_with_conditions.gif)
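The `fps` micro-conditioning parameter is passed the same way. Reusing the pipeline and image from the snippet above, and keeping it consistent with the export frame rate:
```python
# Condition the generation on 7 frames per second and export at the same rate
frames = pipe(image, decode_chunk_size=8, generator=generator, fps=7).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```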