Merge pull request #137 from huggingface/main
Merge changes
Skquark authored Jan 6, 2024
2 parents 17fe861 + 774f5c4 commit fbb26cd
Showing 60 changed files with 4,108 additions and 318 deletions.
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -160,6 +160,8 @@
    title: xFormers
  - local: optimization/tome
    title: Token merging
+ - local: optimization/deepcache
+   title: DeepCache
  title: General optimizations
- sections:
  - local: using-diffusers/stable_diffusion_jax_how_to
@@ -210,6 +212,8 @@
    title: Textual Inversion
  - local: api/loaders/unet
    title: UNet
+ - local: api/loaders/peft
+   title: PEFT
  title: Loaders
- sections:
  - local: api/models/overview
25 changes: 25 additions & 0 deletions docs/source/en/api/loaders/peft.md
@@ -0,0 +1,25 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# PEFT

Diffusers supports working with adapters (such as [LoRA](../../using-diffusers/loading_adapters)) via the [`peft` library](https://huggingface.co/docs/peft/index). We provide a `PeftAdapterMixin` class to handle this for modeling classes in Diffusers (such as [`UNet2DConditionModel`]).

<Tip>

Refer to [this doc](../../tutorials/using_peft_for_inference.md) to get an overview of how to work with `peft` in Diffusers for inference.

</Tip>

## PeftAdapterMixin

[[autodoc]] loaders.peft.PeftAdapterMixin
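For a quick sense of the workflow this mixin enables, here is a minimal sketch that loads a LoRA into an SDXL pipeline through the `peft` integration; the base checkpoint, LoRA repository, and adapter name are illustrative choices rather than anything prescribed by this page:

```python
import torch
from diffusers import DiffusionPipeline

# The pipeline's UNet inherits from PeftAdapterMixin, so LoRA loading is routed through peft.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load a publicly available SDXL LoRA (example repository) and register it under a name.
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")

# Activate the adapter and scale its contribution.
pipe.set_adapters(["toy"], adapter_weights=[0.8])

image = pipe("toy face of an astronaut", num_inference_steps=30).images[0]
image.save("toy_astronaut.png")
```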
2 changes: 1 addition & 1 deletion docs/source/en/api/pipelines/amused.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

# aMUSEd

- aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2302.05543) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.
+ aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.

Amused is a lightweight text-to-image model based on the [MUSE](https://arxiv.org/abs/2301.00704) architecture. It is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once.
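As a rough illustration of that use case, a minimal generation sketch might look like the following; the `amused/amused-512` checkpoint name and the low step count are assumptions based on the model's published checkpoints rather than something stated above:

```python
from diffusers import AmusedPipeline

# Load the 512x512 aMUSEd checkpoint (a 256x256 variant is also published).
pipe = AmusedPipeline.from_pretrained("amused/amused-512").to("cuda")

# aMUSEd needs far fewer sampling steps than typical latent diffusion models.
image = pipe("a cozy cabin in snowy woods", num_inference_steps=12).images[0]
image.save("cabin.png")
```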

62 changes: 62 additions & 0 deletions docs/source/en/optimization/deepcache.md
@@ -0,0 +1,62 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# DeepCache
[DeepCache](https://huggingface.co/papers/2312.00858) accelerates [`StableDiffusionPipeline`] and [`StableDiffusionXLPipeline`] by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the U-Net architecture.

Start by installing [DeepCache](https://github.com/horseee/DeepCache):
```bash
pip install DeepCache
```

Then load and enable the [`DeepCacheSDHelper`](https://github.com/horseee/DeepCache#usage):

```diff
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16).to("cuda")

+ from DeepCache import DeepCacheSDHelper
+ helper = DeepCacheSDHelper(pipe=pipe)
+ helper.set_params(
+ cache_interval=3,
+ cache_branch_id=0,
+ )
+ helper.enable()

image = pipe("a photo of an astronaut on a moon").images[0]
```

The `set_params` method accepts two arguments: `cache_interval` and `cache_branch_id`. `cache_interval` controls the frequency of feature caching, specified as the number of steps between each cache operation. `cache_branch_id` identifies which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching processes.
Opting for a lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality (ablation experiments of these two hyperparameters can be found in the [paper](https://arxiv.org/abs/2312.00858)). Once those arguments are set, use the `enable` or `disable` methods to activate or deactivate the `DeepCacheSDHelper`.
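Because caching is controlled entirely through the helper, you can toggle it around individual calls. A small sketch, reusing the `pipe` and `helper` objects from the snippet above:

```python
# Run the original, uncached pipeline for comparison.
helper.disable()
original = pipe("a photo of an astronaut on a moon").images[0]

# Cache less frequently (every 5 steps) and re-enable DeepCache.
helper.set_params(cache_interval=5, cache_branch_id=0)
helper.enable()
accelerated = pipe("a photo of an astronaut on a moon").images[0]
```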

<div class="flex justify-center">
<img src="https://github.com/horseee/Diffusion_DeepCache/raw/master/static/images/example.png">
</div>

You can find more generated samples (original pipeline vs DeepCache) and the corresponding inference latency in the [WandB report](https://wandb.ai/horseee/DeepCache/runs/jwlsqqgt?workspace=user-horseee). The prompts are randomly selected from the [MS-COCO 2017](https://cocodataset.org/#home) dataset.

## Benchmark

We tested how much faster DeepCache accelerates [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) with 50 inference steps on an NVIDIA RTX A5000, using different configurations for resolution, batch size, cache interval (I), and cache branch (B).

| **Resolution** | **Batch size** | **Original** | **DeepCache(I=3, B=0)** | **DeepCache(I=5, B=0)** | **DeepCache(I=5, B=1)** |
|----------------|----------------|--------------|-------------------------|-------------------------|-------------------------|
| 512| 8| 15.96| 6.88(2.32x)| 5.03(3.18x)| 7.27(2.20x)|
| | 4| 8.39| 3.60(2.33x)| 2.62(3.21x)| 3.75(2.24x)|
| | 1| 2.61| 1.12(2.33x)| 0.81(3.24x)| 1.11(2.35x)|
| 768| 8| 43.58| 18.99(2.29x)| 13.96(3.12x)| 21.27(2.05x)|
| | 4| 22.24| 9.67(2.30x)| 7.10(3.13x)| 10.74(2.07x)|
| | 1| 6.33| 2.72(2.33x)| 1.97(3.21x)| 2.98(2.12x)|
| 1024| 8| 101.95| 45.57(2.24x)| 33.72(3.02x)| 53.00(1.92x)|
| | 4| 49.25| 21.86(2.25x)| 16.19(3.04x)| 25.78(1.91x)|
| | 1| 13.83| 6.07(2.28x)| 4.43(3.12x)| 7.15(1.93x)|
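The exact benchmarking script isn't reproduced here, but a single cell of the table can be approximated with a sketch along the following lines; the CUDA-event timing and the warmup call are our own choices:

```python
import torch
from diffusers import StableDiffusionPipeline
from DeepCache import DeepCacheSDHelper

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)  # the I=3, B=0 column
helper.enable()

prompt = "a photo of an astronaut on a moon"
pipe(prompt, num_inference_steps=50)  # warmup

start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
pipe(prompt, num_inference_steps=50, height=512, width=512, num_images_per_prompt=4)
end.record()
torch.cuda.synchronize()
print(f"Latency: {start.elapsed_time(end) / 1000:.2f}s")  # compare against the 512 / batch-size-4 row
```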
14 changes: 7 additions & 7 deletions docs/source/en/tutorials/fast_diffusion.md
@@ -14,15 +14,15 @@ specific language governing permissions and limitations under the License.

Diffusion models are known to be slower than their counterparts, GANs, because of the iterative and sequential reverse diffusion process. Recent works try to address this limitation with:

- * progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora.md))
+ * progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora))
* model compression (such as [SSD-1B](https://huggingface.co/segmind/SSD-1B))
* reusing adjacent features of the denoiser (such as [DeepCache](https://github.com/horseee/DeepCache))

- In this tutorial, we focus on leveraging the power of PyTorch 2 to accelerate the inference latency of text-to-image diffusion pipeline, instead. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl.md) as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.
+ In this tutorial, we focus on leveraging the power of PyTorch 2 to accelerate the inference latency of text-to-image diffusion pipeline, instead. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.

## Setup

- Make sure you're on the latest version of `diffusers`:
+ Make sure you're on the latest version of `diffusers`:

```bash
pip install -U diffusers
```
@@ -42,7 +42,7 @@ _This tutorial doesn't present the benchmarking code and focuses on how to perfo

## Baseline

- Let's start with a baseline. Disable the use of a reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0.md):
+ Let's start with a baseline. Disable the use of a reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0):

```python
from diffusers import StableDiffusionXLPipeline
# ...
```
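A full-precision baseline without `scaled_dot_product_attention` can be sketched as follows; the checkpoint, prompt, and the explicit reset to the default attention processors are our own choices rather than the exact code from this tutorial:

```python
from diffusers import StableDiffusionXLPipeline

# Default float32 weights, i.e. no reduced precision.
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")

# Fall back to the vanilla attention processors instead of scaled_dot_product_attention.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```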
@@ -104,11 +104,11 @@ _(We later ran the experiments in float16 and found out that the recent versions
* The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.
* Furthermore, in our experiments, we found bfloat16 to be much more resilient than float16 when used with quantization.

- We have a [dedicated guide](../optimization/fp16.md) for running inference in a reduced precision.
+ We have a [dedicated guide](../optimization/fp16) for running inference in a reduced precision.

## Running attention efficiently

- Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0.md), we can run them efficiently.
+ Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0), we can run them efficiently.

```python
from diffusers import StableDiffusionXLPipeline
# ...
```
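A minimal sketch combining reduced precision with SDPA, which diffusers dispatches to automatically on PyTorch 2; the checkpoint and prompt are again our own choices:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# With PyTorch 2 installed, attention runs through
# torch.nn.functional.scaled_dot_product_attention by default.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```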
@@ -200,7 +200,7 @@ It provides a minor boost from 2.54 seconds to 2.52 seconds.

<Tip warning={true}>

- Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky.md). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.
+ Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.

</Tip>
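For pipelines that do support it, enabling the fusion is a one-liner. A sketch, assuming the SDXL pipeline object from the snippets above:

```python
# Fuse the Q, K, and V projection matrices in the attention blocks into single, larger projections.
pipe.fuse_qkv_projections()
image = pipe("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", num_inference_steps=30).images[0]
```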

7 changes: 1 addition & 6 deletions docs/source/en/using-diffusers/svd.md
@@ -53,11 +53,6 @@ frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
```python
export_to_video(frames, "generated.mp4", fps=7)
```

- <video controls width="1024" height="576">
-   <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket_generated.webm" type="video/webm" />
-   <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket_generated.mp4" type="video/mp4" />
- </video>

| **Source Image** | **Video** |
|:------------:|:-----:|
| ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png) | ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif) |
@@ -86,7 +81,7 @@ You can achieve a 20-25% speed-up at the expense of slightly increased memory by
Video generation is very memory intensive because you essentially have to generate `num_frames` all at once, which is comparable to text-to-image generation with a very high batch size. To reduce the memory requirement, you have multiple options that trade inference speed for a lower memory requirement:
- enable model offloading: Each component of the pipeline is offloaded to the CPU once it's not needed anymore.
- enable feed-forward chunking: The feed-forward layer runs in a loop instead of running with a single huge feed-forward batch size.
- - reduce `decode_chunk_size`: This means that the VAE decodes frames in chunks instead of decoding them all together. **Note**: In addition to leading to a small slowdown, this method also slightly leads to video quality deterioration
+ - reduce `decode_chunk_size`: This means that the VAE decodes frames in chunks instead of decoding them all together. **Note that**, in addition to leading to a small slowdown, this method also slightly leads to video quality deterioration.

You can enable them as follows:
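A sketch that combines all three options; the checkpoint and conditioning image are assumptions consistent with the rest of this guide, and this is one possible configuration rather than the canonical one:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()      # offload each component to the CPU when it's idle
pipe.unet.enable_forward_chunking()  # run the feed-forward layers in a loop

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

frames = pipe(image, decode_chunk_size=2, generator=torch.manual_seed(42)).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```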

@@ -70,7 +70,7 @@


# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.25.0.dev0")
check_min_version("0.26.0.dev0")

logger = get_logger(__name__)

@@ -1316,13 +1316,15 @@ def save_model_hook(models, weights, output_dir):
if isinstance(model, type(accelerator.unwrap_model(unet))):
    unet_lora_layers_to_save = convert_state_dict_to_diffusers(get_peft_model_state_dict(model))
elif isinstance(model, type(accelerator.unwrap_model(text_encoder_one))):
-   text_encoder_one_lora_layers_to_save = convert_state_dict_to_diffusers(
-       get_peft_model_state_dict(model)
-   )
+   if args.train_text_encoder:
+       text_encoder_one_lora_layers_to_save = convert_state_dict_to_diffusers(
+           get_peft_model_state_dict(model)
+       )
elif isinstance(model, type(accelerator.unwrap_model(text_encoder_two))):
-   text_encoder_two_lora_layers_to_save = convert_state_dict_to_diffusers(
-       get_peft_model_state_dict(model)
-   )
+   if args.train_text_encoder:
+       text_encoder_two_lora_layers_to_save = convert_state_dict_to_diffusers(
+           get_peft_model_state_dict(model)
+       )
else:
    raise ValueError(f"unexpected save model: {model.__class__}")

@@ -1335,6 +1337,8 @@ def save_model_hook(models, weights, output_dir):
        text_encoder_lora_layers=text_encoder_one_lora_layers_to_save,
        text_encoder_2_lora_layers=text_encoder_two_lora_layers_to_save,
    )
+   if args.train_text_encoder_ti:
+       embedding_handler.save_embeddings(f"{output_dir}/{args.output_dir}_emb.safetensors")

def load_model_hook(models, input_dir):
    unet_ = None