Merge pull request #137 from huggingface/main
Merge changes
Skquark authored Jan 6, 2024
2 parents 17fe861 + 774f5c4 commit fbb26cd
Showing 60 changed files with 4,108 additions and 318 deletions.
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -160,6 +160,8 @@
    title: xFormers
  - local: optimization/tome
    title: Token merging
+ - local: optimization/deepcache
+   title: DeepCache
  title: General optimizations
- sections:
  - local: using-diffusers/stable_diffusion_jax_how_to
@@ -210,6 +212,8 @@
    title: Textual Inversion
  - local: api/loaders/unet
    title: UNet
+ - local: api/loaders/peft
+   title: PEFT
  title: Loaders
- sections:
  - local: api/models/overview
25 changes: 25 additions & 0 deletions docs/source/en/api/loaders/peft.md
@@ -0,0 +1,25 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# PEFT

Diffusers supports working with adapters (such as [LoRA](../../using-diffusers/loading_adapters)) via the [`peft` library](https://huggingface.co/docs/peft/index). We provide a `PeftAdapterMixin` class to handle this for modeling classes in Diffusers (such as [`UNet2DConditionModel`]).

<Tip>

Refer to [this doc](../../tutorials/using_peft_for_inference.md) to get an overview of how to work with `peft` in Diffusers for inference.

</Tip>

## PeftAdapterMixin

[[autodoc]] loaders.peft.PeftAdapterMixin
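For a quick sense of the workflow this mixin enables, here is a minimal sketch that loads a LoRA into an SDXL pipeline through the `peft` integration; the base checkpoint, LoRA repository, and adapter name are illustrative choices rather than anything prescribed by this page:

```python
import torch
from diffusers import DiffusionPipeline

# The pipeline's UNet inherits from PeftAdapterMixin, so LoRA loading is routed through peft.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load a publicly available SDXL LoRA (example repository) and register it under a name.
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")

# Activate the adapter and scale its contribution.
pipe.set_adapters(["toy"], adapter_weights=[0.8])

image = pipe("toy face of an astronaut", num_inference_steps=30).images[0]
image.save("toy_astronaut.png")
```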
2 changes: 1 addition & 1 deletion docs/source/en/api/pipelines/amused.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

# aMUSEd

- aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2302.05543) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.
+ aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.

Amused is a lightweight text-to-image model based on the [MUSE](https://arxiv.org/abs/2301.00704) architecture. It is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once.
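As a rough illustration of that use case, a minimal generation sketch might look like the following; the `amused/amused-512` checkpoint name and the low step count are assumptions based on the model's published checkpoints rather than something stated above:

```python
from diffusers import AmusedPipeline

# Load the 512x512 aMUSEd checkpoint (a 256x256 variant is also published).
pipe = AmusedPipeline.from_pretrained("amused/amused-512").to("cuda")

# aMUSEd needs far fewer sampling steps than typical latent diffusion models.
image = pipe("a cozy cabin in snowy woods", num_inference_steps=12).images[0]
image.save("cabin.png")
```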

62 changes: 62 additions & 0 deletions docs/source/en/optimization/deepcache.md
@@ -0,0 +1,62 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# DeepCache
[DeepCache](https://huggingface.co/papers/2312.00858) accelerates [`StableDiffusionPipeline`] and [`StableDiffusionXLPipeline`] by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the U-Net architecture.

Start by installing [DeepCache](https://github.com/horseee/DeepCache):
```bash
pip install DeepCache
```

Then load and enable the [`DeepCacheSDHelper`](https://github.com/horseee/DeepCache#usage):

```diff
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16).to("cuda")

+ from DeepCache import DeepCacheSDHelper
+ helper = DeepCacheSDHelper(pipe=pipe)
+ helper.set_params(
+ cache_interval=3,
+ cache_branch_id=0,
+ )
+ helper.enable()

image = pipe("a photo of an astronaut on a moon").images[0]
```

The `set_params` method accepts two arguments: `cache_interval` and `cache_branch_id`. `cache_interval` controls the frequency of feature caching, specified as the number of steps between each cache operation. `cache_branch_id` identifies which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching processes.
Opting for a lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality (ablation experiments of these two hyperparameters can be found in the [paper](https://arxiv.org/abs/2312.00858)). Once those arguments are set, use the `enable` or `disable` methods to activate or deactivate the `DeepCacheSDHelper`.
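Because caching is controlled entirely through the helper, you can toggle it around individual calls. A small sketch, reusing the `pipe` and `helper` objects from the snippet above:

```python
# Run the original, uncached pipeline for comparison.
helper.disable()
original = pipe("a photo of an astronaut on a moon").images[0]

# Cache less frequently (every 5 steps) and re-enable DeepCache.
helper.set_params(cache_interval=5, cache_branch_id=0)
helper.enable()
accelerated = pipe("a photo of an astronaut on a moon").images[0]
```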

<div class="flex justify-center">
<img src="https://github.com/horseee/Diffusion_DeepCache/raw/master/static/images/example.png">
</div>

You can find more generated samples (original pipeline vs DeepCache) and the corresponding inference latency in the [WandB report](https://wandb.ai/horseee/DeepCache/runs/jwlsqqgt?workspace=user-horseee). The prompts are randomly selected from the [MS-COCO 2017](https://cocodataset.org/#home) dataset.

## Benchmark

We tested how much faster DeepCache accelerates [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) with 50 inference steps on an NVIDIA RTX A5000, using different configurations for resolution, batch size, cache interval (I), and cache branch (B).

| **Resolution** | **Batch size** | **Original** | **DeepCache(I=3, B=0)** | **DeepCache(I=5, B=0)** | **DeepCache(I=5, B=1)** |
|----------------|----------------|--------------|-------------------------|-------------------------|-------------------------|
| 512| 8| 15.96| 6.88(2.32x)| 5.03(3.18x)| 7.27(2.20x)|
| | 4| 8.39| 3.60(2.33x)| 2.62(3.21x)| 3.75(2.24x)|
| | 1| 2.61| 1.12(2.33x)| 0.81(3.24x)| 1.11(2.35x)|
| 768| 8| 43.58| 18.99(2.29x)| 13.96(3.12x)| 21.27(2.05x)|
| | 4| 22.24| 9.67(2.30x)| 7.10(3.13x)| 10.74(2.07x)|
| | 1| 6.33| 2.72(2.33x)| 1.97(3.21x)| 2.98(2.12x)|
| 1024| 8| 101.95| 45.57(2.24x)| 33.72(3.02x)| 53.00(1.92x)|
| | 4| 49.25| 21.86(2.25x)| 16.19(3.04x)| 25.78(1.91x)|
| | 1| 13.83| 6.07(2.28x)| 4.43(3.12x)| 7.15(1.93x)|
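The exact benchmarking script isn't reproduced here, but a single cell of the table can be approximated with a sketch along the following lines; the CUDA-event timing and the warmup call are our own choices:

```python
import torch
from diffusers import StableDiffusionPipeline
from DeepCache import DeepCacheSDHelper

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)  # the I=3, B=0 column
helper.enable()

prompt = "a photo of an astronaut on a moon"
pipe(prompt, num_inference_steps=50)  # warmup

start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
pipe(prompt, num_inference_steps=50, height=512, width=512, num_images_per_prompt=4)
end.record()
torch.cuda.synchronize()
print(f"Latency: {start.elapsed_time(end) / 1000:.2f}s")  # compare against the 512 / batch-size-4 row
```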
14 changes: 7 additions & 7 deletions docs/source/en/tutorials/fast_diffusion.md
@@ -14,15 +14,15 @@ specific language governing permissions and limitations under the License.

Diffusion models are known to be slower than their counterparts, GANs, because of the iterative and sequential reverse diffusion process. Recent works try to address this limitation with:

- * progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora.md))
+ * progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora))
* model compression (such as [SSD-1B](https://huggingface.co/segmind/SSD-1B))
* reusing adjacent features of the denoiser (such as [DeepCache](https://github.com/horseee/DeepCache))

- In this tutorial, we focus on leveraging the power of PyTorch 2 to accelerate the inference latency of text-to-image diffusion pipeline, instead. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl.md) as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.
+ In this tutorial, we focus on leveraging the power of PyTorch 2 to accelerate the inference latency of text-to-image diffusion pipeline, instead. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.

## Setup

- Make sure you're on the latest version of `diffusers`:
+ Make sure you're on the latest version of `diffusers`:

```bash
pip install -U diffusers
```
@@ -42,7 +42,7 @@ _This tutorial doesn't present the benchmarking code and focuses on how to perfo

## Baseline

- Let's start with a baseline. Disable the use of a reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0.md):
+ Let's start with a baseline. Disable the use of a reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0):

```python
from diffusers import StableDiffusionXLPipeline
# ...
```
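A full-precision baseline without `scaled_dot_product_attention` can be sketched as follows; the checkpoint, prompt, and the explicit reset to the default attention processors are our own choices rather than the exact code from this tutorial:

```python
from diffusers import StableDiffusionXLPipeline

# Default float32 weights, i.e. no reduced precision.
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")

# Fall back to the vanilla attention processors instead of scaled_dot_product_attention.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```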
@@ -104,11 +104,11 @@ _(We later ran the experiments in float16 and found out that the recent versions
* The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.
* Furthermore, in our experiments, we found bfloat16 to be much more resilient than float16 when used with quantization.

- We have a [dedicated guide](../optimization/fp16.md) for running inference in a reduced precision.
+ We have a [dedicated guide](../optimization/fp16) for running inference in a reduced precision.

## Running attention efficiently

- Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0.md), we can run them efficiently.
+ Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0), we can run them efficiently.

```python
from diffusers import StableDiffusionXLPipeline
# ...
```
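A minimal sketch combining reduced precision with SDPA, which diffusers dispatches to automatically on PyTorch 2; the checkpoint and prompt are again our own choices:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# With PyTorch 2 installed, attention runs through
# torch.nn.functional.scaled_dot_product_attention by default.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```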
@@ -200,7 +200,7 @@ It provides a minor boost from 2.54 seconds to 2.52 seconds.

<Tip warning={true}>

- Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky.md). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.
+ Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.

</Tip>
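For pipelines that do support it, enabling the fusion is a one-liner. A sketch, assuming the SDXL pipeline object from the snippets above:

```python
# Fuse the Q, K, and V projection matrices in the attention blocks into single, larger projections.
pipe.fuse_qkv_projections()
image = pipe("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", num_inference_steps=30).images[0]
```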

7 changes: 1 addition & 6 deletions docs/source/en/using-diffusers/svd.md
@@ -53,11 +53,6 @@ frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
```python
export_to_video(frames, "generated.mp4", fps=7)
```

- <video controls width="1024" height="576">
-   <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket_generated.webm" type="video/webm" />
-   <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket_generated.mp4" type="video/mp4" />
- </video>

| **Source Image** | **Video** |
|:------------:|:-----:|
| ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png) | ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif) |
@@ -86,7 +81,7 @@ You can achieve a 20-25% speed-up at the expense of slightly increased memory by
Video generation is very memory intensive because you essentially have to generate `num_frames` all at once, which is comparable to text-to-image generation with a very high batch size. To reduce the memory requirement, you have multiple options that trade inference speed for a lower memory requirement:
- enable model offloading: Each component of the pipeline is offloaded to the CPU once it's not needed anymore.
- enable feed-forward chunking: The feed-forward layer runs in a loop instead of running with a single huge feed-forward batch size.
- - reduce `decode_chunk_size`: This means that the VAE decodes frames in chunks instead of decoding them all together. **Note**: In addition to leading to a small slowdown, this method also slightly leads to video quality deterioration
+ - reduce `decode_chunk_size`: This means that the VAE decodes frames in chunks instead of decoding them all together. **Note that**, in addition to leading to a small slowdown, this method also slightly leads to video quality deterioration.

You can enable them as follows:
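A sketch that combines all three options; the checkpoint and conditioning image are assumptions consistent with the rest of this guide, and this is one possible configuration rather than the canonical one:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()      # offload each component to the CPU when it's idle
pipe.unet.enable_forward_chunking()  # run the feed-forward layers in a loop

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

frames = pipe(image, decode_chunk_size=2, generator=torch.manual_seed(42)).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```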

@@ -70,7 +70,7 @@


# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.25.0.dev0")
check_min_version("0.26.0.dev0")

logger = get_logger(__name__)

@@ -1316,13 +1316,15 @@ def save_model_hook(models, weights, output_dir):
if isinstance(model, type(accelerator.unwrap_model(unet))):
    unet_lora_layers_to_save = convert_state_dict_to_diffusers(get_peft_model_state_dict(model))
elif isinstance(model, type(accelerator.unwrap_model(text_encoder_one))):
-   text_encoder_one_lora_layers_to_save = convert_state_dict_to_diffusers(
-       get_peft_model_state_dict(model)
-   )
+   if args.train_text_encoder:
+       text_encoder_one_lora_layers_to_save = convert_state_dict_to_diffusers(
+           get_peft_model_state_dict(model)
+       )
elif isinstance(model, type(accelerator.unwrap_model(text_encoder_two))):
-   text_encoder_two_lora_layers_to_save = convert_state_dict_to_diffusers(
-       get_peft_model_state_dict(model)
-   )
+   if args.train_text_encoder:
+       text_encoder_two_lora_layers_to_save = convert_state_dict_to_diffusers(
+           get_peft_model_state_dict(model)
+       )
else:
    raise ValueError(f"unexpected save model: {model.__class__}")

@@ -1335,6 +1337,8 @@ def save_model_hook(models, weights, output_dir):
        text_encoder_lora_layers=text_encoder_one_lora_layers_to_save,
        text_encoder_2_lora_layers=text_encoder_two_lora_layers_to_save,
    )
+   if args.train_text_encoder_ti:
+       embedding_handler.save_embeddings(f"{output_dir}/{args.output_dir}_emb.safetensors")

def load_model_hook(models, input_dir):
    unet_ = None