Merge changes #182

Merged 35 commits on Oct 3, 2024

Commits
14a1b86
Several fixes to Flux ControlNet pipelines (#9472)
vladmandic Sep 20, 2024
e5d0a32
[refactor] LoRA tests (#9481)
a-r-r-o-w Sep 21, 2024
aa73072
[CI] fix nightly model tests (#9483)
sayakpaul Sep 21, 2024
ba5af5a
[Cog] some minor fixes and nits (#9466)
sayakpaul Sep 23, 2024
14f6464
[Tests] Reduce the model size in the lumina test (#8985)
saqlain2204 Sep 23, 2024
00f5b41
Fix the bug of sd3 controlnet training when using gradient checkpoint…
pibbo88 Sep 23, 2024
65f9439
[Schedulers] Add exponential sigmas / exponential noise schedule (#9499)
hlky Sep 23, 2024
3e69e24
Allow DDPMPipeline half precision (#9222)
sbinnee Sep 23, 2024
19547a5
Add Noise Schedule/Schedule Type to Schedulers Overview documentation…
hlky Sep 23, 2024
bab1778
fix bugs for sd3 controlnet training (#9489)
xduzhangjiayu Sep 23, 2024
2b5bc5b
[Doc] Fix path and also import imageio (#9506)
LukeLIN-web Sep 23, 2024
28f9d84
[CI] allow faster downloads from the Hub in CI. (#9478)
sayakpaul Sep 24, 2024
bac8a24
a few fix for SingleFile tests (#9522)
yiyixuxu Sep 24, 2024
b52684c
Add exponential sigmas to other schedulers and update docs (#9518)
hlky Sep 25, 2024
6ca5a58
[Community Pipeline] Batched implementation of Flux with CFG (#9513)
sayakpaul Sep 25, 2024
065ce07
Update community_projects.md (#9266)
lee101 Sep 25, 2024
d9c9691
[docs] Model sharding (#9521)
stevhliu Sep 25, 2024
c76e884
update get_parameter_dtype (#9526)
yiyixuxu Sep 25, 2024
aa3c46d
[Doc] Improved level of clarity for latents_to_rgb. (#9529)
LagPixelLOL Sep 25, 2024
1c6ede9
[Schedulers] Add beta sigmas / beta noise schedule (#9509)
hlky Sep 25, 2024
9cd3755
flux controlnet fix (control_modes batch & others) (#9507)
yiyixuxu Sep 26, 2024
066ea37
[Tests] Fix ChatGLMTokenizer (#9536)
asomoza Sep 26, 2024
665c6b4
[bug] Precedence of operations in VAE should be slicing -> tiling (#9…
a-r-r-o-w Sep 26, 2024
2daedc0
[LoRA] make set_adapters() method more robust. (#9535)
sayakpaul Sep 27, 2024
534848c
[examples] add train flux-controlnet scripts in example. (#9324)
PromeAIpro Sep 27, 2024
81cf3b2
[Tests] [LoRA] clean up the serialization stuff. (#9512)
sayakpaul Sep 27, 2024
1154243
[Core] fix variant-identification. (#9253)
sayakpaul Sep 28, 2024
bd4df28
[refactor] remove conv_cache from CogVideoX VAE (#9524)
a-r-r-o-w Sep 28, 2024
b28675c
[train_instruct_pix2pix.py]Fix the LR schedulers when `num_train_epoc…
AnandK27 Sep 28, 2024
8e7d6c0
[chore] fix: retain memory utility. (#9543)
sayakpaul Sep 28, 2024
f9fd511
[LoRA] support Kohya Flux LoRAs that have text encoders as well (#9542)
sayakpaul Sep 30, 2024
c4a8979
Add beta sigmas to other schedulers and update docs (#9538)
hlky Sep 30, 2024
33fafe3
Add PAG support to StableDiffusionControlNetPAGInpaintPipeline (#8875)
juancopi81 Oct 1, 2024
61d3764
Support bfloat16 for Upsample2D (#9480)
darhsu Oct 2, 2024
7f323f0
fix cogvideox autoencoder decode (#9569)
Xiang-cd Oct 2, 2024
1 change: 1 addition & 0 deletions .github/workflows/benchmark.yml
@@ -7,6 +7,7 @@ on:

env:
DIFFUSERS_IS_CI: yes
HF_HUB_ENABLE_HF_TRANSFER: 1
HF_HOME: /mnt/cache
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
1 change: 1 addition & 0 deletions .github/workflows/pr_tests.yml
@@ -22,6 +22,7 @@ concurrency:

env:
DIFFUSERS_IS_CI: yes
HF_HUB_ENABLE_HF_TRANSFER: 1
OMP_NUM_THREADS: 4
MKL_NUM_THREADS: 4
PYTEST_TIMEOUT: 60
1 change: 1 addition & 0 deletions .github/workflows/push_tests.yml
@@ -14,6 +14,7 @@ env:
DIFFUSERS_IS_CI: yes
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
HF_HUB_ENABLE_HF_TRANSFER: 1
PYTEST_TIMEOUT: 600
PIPELINE_USAGE_CUTOFF: 50000

1 change: 1 addition & 0 deletions .github/workflows/push_tests_fast.yml
@@ -18,6 +18,7 @@ env:
HF_HOME: /mnt/cache
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
HF_HUB_ENABLE_HF_TRANSFER: 1
PYTEST_TIMEOUT: 600
RUN_SLOW: no

1 change: 1 addition & 0 deletions .github/workflows/push_tests_mps.yml
@@ -13,6 +13,7 @@ env:
HF_HOME: /mnt/cache
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
HF_HUB_ENABLE_HF_TRANSFER: 1
PYTEST_TIMEOUT: 600
RUN_SLOW: no

3 changes: 2 additions & 1 deletion docker/diffusers-flax-cpu/Dockerfile
@@ -43,6 +43,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
numpy==1.26.4 \
scipy \
tensorboard \
transformers
transformers \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-flax-tpu/Dockerfile
@@ -45,6 +45,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
numpy==1.26.4 \
scipy \
tensorboard \
transformers
transformers \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-onnxruntime-cpu/Dockerfile
@@ -43,6 +43,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
numpy==1.26.4 \
scipy \
tensorboard \
transformers
transformers \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-onnxruntime-cuda/Dockerfile
@@ -44,6 +44,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
numpy==1.26.4 \
scipy \
tensorboard \
transformers
transformers \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-pytorch-compile-cuda/Dockerfile
@@ -44,6 +44,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
numpy==1.26.4 \
scipy \
tensorboard \
transformers
transformers \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-pytorch-cpu/Dockerfile
@@ -44,6 +44,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
numpy==1.26.4 \
scipy \
tensorboard \
transformers matplotlib
transformers matplotlib \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-pytorch-cuda/Dockerfile
@@ -45,6 +45,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
scipy \
tensorboard \
transformers \
pytorch-lightning
pytorch-lightning \
hf_transfer

CMD ["/bin/bash"]
3 changes: 2 additions & 1 deletion docker/diffusers-pytorch-xformers-cuda/Dockerfile
@@ -45,6 +45,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
scipy \
tensorboard \
transformers \
xformers
xformers \
hf_transfer

CMD ["/bin/bash"]
2 changes: 1 addition & 1 deletion docs/source/en/_toctree.yml
@@ -56,7 +56,7 @@
- local: using-diffusers/overview_techniques
title: Overview
- local: training/distributed_inference
title: Distributed inference with multiple GPUs
title: Distributed inference
- local: using-diffusers/merge_loras
title: Merge LoRAs
- local: using-diffusers/scheduler_features
3 changes: 3 additions & 0 deletions docs/source/en/api/pipelines/pag.md
@@ -55,6 +55,9 @@ Since RegEx is supported as a way for matching layer identifiers, it is crucial

## StableDiffusionControlNetPAGPipeline
[[autodoc]] StableDiffusionControlNetPAGPipeline

## StableDiffusionControlNetPAGInpaintPipeline
[[autodoc]] StableDiffusionControlNetPAGInpaintPipeline
- all
- __call__
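
A rough usage sketch for the new pipeline (checkpoints and image paths below are illustrative placeholders; the call signature is assumed to mirror the existing ControlNet inpaint pipelines):

```py
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPAGInpaintPipeline
from diffusers.utils import load_image

# Placeholder checkpoints: any SD 1.5 base with a compatible inpaint ControlNet should work.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPAGInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("path/to/image.png")  # placeholder
mask_image = load_image("path/to/mask.png")   # placeholder

result = pipe(
    "a red couch in a bright living room",
    image=init_image,
    mask_image=mask_image,
    control_image=init_image,
    pag_scale=3.0,
).images[0]
```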

1 change: 1 addition & 0 deletions docs/source/en/api/pipelines/text_to_video_zero.md
@@ -40,6 +40,7 @@ To generate a video from prompt, run the following Python code:
```python
import torch
from diffusers import TextToVideoZeroPipeline
import imageio

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
9 changes: 9 additions & 0 deletions docs/source/en/api/schedulers/overview.md
@@ -45,6 +45,15 @@ Many schedulers are implemented from the [k-diffusion](https://github.com/crowso
| N/A | [`DEISMultistepScheduler`] | |
| N/A | [`UniPCMultistepScheduler`] | |

## Noise schedules and schedule types
| A1111/k-diffusion | 🤗 Diffusers |
|--------------------------|----------------------------------------------------------------------------|
| Karras | init with `use_karras_sigmas=True` |
| sgm_uniform | init with `timestep_spacing="trailing"` |
| simple | init with `timestep_spacing="trailing"` |
| exponential | init with `timestep_spacing="linspace"`, `use_exponential_sigmas=True` |
| beta | init with `timestep_spacing="linspace"`, `use_beta_sigmas=True` |
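
For illustration, a sketch of switching a pipeline to the exponential or beta schedule from the table above; this assumes the scheduler in use exposes the `use_exponential_sigmas`/`use_beta_sigmas` flags added in this PR:

```py
import torch
from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# "exponential" row: linspace spacing with exponential sigmas
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="linspace", use_exponential_sigmas=True
)

# "beta" row: linspace spacing with beta sigmas
# pipe.scheduler = EulerDiscreteScheduler.from_config(
#     pipe.scheduler.config, timestep_spacing="linspace", use_beta_sigmas=True
# )
```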

All schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers.

## SchedulerMixin
4 changes: 4 additions & 0 deletions docs/source/en/community_projects.md
@@ -75,4 +75,8 @@ Happy exploring, and thank you for being part of the Diffusers community!
<td><a href="https://github.com/cumulo-autumn/StreamDiffusion"> StreamDiffusion </a></td>
<td>A Pipeline-Level Solution for Real-Time Interactive Generation</td>
</tr>
<tr style="border-top: 2px solid black">
<td><a href="https://github.com/Netwrck/stable-diffusion-server"> Stable Diffusion Server </a></td>
<td>A server configured for Inpainting/Generation/img2img with one stable diffusion model</td>
</tr>
</table>
2 changes: 1 addition & 1 deletion docs/source/en/optimization/coreml.md
@@ -95,7 +95,7 @@ print(f"Model downloaded at {model_path}")
Once you have downloaded a snapshot of the model, you can test it using Apple's Python script.

```shell
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_original_packages -o </path/to/output/image> --compute-unit CPU_AND_GPU --seed 93
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./models/coreml-stable-diffusion-v1-4_original_packages/original/packages -o </path/to/output/image> --compute-unit CPU_AND_GPU --seed 93
```

Pass the path of the downloaded checkpoint with `-i` flag to the script. `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an optional output path, and a seed for reproducibility.
130 changes: 129 additions & 1 deletion docs/source/en/training/distributed_inference.md
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

# Distributed inference with multiple GPUs
# Distributed inference

On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.

@@ -109,3 +109,131 @@ torchrun run_distributed.py --nproc_per_node=2

> [!TIP]
> You can use `device_map` within a [`DiffusionPipeline`] to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more.

## Model sharding

Modern diffusion systems such as [Flux](../api/pipelines/flux) are very large and have multiple models. For example, [Flux.1-Dev](https://hf.co/black-forest-labs/FLUX.1-dev) is made up of two text encoders - [T5-XXL](https://hf.co/google/t5-v1_1-xxl) and [CLIP-L](https://hf.co/openai/clip-vit-large-patch14) - a [diffusion transformer](../api/models/flux_transformer), and a [VAE](../api/models/autoencoderkl). With a model this size, it can be challenging to run inference on consumer GPUs.

Model sharding is a technique that distributes models across GPUs when the models don't fit on a single GPU. The example below assumes two 16GB GPUs are available for inference.

Start by computing the text embeddings with the text encoders. Keep the text encoders on two GPUs by setting `device_map="balanced"`. The `balanced` strategy evenly distributes the model on all available GPUs. Use the `max_memory` parameter to allocate the maximum amount of memory for each text encoder on each GPU.

> [!TIP]
> **Only** load the text encoders for this step! The diffusion transformer and VAE are loaded in a later step to preserve memory.

```py
from diffusers import FluxPipeline
import torch

prompt = "a photo of a dog with cat-like look"

pipeline = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
transformer=None,
vae=None,
device_map="balanced",
max_memory={0: "16GB", 1: "16GB"},
torch_dtype=torch.bfloat16
)
with torch.no_grad():
print("Encoding prompts.")
prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
prompt=prompt, prompt_2=None, max_sequence_length=512
)
```

Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer.

```py
import gc

def flush():
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()

del pipeline.text_encoder
del pipeline.text_encoder_2
del pipeline.tokenizer
del pipeline.tokenizer_2
del pipeline

flush()
```

Load the diffusion transformer next which has 12.5B parameters. This time, set `device_map="auto"` to automatically distribute the model across two 16GB GPUs. The `auto` strategy is backed by [Accelerate](https://hf.co/docs/accelerate/index) and available as a part of the [Big Model Inference](https://hf.co/docs/accelerate/concept_guides/big_model_inference) feature. It starts by distributing a model across the fastest device first (GPU) before moving to slower devices like the CPU and hard drive if needed. The trade-off of storing model parameters on slower devices is slower inference latency.

```py
from diffusers import FluxTransformer2DModel
import torch

transformer = FluxTransformer2DModel.from_pretrained(
"black-forest-labs/FLUX.1-dev",
subfolder="transformer",
device_map="auto",
torch_dtype=torch.bfloat16
)
```

> [!TIP]
> At any point, you can try `print(pipeline.hf_device_map)` to see how the various models are distributed across devices. This is useful for tracking the device placement of the models.

Add the transformer model to the pipeline for denoising, but set the other model-level components like the text encoders and VAE to `None` because you don't need them yet.

```py
pipeline = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", ,
text_encoder=None,
text_encoder_2=None,
tokenizer=None,
tokenizer_2=None,
vae=None,
transformer=transformer,
torch_dtype=torch.bfloat16
)

print("Running denoising.")
height, width = 768, 1360
latents = pipeline(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=50,
guidance_scale=3.5,
height=height,
width=width,
output_type="latent",
).images
```

Remove the pipeline and transformer from memory as they're no longer needed.

```py
del pipeline.transformer
del pipeline

flush()
```

Finally, decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU.

```py
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
import torch

vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")
vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

with torch.no_grad():
print("Running decoding.")
latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor

image = vae.decode(latents, return_dict=False)[0]
image = image_processor.postprocess(image, output_type="pil")
image[0].save("split_transformer.png")
```

By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs.
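
One way to verify that each stage stays within the 16GB budget (a small sketch using only standard PyTorch calls) is to print the peak memory per GPU after a stage finishes, before `flush()` resets the statistics:

```py
import torch

# Report peak allocated memory on every visible GPU for the stage that just ran.
for device_id in range(torch.cuda.device_count()):
    peak_gb = torch.cuda.max_memory_allocated(device_id) / 1024**3
    print(f"cuda:{device_id} peak memory: {peak_gb:.2f} GB")
```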
7 changes: 3 additions & 4 deletions docs/source/en/using-diffusers/callback.md
@@ -171,14 +171,13 @@ def latents_to_rgb(latents):
weights = (
(60, -60, 25, -70),
(60, -5, 15, -50),
(60, 10, -5, -35)
(60, 10, -5, -35),
)

weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
image_array = image_array.transpose(1, 2, 0)
image_array = rgb_tensor.clamp(0, 255).byte().cpu().numpy().transpose(1, 2, 0)

return Image.fromarray(image_array)
```
@@ -189,7 +188,7 @@ def latents_to_rgb(latents):
def decode_tensors(pipe, step, timestep, callback_kwargs):
latents = callback_kwargs["latents"]

image = latents_to_rgb(latents)
image = latents_to_rgb(latents[0])
image.save(f"{step}.png")

return callback_kwargs
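
For context, a sketch of how `decode_tensors` is typically passed to a pipeline call (assuming an SDXL pipeline, which matches the four-channel weights used in `latents_to_rgb`; the prompt is illustrative):

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# decode_tensors receives the latents at every denoising step and saves a preview image.
image = pipeline(
    prompt="A croissant on a marble table",
    callback_on_step_end=decode_tensors,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]
```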
8 changes: 3 additions & 5 deletions examples/cogvideo/train_cogvideox_lora.py
@@ -38,10 +38,7 @@
from diffusers.models.embeddings import get_3d_rotary_pos_embed
from diffusers.optimization import get_scheduler
from diffusers.pipelines.cogvideo.pipeline_cogvideox import get_resize_crop_region_for_grid
from diffusers.training_utils import (
cast_training_params,
clear_objs_and_retain_memory,
)
from diffusers.training_utils import cast_training_params, free_memory
from diffusers.utils import check_min_version, convert_unet_state_dict_to_peft, export_to_video, is_wandb_available
from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
from diffusers.utils.torch_utils import is_compiled_module
@@ -726,7 +723,8 @@ def log_validation(
}
)

clear_objs_and_retain_memory([pipe])
del pipe
free_memory()

return videos
