
Merge changes #115

Merged 45 commits on Oct 3, 2023

Changes from all commits (45 commits)
bdd2544
Tests compile fixes (#5148)
DN6 Sep 26, 2023
89d8f84
Timestep bias for fine-tuning SDXL (#5094)
bghira Sep 26, 2023
4a06c74
Min-SNR Gamma: follow-up fix for zero-terminal SNR models on v-predic…
bghira Sep 26, 2023
21e402f
fix-VaeImageProcessor-docstring (#5182)
ernestchu Sep 26, 2023
9946dcf
Test Fixes for CUDA Tests and Fast Tests (#5172)
DN6 Sep 26, 2023
fd1c54a
[docs] Improved text-to-image guide (#4938)
stevhliu Sep 26, 2023
d9e7857
timestep_spacing for FlaxDPMSolverMultistepScheduler (#5189)
pcuenca Sep 26, 2023
c82f7ba
[SDXL Flax] fix SDXL flax init (#5187)
patrickvonplaten Sep 26, 2023
16d56c4
F/flax split head dim (#5181)
entrpn Sep 26, 2023
ae2fc01
Wrap lines in docstring (#5190)
pcuenca Sep 26, 2023
ad06e51
[Docs] Improve xformers page (#5196)
patrickvonplaten Sep 27, 2023
940f941
Add `test_full_loop_with_noise` tests to all scheduler with `add_nosi…
yiyixuxu Sep 27, 2023
02247d9
PEFT Integration for Text Encoder to handle multiple alphas/ranks, di…
pacman100 Sep 27, 2023
ba59e92
Fix memory issues in tests (#5183)
DN6 Sep 27, 2023
cdcc01b
[Examples] add `compute_snr()` to training utils. (#5188)
sayakpaul Sep 27, 2023
a584d42
[LoRA, Xformers] Fix xformers lora (#5201)
patrickvonplaten Sep 27, 2023
cac7ada
[Flax SDXL] fix zero out sdxl (#5203)
patrickvonplaten Sep 27, 2023
693a0d0
Remove Offensive Language from Community Pipelines (#5206)
painebenjamin Sep 27, 2023
536c297
Trickle down `split_head_dim` (#5208)
pcuenca Sep 27, 2023
d840253
[`PEFT`] Fix typo for import (#5217)
younesbelkada Sep 28, 2023
1c4c4c4
Correct file name in t2i adapter training readme (#5207)
nbardy Sep 28, 2023
39baf0b
[`PEFT` / `LoRA` ] Kohya checkpoints support (#5202)
younesbelkada Sep 28, 2023
622f35b
fixed vae scaling (#5213)
asparius Sep 28, 2023
c78ee14
Move more slow tests to nightly (#5220)
DN6 Sep 28, 2023
1d3120f
[docs] Quicktour fixes (#5211)
stevhliu Sep 29, 2023
9c03a7d
Fix DDIMInverseScheduler (#5145)
richardSHkim Sep 29, 2023
78a7851
make style
patrickvonplaten Sep 29, 2023
9cfd4ef
Make `BaseOutput` dataclasses picklable (#5234)
cbensimon Sep 29, 2023
cc92332
[`PEFT` / `LoRA` ] Fix text encoder scaling (#5204)
younesbelkada Sep 29, 2023
84e5cc5
Fix doc KO unconditional_image_generation.md (#5236)
Sep 29, 2023
0c7cb9a
Flax: Ignore PyTorch, ONNX files when they coexist with Flax weights …
pcuenca Oct 2, 2023
907fd91
Fixed constants.py not using hugging face hub environment variable (#…
Zanz2 Oct 2, 2023
bbe8d3a
Compile test fixes (#5235)
DN6 Oct 2, 2023
4f74a5e
[PEFT warnings] Only sure deprecation warnings in the future (#5240)
patrickvonplaten Oct 2, 2023
2a62aad
Add docstrings in forward methods of adapter model (#5253)
Nandika-A Oct 2, 2023
db91e71
make style
patrickvonplaten Oct 2, 2023
cd1b8d7
[WIP] Refactor UniDiffuser Pipeline and Tests (#4948)
dg845 Oct 2, 2023
d56825e
fix: how print training resume logs. (#5117)
sayakpaul Oct 2, 2023
37a787a
Add docstring for the AutoencoderKL's decode (#5242)
freespirit Oct 2, 2023
7a4324c
Add a docstring for the AutoencoderKL's encode (#5239)
freespirit Oct 2, 2023
c8b0f0e
Update UniPC to support 1D diffusion. (#5199)
leng-yue Oct 2, 2023
bdd1611
[Schedulers] Fix callback steps (#5261)
patrickvonplaten Oct 2, 2023
2457599
make fix copies
patrickvonplaten Oct 2, 2023
dfcce3c
[Research folder] Add SDXL example (#5275)
patrickvonplaten Oct 3, 2023
7271f8b
Fix UniPC scheduler for 1D (#5276)
patrickvonplaten Oct 3, 2023
1 change: 1 addition & 0 deletions .github/workflows/build_docker_images.yml
@@ -26,6 +26,7 @@ jobs:
image-name:
- diffusers-pytorch-cpu
- diffusers-pytorch-cuda
+ - diffusers-pytorch-compile-cuda
- diffusers-flax-cpu
- diffusers-flax-tpu
- diffusers-onnxruntime-cpu
48 changes: 46 additions & 2 deletions .github/workflows/push_tests.yml
@@ -74,11 +74,11 @@ jobs:
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8

run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
- -s -v -k "not Flax and not Onnx" \
+ -s -v -k "not Flax and not Onnx and not compile" \
--make-reports=tests_${{ matrix.config.report }} \
tests/

@@ -113,6 +113,50 @@ jobs:
name: ${{ matrix.config.report }}_test_reports
path: reports

run_torch_compile_tests:
name: PyTorch Compile CUDA tests

runs-on: docker-gpu

container:
image: diffusers/diffusers-pytorch-compile-cuda
options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/

steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2

- name: NVIDIA-SMI
run: |
nvidia-smi

- name: Install dependencies
run: |
python -m pip install -e .[quality,test,training]

- name: Environment
run: |
python utils/print_env.py

- name: Run example tests on GPU
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/

- name: Failure short reports
if: ${{ failure() }}
run: cat reports/tests_torch_compile_cuda_failures_short.txt

- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: torch_compile_test_reports
path: reports

run_examples_tests:
name: Examples PyTorch CUDA tests on Ubuntu

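The new job reuses pytest's `-k` name filtering: tests whose ids contain "compile" run only in this job, while the main CUDA job now deselects them via `not compile`. A minimal sketch of a test this filter would pick up (the test body is illustrative, not taken from this PR):

```python
import pytest
import torch

# Selected by `-k "compile"`; deselected by `-k "... and not compile"`.
@pytest.mark.skipif(not hasattr(torch, "compile"), reason="requires torch >= 2.0")
def test_toy_model_compile():
    model = torch.nn.Linear(4, 4).eval()
    compiled = torch.compile(model)  # toy module stands in for a diffusers UNet
    with torch.no_grad():
        out = compiled(torch.randn(1, 4))
    assert out.shape == (1, 4)
```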
48 changes: 48 additions & 0 deletions docker/diffusers-pytorch-compile-cuda/Dockerfile
@@ -0,0 +1,48 @@
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04
LABEL maintainer="Hugging Face"
LABEL repository="diffusers"

ENV DEBIAN_FRONTEND=noninteractive

RUN apt update && \
apt install -y bash \
build-essential \
git \
git-lfs \
curl \
ca-certificates \
libsndfile1-dev \
libgl1 \
python3.9 \
python3.9-dev \
python3-pip \
python3.9-venv && \
rm -rf /var/lib/apt/lists

# make sure to use venv
RUN python3.9 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
RUN python3.9 -m pip install --no-cache-dir --upgrade pip && \
python3.9 -m pip install --no-cache-dir \
torch \
torchvision \
torchaudio \
invisible_watermark && \
python3.9 -m pip install --no-cache-dir \
accelerate \
datasets \
hf-doc-builder \
huggingface-hub \
Jinja2 \
librosa \
numpy \
scipy \
tensorboard \
transformers \
omegaconf \
pytorch-lightning \
xformers

CMD ["/bin/bash"]
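A quick smoke test for the resulting image might look like the following (a hypothetical check, not part of this PR), confirming that the venv's torch build is new enough for `torch.compile` and can see the GPU:

```python
# Hypothetical sanity check for the diffusers-pytorch-compile-cuda image.
import torch

assert hasattr(torch, "compile"), "torch.compile requires torch >= 2.0"
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```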
18 changes: 4 additions & 14 deletions docs/source/en/optimization/memory.md
@@ -321,21 +321,9 @@ with torch.inference_mode():

Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/pdf/2205.14135.pdf) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)).

- The table below details the speed-ups from a few different Nvidia GPUs when running inference on image sizes of 512x512 and a batch size of 1 (one prompt):
-
- | GPU | base attention (fp16) | memory-efficient attention (fp16) |
- |------------------|-----------------------|-----------------------------------|
- | NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
- | NVIDIA 3060 RTX | 4.6it/s | 7.8it/s |
- | NVIDIA A10G | 8.88it/s | 15.6it/s |
- | NVIDIA RTX A6000 | 11.7it/s | 21.09it/s |
- | NVIDIA TITAN RTX | 12.51it/s | 18.22it/s |
- | A100-SXM4-40GB | 18.6it/s | 29.it/s |
- | A100-SXM-80GB | 18.7it/s | 29.5it/s |
-
- <Tip warning={true}>
+ <Tip>

- If you have PyTorch 2.0 installed, you shouldn't use xFormers!
+ If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling `xformers`.

</Tip>

@@ -365,3 +353,5 @@ with torch.inference_mode():
# optional: You can disable it via
# pipe.disable_xformers_memory_efficient_attention()
```

The iteration speed when using `xformers` should match the iteration speed of Torch 2.0 as described [here](torch2.0).
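Putting this section's pieces together, enabling memory-efficient attention is a one-line call on the pipeline; a minimal end-to-end sketch (the checkpoint choice is illustrative):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()
image = pipe("a photo of an astronaut riding a horse").images[0]

# optional: revert to the default attention implementation
pipe.disable_xformers_memory_efficient_attention()
```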
11 changes: 10 additions & 1 deletion docs/source/en/optimization/torch2.0.md
@@ -276,6 +276,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
| IF | 20.21 / <br>13.84 / <br>24.00 | 20.12 / <br>13.70 / <br>24.03 | ❌ | 97.34 / <br>27.23 / <br>111.66 |
| SDXL - txt2img | 8.64 | 9.9 | - | - |

### A100 (batch size: 4)

@@ -286,6 +287,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
| IF | 25.02 | 18.04 | ❌ | 48.47 |
| SDXL - txt2img | 2.44 | 2.74 | - | - |

### A100 (batch size: 16)

@@ -296,6 +298,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
| IF | 8.78 | 9.82 | ❌ | 16.77 |
| SDXL - txt2img | 0.64 | 0.72 | - | - |

### V100 (batch size: 1)

@@ -336,6 +339,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
| IF | 17.42 / <br>2.47 / <br>18.52 | 16.96 / <br>2.45 / <br>18.69 | ❌ | 24.63 / <br>2.47 / <br>23.39 |
| SDXL - txt2img | 1.15 | 1.16 | - | - |

### T4 (batch size: 4)

@@ -346,6 +350,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
| IF | 5.79 | 5.61 | ❌ | 7.39 |
| SDXL - txt2img | 0.288 | 0.289 | - | - |

### T4 (batch size: 16)

@@ -356,6 +361,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
| IF * | 1.44 | 1.44 | ❌ | 1.94 |
| SDXL - txt2img | OOM | OOM | - | - |

### RTX 3090 (batch size: 1)

@@ -396,6 +402,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
| IF | 69.71 / <br>18.78 / <br>85.49 | 69.13 / <br>18.80 / <br>85.56 | ❌ | 124.60 / <br>26.37 / <br>138.79 |
| SDXL - txt2img | 6.8 | 8.18 | - | - |

### RTX 4090 (batch size: 4)

@@ -406,6 +413,7 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
| IF | 31.88 | 31.14 | ❌ | 43.92 |
| SDXL - txt2img | 2.19 | 2.35 | - | - |

### RTX 4090 (batch size: 16)

@@ -416,10 +424,11 @@ In the following tables, we report our findings in terms of the *number of itera
| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
| IF | 9.26 | 9.2 | ❌ | 13.31 |
| SDXL - txt2img | 0.52 | 0.53 | - | - |

## Notes

* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
* For the DeepFloyd IF pipelines with batch size > 1, we used a batch size > 1 only in the first (text-to-image) IF pipeline and NOT for upscaling; that is, the two upscaling pipelines received a batch size of 1.
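The SDXL rows added to the tables above follow the same measurement pattern as the other pipelines; a minimal sketch of a compiled SDXL txt2img run (prompt and step count are illustrative, and the compile settings mirror the pattern used elsewhere in this doc):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Requires torch >= 2.0; the first call pays the compilation cost,
# so time subsequent calls when measuring iterations per second.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
```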

*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.*