Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance (NeurIPS 2024)
Kuan Heng Lin1*, Sicheng Mo1*, Ben Klingher1, Fangzhou Mu2, Bolei Zhou1
1UCLA 2NVIDIA
*Equal contribution
Our code is built on top of diffusers v0.28.0. To set up the environment, please run the following.
conda env create -f environment.yaml
conda activate ctrlx
We provide a user interface for testing our method. Running the following command starts the demo.
python app_ctrlx.py
We also provide a script for running our method. This is equivalent to the Gradio demo.
python run_ctrlx.py \
--structure_image assets/images/horse__point_cloud.jpg \
--appearance_image assets/images/horse.jpg \
--prompt "a photo of a horse standing on grass" \
--structure_prompt "a 3D point cloud of a horse"
If appearance_image is not provided, then Ctrl-X does structure-only control. If structure_image is not provided, then Ctrl-X does appearance-only control.
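The two single-image modes above come down to which image flag is left out of the command. As a minimal sketch, the helper below assembles a run_ctrlx.py invocation from whichever images are given. The function names build_ctrlx_command and control_mode are hypothetical and not part of the repository; only the four command-line flags shown in the example above are taken from this README.

```python
import shlex
from typing import List, Optional


def control_mode(structure_image: Optional[str], appearance_image: Optional[str]) -> str:
    """Name the control mode implied by which images are provided."""
    if structure_image and appearance_image:
        return "structure+appearance"
    if structure_image:
        return "structure-only"   # appearance_image omitted
    if appearance_image:
        return "appearance-only"  # structure_image omitted
    return "unspecified"          # the README does not describe this case


def build_ctrlx_command(
    prompt: str,
    structure_image: Optional[str] = None,
    appearance_image: Optional[str] = None,
    structure_prompt: Optional[str] = None,
) -> List[str]:
    """Build an argument list for run_ctrlx.py, skipping any flag left unset."""
    cmd = ["python", "run_ctrlx.py"]
    if structure_image is not None:
        cmd += ["--structure_image", structure_image]
    if appearance_image is not None:
        cmd += ["--appearance_image", appearance_image]
    cmd += ["--prompt", prompt]
    if structure_prompt is not None:
        cmd += ["--structure_prompt", structure_prompt]
    return cmd


if __name__ == "__main__":
    # Structure-only control: no appearance image is passed.
    cmd = build_ctrlx_command(
        prompt="a photo of a horse standing on grass",
        structure_image="assets/images/horse__point_cloud.jpg",
        structure_prompt="a 3D point cloud of a horse",
    )
    print(control_mode("assets/images/horse__point_cloud.jpg", None))
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run from the repository root, or printed and pasted into a shell.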
There are four optional arguments for both app_ctrlx.py and run_ctrlx.py:
- model_offload (flag): If enabled, offloads each component of both the base model and refiner to the CPU when not in use, reducing memory usage while slightly increasing inference time.
  - To use model_offload, accelerate must be installed. This must be done manually with pip install accelerate, as environment.yaml does not list accelerate.
- sequential_offload (flag): If enabled, offloads each layer of both the base model and refiner to the CPU when not in use, significantly reducing memory usage while massively increasing inference time.
  - Similarly, accelerate must be installed to use sequential_offload.
  - If both model_offload and sequential_offload are enabled, then our code defaults to sequential_offload.
- disable_refiner (flag): If enabled, disables the refiner (and does not load it), reducing memory usage.
- model (str): When provided a safetensors checkpoint path, loads the checkpoint for the base model.
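The precedence between the two offload flags can be sketched as follows. This is an illustration under assumptions, not the repository's code: resolve_offload, apply_offload, and accelerate_available are hypothetical names. The two pipeline methods called here, enable_model_cpu_offload() and enable_sequential_cpu_offload(), do exist on diffusers pipelines and require accelerate, but whether run_ctrlx.py calls them in exactly this way is assumed.

```python
import importlib.util


def accelerate_available() -> bool:
    """Return True if the accelerate package can be imported."""
    return importlib.util.find_spec("accelerate") is not None


def resolve_offload(model_offload: bool, sequential_offload: bool) -> str:
    """Pick one offload mode; sequential offload wins when both flags are set."""
    if sequential_offload:
        return "sequential"
    if model_offload:
        return "model"
    return "none"


def apply_offload(pipe, mode: str) -> None:
    """Apply the chosen mode to a diffusers-style pipeline object (assumed mapping)."""
    if mode == "sequential":
        # Layer-level offload: lowest VRAM, slowest inference.
        pipe.enable_sequential_cpu_offload()
    elif mode == "model":
        # Component-level offload: moderate VRAM savings, small slowdown.
        pipe.enable_model_cpu_offload()
    # mode == "none": nothing to do here; placing the pipeline on the GPU
    # is left to the caller.


if __name__ == "__main__":
    mode = resolve_offload(model_offload=True, sequential_offload=True)
    print(mode)
    if mode != "none" and not accelerate_available():
        print("accelerate is missing; install it with: pip install accelerate")
```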
Approximate inference time and GPU VRAM usage for the Gradio demo and script (structure and appearance control) on a single NVIDIA RTX A6000 are as follows.
Flags | Inference time (s) | GPU VRAM usage (GiB) |
---|---|---|
None | 28.8 | 18.8 |
model_offload | 38.3 | 12.6 |
sequential_offload | 169.3 | 3.8 |
disable_refiner | 25.5 | 14.5 |
model_offload + disable_refiner | 31.7 | 7.4 |
sequential_offload + disable_refiner | 151.4 | 3.8 |
Here, VRAM usage is obtained via torch.cuda.max_memory_reserved(), which is the closest option in PyTorch to nvidia-smi numbers but is probably still an underestimate. You can obtain these numbers on your own hardware by adding the benchmark flag to run_ctrlx.py.
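For measuring other code paths in the same way, the sketch below times a callable and reads the peak reserved memory from torch.cuda.max_memory_reserved(). The benchmark helper is a hypothetical name and is not the implementation behind the benchmark flag; it only illustrates the measurement described above. It falls back to reporting no memory figure when PyTorch or a CUDA device is unavailable.

```python
import time
from typing import Any, Callable, Optional, Tuple

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch installed
    torch = None


def benchmark(fn: Callable[[], Any]) -> Tuple[Any, float, Optional[float]]:
    """Run fn once; return (result, elapsed seconds, peak reserved GiB or None)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        # Clear earlier peaks so the reading reflects this call only.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        # Wait for queued GPU work before stopping the timer.
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_reserved() / 2**30 if use_cuda else None
    return result, elapsed, peak_gib


if __name__ == "__main__":
    # Placeholder workload; replace the lambda with an actual pipeline call.
    _, seconds, gib = benchmark(lambda: sum(range(1000)))
    print(f"time: {seconds:.3f} s")
    print("peak VRAM: n/a" if gib is None else f"peak VRAM: {gib:.1f} GiB")
```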
Have fun playing around with Ctrl-X! :D
- Add dataset for quantitative evaluation.
- Add support for arbitrary schedulers besides DDIM, without self-recurrence where it is not possible.
- Add support for DiTs, including SD3 and FLUX.1.
- Add support for video generation models, including CogVideoX and Mochi 1.
For questions, thoughts, discussions, or anything else you would like to reach out about, please contact Jordan Lin ([email protected]).
If you use our code in your research, please cite the following work.
@inproceedings{lin2024ctrlx,
author = {Lin, {Kuan Heng} and Mo, Sicheng and Klingher, Ben and Mu, Fangzhou and Zhou, Bolei},
booktitle = {Advances in Neural Information Processing Systems},
title = {Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance},
year = {2024}
}