James Bond is drinking Cocktail🍸.
Our approach requires only one generalized model, unlike previous approaches, which needed multiple models to mix multiple modalities.
Unlike existing schemes, ours does not require modifying the modal prior of the base model (Fig. (a)), which significantly reduces cost. Nor do we need multiple models to handle multiple modalities (Fig. (b)). Instead, Cocktail🍸 fuses the information from multiple modalities, as shown in Fig. (c).
We propose Cocktail, a pipeline that mixes various modalities into one embedding. It combines a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method to achieve multi-modal, spatially refined control for text-conditional diffusion models.
The parameters indicated by the yellow sections are sourced from the pre-trained model and stay constant, while only those in the blue sections are updated during training, with the gradient back-propagated along the blue arrows. The light grey dashed sections signify additional operations that occur solely during the inference process, specifically, the process of storing attention maps derived from the gControlNet for the sampling stage.
Here, the cross ❌ and the checkmark ✅ denote unmatched and matched modalities, respectively. Note that our model accurately captures all modalities.
- Release Gradio Demo
- Release sampling codes
- Release inference codes
- Release pre-trained models
You can create an Anaconda environment called cocktail with the required dependencies by running:
git clone https://github.com/mhh0318/cocktail.git
cd cocktail
conda env create -f environment.yaml
Download the pretrained models from here, and save them to the repository root directory.
The Gradio demo can be launched with:
python gradio_demo.py [--share]
We use HED, SAN, and OpenPose to extract the sketch map, segmentation map, and human pose map from the image.
- Extract sketch map:
python annotator/hed.py {/path/to/image.png} {/path/to/sketch.png}
- Extract segmentation map:
python annotator/SAN/run.py {/path/to/image.png} {/path/to/seg.png}
- Extract human pose map:
python annotator/openpose/run.py {/path/to/image.png} {/path/to/openpose.png}
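The three extraction commands above can be assembled for a single image in one pass. The following is a minimal sketch, assuming it is run from the repository root. The helper names `build_annotator_commands` and `run_annotators` and the output naming scheme (`<stem>_sketch.png`, `<stem>_seg.png`, `<stem>_openpose.png`) are hypothetical and not part of the repository; only the script paths and the two positional arguments (input image, output image) are taken from the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Script paths copied from the commands above.
# The dictionary keys are a hypothetical choice used to name the output files.
ANNOTATORS = {
    "sketch": "annotator/hed.py",
    "seg": "annotator/SAN/run.py",
    "openpose": "annotator/openpose/run.py",
}


def build_annotator_commands(image_path, out_dir):
    """Return one command (as an argument list) per annotator."""
    image = Path(image_path)
    out = Path(out_dir)
    commands = []
    for suffix, script in ANNOTATORS.items():
        target = out / f"{image.stem}_{suffix}.png"
        # sys.executable is the interpreter running this script,
        # standing in for the plain `python` used in the commands above.
        commands.append([sys.executable, script, str(image), str(target)])
    return commands


def run_annotators(image_path, out_dir):
    """Run each annotator in turn and stop at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_annotator_commands(image_path, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_annotators("/path/to/image.png", "/path/to/conditions")` would then write the three condition maps into one folder, provided each annotator script accepts the arguments as shown above.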
For the simultaneous vision-language generation, please run:
python ./inference {args}
Here, args is an integer, either 0 or 1, selecting one of the two provided example conditions.
If the environment is set up correctly, this command should generate results in the folder ./samples/results, with files named {args}_sample_{batch}.png.
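To check which results were written, the naming pattern above can be turned into a small lookup. This is a minimal sketch, assuming that {batch} is a per-sample index and that the script is run from the repository root. The helper names `result_path` and `find_results` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder named in the text above.
RESULTS_DIR = Path("./samples/results")


def result_path(condition, batch):
    """Expected output file for one example condition (0 or 1) and one batch value."""
    if condition not in (0, 1):
        raise ValueError("condition must be 0 or 1")
    return RESULTS_DIR / f"{condition}_sample_{batch}.png"


def find_results(condition, results_dir=RESULTS_DIR):
    """List existing result images for a condition, sorted by file name."""
    return sorted(Path(results_dir).glob(f"{condition}_sample_*.png"))
```

For example, `find_results(0)` collects whatever images the first example condition produced, without assuming how many batches were generated.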
Our codebase for the diffusion models builds heavily on ControlNet and Stable Diffusion.
Thanks to the authors for open-sourcing their work!
If you use this code for your research, please cite our paper.
@article{hu2023cocktail,
title = {Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation},
author = {Hu, Minghui and Zheng, Jianbin and Liu, Daqing and Zheng, Chuanxia and Wang, Chaoyue and Tao, Dacheng and Cham, Tat-Jen},
journal = {arXiv},
year = {2023},
}