DreamWaltz-G is a novel learning framework for text-driven 3D avatar creation and expressive whole-body animation. Its core design combines a hybrid 3D Gaussian avatar representation with skeleton-guided 2D diffusion. Our method supports diverse applications such as shape control and editing, 2D human video reenactment, and 3D Gaussian scene composition.
- [2024-11-20] 🔥[New feature] Reenact arbitrary in-the-wild video with our avatars! (thanks to @gt732)
- [2024-10-15] 🔥Release the training and inference code.
- [2024-10-15] 🔥Release the pre-trained models of 12 full-body 3D Gaussian avatars ready for inference.
- [2024-10-15] 🔥Release a dataset for 2D human video reenactment. It comprises 19 human motion scenes with original videos, inpainted videos (where humans are removed), SMPL-X motions, and camera parameters.
- [2024-09-26] 📢Publish the arXiv preprint and update the project page.
Please follow the instructions below to get the code and install dependencies.
- Clone this repository and navigate to the DreamWaltz-G folder:
git clone --branch main --single-branch https://github.com/Yukun-Huang/DreamWaltz-G.git
cd DreamWaltz-G
- Install packages. Note that requirements.txt is automatically generated and may not be accurate. We recommend using the provided script for installation:
bash scripts/install.sh
- Activate the installed conda environment:
conda activate dreamwaltz
- [Optional] Similar to DreamWaltz, the CUDA extensions for Instant-NGP (heavily borrowed from stable-dreamfusion and latent-nerf) are required and will be built at runtime. If you prefer to build them manually, the following commands may be useful:
python -m core.nerf.freqencoder.backend
python -m core.nerf.gridencoder.backend
python -m core.nerf.raymarching.rgb.backend
# python -m core.nerf.raymarching.latent.backend # uncomment this if you want to use Latent-NeRF
Before running the code, you need to prepare the human template models: SMPL-X, FLAME, and VPoser. Please download them from the official project pages: https://smpl-x.is.tue.mpg.de/ and https://flame.is.tue.mpg.de/, then organize them following the structure below:
external
└── human_templates
    ├── smplx
    │   ├── SMPLX_NEUTRAL_2020.npz
    │   ├── FLAME_vertex_ids.npy
    │   ├── MANO_vertex_ids.pkl
    │   └── smplx_vert_segmentation.json
    ├── flame
    │   └── FLAME_masks.pkl
    └── vposer
        └── v2.0
            ├── snapshots
            │   ├── V02_05_epoch=08_val_loss=0.03.ckpt
            │   └── V02_05_epoch=13_val_loss=0.03.ckpt
            ├── V02_05.yaml
            └── V02_05.log
If you already have these models on your machine, you can simply modify the path in configs/path.py to link to them.
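Before training, it can help to confirm that the template files are where the tree above says they should be. The following is a minimal sketch, not part of the repository: the helper name `missing_template_files` and the constant `LISTED_TEMPLATE_FILES` are hypothetical, and the file list simply mirrors the tree above (whether every listed file is strictly needed at runtime is not verified here).

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
LISTED_TEMPLATE_FILES = [
    "smplx/SMPLX_NEUTRAL_2020.npz",
    "smplx/FLAME_vertex_ids.npy",
    "smplx/MANO_vertex_ids.pkl",
    "smplx/smplx_vert_segmentation.json",
    "flame/FLAME_masks.pkl",
    "vposer/v2.0/snapshots/V02_05_epoch=08_val_loss=0.03.ckpt",
    "vposer/v2.0/snapshots/V02_05_epoch=13_val_loss=0.03.ckpt",
    "vposer/v2.0/V02_05.yaml",
    "vposer/v2.0/V02_05.log",
]


def missing_template_files(root="external/human_templates"):
    """Return the listed files that are not present under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in LISTED_TEMPLATE_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_template_files()
    if missing:
        print("Missing template files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed template files were found.")
```

If your models live elsewhere, pass that directory as `root` instead of the default.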
DreamWaltz-G adopts a two-stage NeRF→3DGS training pipeline, where the NeRF is initialized with SMPL-X before training. We provide these pre-trained NeRFs (Instant-NGP, specifically) on HuggingFace. You may download and organize them following the structure below:
external
└── human_templates
    ├── instant-ngp
    │   ├── adult_neutral
    │   │   ├── step_005000.pth
    │   │   └── 005000_image.mp4
    ...
Alternatively, if you want to train them yourself, you can simply run the script:
bash scripts/pretrain_nerf.sh
We provide the pre-trained weights of 12 full-body 3D Gaussian avatars, ready for 3D animation and 2D video reenactment without training. You may download them from HuggingFace and organize them following the structure below:
outputs
├── w_expr
│   ├── a_chef_dressed_in_white
│   ├── a_gardener_in_overalls_and_a_wide-brimmed_hat
│   └── ...
└── wo_expr
    ├── a_clown
    ├── black_widow
    └── ...
Unfortunately, due to limitations of DreamWaltz-G and SMPL-X, not all of these avatars support expression control. Specifically, the avatars in w_expr support expression control (e.g., realistic humans), while the avatars in wo_expr do not (e.g., fictional characters).
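To see at a glance which downloaded avatars can be driven with expressions, one could scan the two folders. This is a minimal sketch under the assumption that expression support is determined solely by the parent folder name (`w_expr` or `wo_expr`), as the layout above suggests; the function name `list_avatars` is hypothetical and not part of the repository.

```python
from pathlib import Path


def list_avatars(outputs_root="outputs"):
    """Map each avatar directory name to whether it supports expression control.

    Assumption: support is inferred only from the parent folder name,
    `w_expr` (supported) or `wo_expr` (not supported).
    """
    root = Path(outputs_root)
    avatars = {}
    for group, has_expr in (("w_expr", True), ("wo_expr", False)):
        group_dir = root / group
        if not group_dir.is_dir():
            continue  # this group has not been downloaded yet
        for avatar_dir in sorted(group_dir.iterdir()):
            if avatar_dir.is_dir():
                avatars[avatar_dir.name] = has_expr
    return avatars


if __name__ == "__main__":
    for name, has_expr in list_avatars().items():
        print(name, "(expression control)" if has_expr else "(no expression control)")
```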
As a score distillation-based method, DreamWaltz-G is supervised by a pre-trained 2D diffusion model and requires no training data. The data introduced below is only used for inference.
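For context, the score distillation sampling (SDS) gradient, in the generic form introduced by DreamFusion, can be written as below. This is a reference formulation under assumed notation, not necessarily the exact objective implemented in this repository: $\theta$ denotes the parameters of the 3D representation, $g$ a differentiable renderer, $\hat{\epsilon}_\phi$ the noise prediction of the frozen diffusion model, $y$ the conditioning (the text prompt, and in a skeleton-guided setting presumably also the skeleton image), and $w(t)$ a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad x_t = \alpha_t\, x + \sigma_t\, \epsilon
```

Only $\theta$ is updated; the supervision signal comes from the frozen diffusion model's noise prediction on rendered images, which is why no image dataset is needed for training.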
We provide data loaders to read SMPL-X motion sequences from four publicly available human motion datasets: Motion-X, TalkSHOW, AIST++, and 3DPW. These motion data can be used to animate our 3D avatars for various demos.
To use these datasets, you may download them from their official websites and organize them according to the following structure (no need to unzip):
datasets
├── 3DPW
│   ├── readme_and_demo.zip
│   ├── sequenceFiles.zip
│   └── SMPL-X.zip
├── AIST++
│   ├── 20210308_cameras.zip
│   └── 20210308_motions.zip
├── Motion-X
│   └── motionx_smplx.zip
└── TalkShow
    ├── chemistry_pkl_tar.tar.gz
    ├── conan_pkl_tar.tar.gz
    └── ...
For more details, please refer to our code in data/human/.
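The "no need to unzip" note means the archives can be read in place. As an illustration of how that works in general, here is a minimal sketch using Python's standard `zipfile` module; it is not the repository's loader, and the helper names `list_archive_members` and `read_member` are hypothetical. The TalkSHOW `.tar.gz` archives would need the `tarfile` module instead.

```python
import zipfile


def list_archive_members(zip_path, suffix=""):
    """List file names inside a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if not name.endswith("/") and name.endswith(suffix)  # skip directory entries
        ]


def read_member(zip_path, member):
    """Read one member's raw bytes directly from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            return handle.read()


if __name__ == "__main__":
    # Path taken from the directory tree above; adjust if your layout differs.
    for name in list_archive_members("datasets/Motion-X/motionx_smplx.zip")[:10]:
        print(name)
```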
We build a new dataset from Motion-X for 2D human video reenactment. It comprises 19 human motion scenes with original videos, inpainted videos (where humans are removed), SMPL-X motions, and camera parameters. You may download this dataset from HuggingFace and place it according to the structure below (no need to unzip):
datasets
├── Motion-X-ReEnact
│   └── Motion-X-ReEnact.zip
...
Based on this dataset, the generated 3D avatars can be projected onto 2D inpainted videos to achieve motion reenactment. We hope that this dataset can assist future work in evaluating the human video reenactment task.
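Conceptually, placing a rendered avatar onto an inpainted frame is an alpha-compositing step. The sketch below shows standard "over" compositing on plain nested lists; it assumes the renderer yields a per-pixel color and opacity for the avatar, and it is an illustration rather than the repository's actual implementation. The function names are hypothetical.

```python
def composite_pixel(fg_rgb, alpha, bg_rgb):
    """Blend one rendered avatar pixel over one background pixel."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg_rgb, bg_rgb))


def composite_frame(fg, alpha, bg):
    """Alpha-composite a rendered avatar frame over an inpainted background frame.

    fg, bg: H x W nested lists of (r, g, b) floats in [0, 1].
    alpha:  H x W nested list of avatar opacities in [0, 1].
    """
    return [
        [composite_pixel(f, a, b) for f, a, b in zip(fg_row, a_row, bg_row)]
        for fg_row, a_row, bg_row in zip(fg, alpha, bg)
    ]
```

Where the opacity is 1 the avatar fully covers the background, and where it is 0 the inpainted background shows through unchanged.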
To create a full-body 3D avatar from text with expression control (applicable to realistic humans), you may run the command:
bash scripts/train_w_expr.sh "a chef dressed in white"
To create a full-body 3D avatar from text without expression control (applicable to most cases), you may run the command:
bash scripts/train_wo_expr.sh "Rapunzel in Tangled"
From our training script, you may notice that we split the two-stage training pipeline into 5 sub-stages, which helps with debugging and ablation analysis.
The whole training takes several hours on a single NVIDIA L40S GPU.
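Since each prompt takes hours, it may be convenient to queue several prompts and let them run one after another. The following is a minimal sketch that wraps the two training scripts above; the helper names `build_train_command` and `train_all` are hypothetical, and the only interface assumed is the one shown above (script path followed by a quoted prompt).

```python
import subprocess

TRAIN_SCRIPTS = {
    True: "scripts/train_w_expr.sh",    # with expression control
    False: "scripts/train_wo_expr.sh",  # without expression control
}


def build_train_command(prompt, with_expression):
    """Return the argv list for one training run."""
    return ["bash", TRAIN_SCRIPTS[with_expression], prompt]


def train_all(jobs, dry_run=True):
    """Run (or only print) one training command per (prompt, with_expression) job."""
    for prompt, with_expression in jobs:
        command = build_train_command(prompt, with_expression)
        if dry_run:
            print(" ".join(command[:2]), repr(prompt))
        else:
            subprocess.run(command, check=True)  # stop at the first failure


if __name__ == "__main__":
    train_all([
        ("a chef dressed in white", True),
        ("Rapunzel in Tangled", False),
    ])
```

Passing the prompt as a single list element avoids shell quoting issues; set `dry_run=False` to actually launch the runs from the repository root.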
Assuming you have downloaded the pre-trained 3D avatars and placed them correctly, you can run the following script to visualize the 3D avatars in their canonical poses:
bash scripts/inference_canonical.sh
The results are saved as images and videos in the respective model directories.
Assuming you have downloaded the pre-trained 3D avatars and placed them correctly, you can run the following scripts to animate them using the SMPL-X motion sequences stored in assets/motions/.
For 3D animation using motions from TalkSHOW (w/ expression control), you may run:
bash scripts/inference_talkshow.sh
The results are saved as images and videos in the respective model directories.
For 3D animation using motions from AIST++ (w/o expression control), you may run:
bash scripts/inference_aist.sh
The results are saved as images and videos in the respective model directories.
We provide an inference script for 2D human video reenactment. Please download our dataset first and place the zip file in datasets/Motion-X-ReEnact/. Once the pre-trained avatars and data are ready, you may run:
bash scripts/inference_reenact.sh
The results are saved as images and videos in the respective model directories.
To reenact your own video, 3D human pose and camera estimation are needed. We recommend using tram to extract SMPL and camera parameters and then running our code for reenactment. As a demonstration, we provide an example video and its tram-estimated parameters on HuggingFace. Once the pre-trained avatars and data are ready, you may run:
bash scripts/inference_tram.sh
The results are saved as images and videos in the respective model directories.
1. The generation results are not satisfactory and suffer from problems such as over-saturation, missing parts, and blurring.
DreamWaltz-G utilizes stable-diffusion-v1-5 and vanilla SDS for learning 3D representations, and thus inherits the defects of these methods. We recommend adopting more advanced diffusion models and score distillation techniques, such as ControlNeXt and ISM.
Even with a 2D diffusion model that supports face landmark control, learning accurate 3D expression control via score distillation remains challenging. The expression control of DreamWaltz-G benefits largely from SMPL-X. Therefore, when the face of the generated 3D avatar deviates significantly from the SMPL-X template, the expression control will be inaccurate.
Building on DreamWaltz-G, there are many possible further explorations: relightable 3D avatars; disentangled 3D avatars; physical 3D avatars; image-driven avatar creation; human-object interaction; automatic skeletal rigging; human video generation/reenactment; etc.
Please feel free to contact me if you have any questions, thoughts or opportunities for academic collaboration.
This repository is based on many amazing research works and open-source projects: gaussian-splatting, diffusers, stable-dreamfusion, latent-nerf, threestudio, Deformable-3D-Gaussians, diff-gaussian-rasterization, gaussian-mesh-splatting, SuGaR, smplx, etc. Thanks to all the authors for their selfless contributions to the community!
If you find this repository helpful for your work, please consider citing it as follows:
@article{huang2024dreamwaltz-g,
title={{DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion}},
author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Zha, Zheng-Jun and Zhang, Lei and Liu, Xihui},
year={2024},
eprint={2409.17145},
archivePrefix={arXiv},
primaryClass={cs.CV},
}
@inproceedings{huang2024dreamwaltz,
title={{DreamWaltz: Make a Scene with Complex 3D Animatable Avatars}},
author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Cao, He and Qi, Xianbiao and Shi, Yukai and Zha, Zheng-Jun and Zhang, Lei},
booktitle={Advances in Neural Information Processing Systems},
pages={4566--4584},
year={2023}
}