DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

Yukun Huang, Jianan Wang, Ailing Zeng, Zheng-Jun Zha, Lei Zhang, Xihui Liu

🪄 Introduction

DreamWaltz-G is a novel learning framework for text-driven 3D avatar creation and expressive whole-body animation. Its core design combines a hybrid 3D Gaussian avatar representation with skeleton-guided 2D diffusion. Our method supports diverse applications such as shape control and editing, 2D human video reenactment, and 3D Gaussian scene composition.

📢 News

[2024-11-20] 🔥[New feature] Reenact arbitrary in-the-wild video with our avatars! (thanks to @gt732)
[2024-10-15] 🔥Release the training and inference code.
[2024-10-15] 🔥Release the pre-trained models of 12 full-body 3D Gaussian avatars ready for inference.
[2024-10-15] 🔥Release a dataset for 2D human video reenactment. It comprises 19 human motion scenes with original videos, inpainted videos (where humans are removed), SMPL-X motions, and camera parameters.
[2024-09-26] 📢Publish the arXiv preprint and update the project page.

⚙️ Setup

Please follow the instructions below to get the code and install dependencies.

Clone this repository and navigate to DreamWaltz-G folder:

git clone --branch main --single-branch https://github.com/Yukun-Huang/DreamWaltz-G.git
cd DreamWaltz-G

Install packages. Note that requirements.txt is automatically generated and may not be accurate. We recommend using the provided script for installation:

bash scripts/install.sh

Activate the installed conda environment:

conda activate dreamwaltz

[Optional] As in DreamWaltz, the CUDA extensions for Instant-NGP (heavily borrowed from stable-dreamfusion and latent-nerf) are required and will be built at runtime. If you prefer to build them manually, the following commands may be useful:

python -m core.nerf.freqencoder.backend
python -m core.nerf.gridencoder.backend
python -m core.nerf.raymarching.rgb.backend
# python -m core.nerf.raymarching.latent.backend  # uncomment this if you want to use Latent-NeRF

🤖 Models

1. Human Templates (Required for Training and Inference)

Before running the code, you need to prepare the human template models: SMPL-X, FLAME, and VPoser. Please download them from the official project pages: https://smpl-x.is.tue.mpg.de/ and https://flame.is.tue.mpg.de/, then organize them following the structure below:

external
└── human_templates
    ├── smplx
    │   ├── SMPLX_NEUTRAL_2020.npz
    │   ├── FLAME_vertex_ids.npy
    │   ├── MANO_vertex_ids.pkl
    │   └── smplx_vert_segmentation.json
    ├── flame
    │   └── FLAME_masks.pkl
    └── vposer
        └── v2.0
            ├── snapshots
            │   ├── V02_05_epoch=08_val_loss=0.03.ckpt
            │   └── V02_05_epoch=13_val_loss=0.03.ckpt
            ├── V02_05.yaml
            └── V02_05.log

If you already have these models on your machine, you can simply modify the path in configs/path.py to link to them.
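
Before training or inference, it can help to confirm that the template files are where the tree above puts them. The following is a minimal sketch, not part of the repository: the helper name `missing_template_files` is hypothetical, and the file list is copied from the tree above (whether every listed file is strictly needed at runtime is an assumption).

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
REQUIRED_TEMPLATE_FILES = [
    "smplx/SMPLX_NEUTRAL_2020.npz",
    "smplx/FLAME_vertex_ids.npy",
    "smplx/MANO_vertex_ids.pkl",
    "smplx/smplx_vert_segmentation.json",
    "flame/FLAME_masks.pkl",
    "vposer/v2.0/snapshots/V02_05_epoch=08_val_loss=0.03.ckpt",
    "vposer/v2.0/snapshots/V02_05_epoch=13_val_loss=0.03.ckpt",
    "vposer/v2.0/V02_05.yaml",
    "vposer/v2.0/V02_05.log",
]


def missing_template_files(root="external/human_templates"):
    """Return the listed files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in REQUIRED_TEMPLATE_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_template_files()
    if missing:
        print("Missing human template files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All human template files found.")
```

If your models live elsewhere, pass that directory as `root`; this check does not read configs/path.py.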

2. Pre-trained Instant-NGP (Required for Training)

DreamWaltz-G adopts a two-stage NeRF→3DGS training pipeline, where the NeRF is initialized with SMPL-X before training. We provide these pre-trained NeRFs (Instant-NGP, specifically) on HuggingFace. You may download and organize them following the structure below:

external
└── human_templates
    ├── instant-ngp
    │   ├── adult_neutral
    │   │   ├── step_005000.pth
    │   │   └── 005000_image.mp4
    ...

Alternatively, if you want to train them yourself, you can simply run the script:

bash scripts/pretrain_nerf.sh
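
Whether you download the pre-trained NeRF or train it yourself, a quick way to confirm that a checkpoint is in place is to look for the newest `step_*.pth` file. This is a small sketch under the assumption that checkpoints follow the `step_005000.pth` naming shown in the tree above; `latest_checkpoint` is a hypothetical helper, not a function from the repository.

```python
import re
from pathlib import Path


def latest_checkpoint(model_dir):
    """Return the step_*.pth file with the highest step number, or None if there is none."""
    pattern = re.compile(r"step_(\d+)\.pth$")
    best = None  # (step, path) of the newest checkpoint seen so far
    for path in Path(model_dir).glob("step_*.pth"):
        match = pattern.search(path.name)
        if match is None:
            continue
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return None if best is None else best[1]


if __name__ == "__main__":
    print(latest_checkpoint("external/human_templates/instant-ngp/adult_neutral"))
```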

3. Pre-trained 3D Avatars (Ready for Inference)

We provide the pre-trained weights of 12 full-body 3D Gaussian avatars, ready for 3D animation and 2D video reenactment without training. You may download them from HuggingFace and organize them following the structure below:

outputs
├── w_expr
│   ├── a_chef_dressed_in_white
│   ├── a_gardener_in_overalls_and_a_wide-brimmed_hat
│   └── ...
└── wo_expr
    ├── a_clown
    ├── black_widow
    └── ...

Unfortunately, due to limitations of DreamWaltz-G and SMPL-X, not all of these avatars support expression control. Specifically, the avatars in w_expr support expression control (e.g., realistic humans), while the avatars in wo_expr do not (e.g., fictional characters).
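
Since expression support is encoded by the folder an avatar sits in, a short script can list the downloaded avatars together with that flag. This is a minimal sketch that assumes only the `outputs/w_expr` and `outputs/wo_expr` layout shown above; `list_avatars` is a hypothetical helper.

```python
from pathlib import Path


def list_avatars(outputs_dir="outputs"):
    """Map avatar folder name -> True if it is under w_expr, False if under wo_expr."""
    avatars = {}
    for group, has_expression in (("w_expr", True), ("wo_expr", False)):
        group_dir = Path(outputs_dir) / group
        if not group_dir.is_dir():
            continue  # this group has not been downloaded
        for avatar_dir in sorted(group_dir.iterdir()):
            if avatar_dir.is_dir():
                avatars[avatar_dir.name] = has_expression
    return avatars


if __name__ == "__main__":
    for name, has_expression in list_avatars().items():
        print(f"{name}: expression control = {has_expression}")
```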

💼 Datasets

As a score distillation-based method, DreamWaltz-G is supervised by a pre-trained 2D diffusion model and requires no training data. The data introduced below is only used for inference.

1. SMPL(-X) Motion Datasets for Expressive 3D Animation

We provide data loaders to read SMPL-X motion sequences from four publicly available human motion datasets: Motion-X, TalkSHOW, AIST++, and 3DPW. These motion data can be used to animate our 3D avatars for various demos.

To use these datasets, you may download them from their official websites and organize them according to the following structure (no need to unzip):

datasets
├── 3DPW
│   ├── readme_and_demo.zip
│   ├── sequenceFiles.zip
│   └── SMPL-X.zip
├── AIST++
│   ├── 20210308_cameras.zip
│   └── 20210308_motions.zip
├── Motion-X
│   └── motionx_smplx.zip
└── TalkShow
    ├── chemistry_pkl_tar.tar.gz
    ├── conan_pkl_tar.tar.gz
    └── ...

For more details, please refer to our code in data/human/.
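
Because the archives stay compressed, you may want to peek inside one to confirm the download is intact. The sketch below uses only the Python standard library to list member names without extracting anything. It is an illustration, not how the loaders in data/human/ read the data, and `list_archive_members` is a hypothetical helper.

```python
import tarfile
import zipfile


def list_archive_members(archive_path, limit=10):
    """Return up to `limit` member names from a .zip or .tar(.gz) archive without extracting it."""
    archive_path = str(archive_path)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            return zf.namelist()[:limit]
    if tarfile.is_tarfile(archive_path):
        # The default mode detects gzip and other compression automatically.
        with tarfile.open(archive_path) as tf:
            return tf.getnames()[:limit]
    raise ValueError(f"Unsupported archive format: {archive_path}")


if __name__ == "__main__":
    print(list_archive_members("datasets/AIST++/20210308_motions.zip"))
```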

2. Our Video-Motion Dataset for Human Video Reenactment

We build a new dataset from Motion-X for 2D human video reenactment. It comprises 19 human motion scenes with original videos, inpainted videos (where humans are removed), SMPL-X motions, and camera parameters. You may download this dataset from HuggingFace and place it according to the structure below (no need to unzip):

datasets
├── Motion-X-ReEnact
│   └── Motion-X-ReEnact.zip
...

Based on this dataset, the generated 3D avatars can be projected onto 2D inpainted videos to achieve motion reenactment. We hope that this dataset can assist future work in evaluating the human video reenactment task.
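
Conceptually, placing a rendered avatar on an inpainted frame is an alpha blend of the avatar over the background. The sketch below illustrates that standard "over" operator on plain Python lists. It is an assumption about the general idea, not the repository's rendering code, and the function names are hypothetical.

```python
def composite_over(avatar_rgb, avatar_alpha, background_rgb):
    """Blend one avatar pixel over one background pixel.

    Colors are (r, g, b) tuples in [0, 1]; avatar_alpha is the rendered opacity in [0, 1].
    """
    return tuple(
        avatar_alpha * fg + (1.0 - avatar_alpha) * bg
        for fg, bg in zip(avatar_rgb, background_rgb)
    )


def composite_frame(avatar_frame, alpha_mask, background_frame):
    """Apply composite_over to every pixel of same-sized frames given as nested lists."""
    return [
        [composite_over(fg, a, bg) for fg, a, bg in zip(fg_row, a_row, bg_row)]
        for fg_row, a_row, bg_row in zip(avatar_frame, alpha_mask, background_frame)
    ]
```

Where the avatar is fully opaque the background is replaced, and where it is fully transparent the inpainted video shows through unchanged.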

💃 Training

To create a full-body 3D avatar from texts with expression control (applicable to realistic humans), you may run the command:

bash scripts/train_w_expr.sh "a chef dressed in white"

To create a full-body 3D avatar from texts without expression control (applicable to most cases), you may run the command:

bash scripts/train_wo_expr.sh "Rapunzel in Tangled"

From our training script, you may notice that we split the two-stage training pipeline into 5 sub-stages, which helps with debugging and ablation analysis.

The whole training takes several hours on a single NVIDIA L40S GPU.
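
To train several avatars one after another, the two scripts above can be driven from a small Python wrapper. This is a sketch that assumes only the script paths and the single prompt argument shown above; `build_train_command` and `train_all` are hypothetical helpers, and `dry_run=True` prints the commands instead of launching them.

```python
import shlex
import subprocess


def build_train_command(prompt, with_expression):
    """Return the argv list for one training run using the scripts named above."""
    script = "scripts/train_w_expr.sh" if with_expression else "scripts/train_wo_expr.sh"
    return ["bash", script, prompt]


def train_all(jobs, dry_run=True):
    """Run (prompt, with_expression) jobs sequentially; only print them when dry_run is True."""
    for prompt, with_expression in jobs:
        command = build_train_command(prompt, with_expression)
        if dry_run:
            print(shlex.join(command))
        else:
            # check=True stops the batch if a training run fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    train_all([
        ("a chef dressed in white", True),
        ("Rapunzel in Tangled", False),
    ])
```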

🕺 Inference

1. Avatars in Canonical Pose

Assuming you have downloaded the pre-trained 3D avatars and placed them correctly, you can run the following script to visualize the 3D avatars in their canonical poses:

bash scripts/inference_canonical.sh

The results are saved as images and videos in the respective model directories.

canonical.mp4

2. Expressive 3D Animation

Assuming you have downloaded the pre-trained 3D avatars and placed them correctly, you can run the following scripts to animate them using the SMPL-X motion sequences stored in assets/motions/.

For 3D animation using motions from TalkSHOW (w/ expression control), you may run:

bash scripts/inference_talkshow.sh

The results are saved as images and videos in the respective model directories.

aist.mp4

For 3D animation using motions from AIST++ (w/o expression control), you may run:

bash scripts/inference_aist.sh

The results are saved as images and videos in the respective model directories.

talkshow.mp4

3. Human Video Reenactment

We provide an inference script for 2D human video reenactment. Please download our dataset first and place the zip file in datasets/Motion-X-ReEnact/. Once the pre-trained avatars and data are ready, you may run:

bash scripts/inference_reenact.sh

The results are saved as images and videos in the respective model directories.

motionx-reenact.mp4

4. Human Video Reenactment for In-the-wild Video

To reenact your own video, 3D human pose and camera estimation are needed. We recommend using tram to extract SMPL and camera parameters, and then using our code for reenactment. As a demonstration, we provide a video example and its tram-estimated parameters on HuggingFace. Once the pre-trained avatars and data are ready, you may run:

bash scripts/inference_tram.sh

The results are saved as images and videos in the respective model directories.

tram.mp4

🗣️ Discussions

1. The generation results are not satisfactory and suffer from problems such as over-saturation, partial missing, and blurring.

DreamWaltz-G utilizes stable-diffusion-v1-5 and vanilla SDS for learning 3D representations, and thus inherits the defects of these methods. We recommend adopting more advanced diffusion models and score distillation techniques, such as ControlNeXt and ISM.
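
For reference, the vanilla SDS gradient in its commonly cited form (as introduced in DreamFusion) is written out below. The notation is an assumption for illustration: \theta denotes the 3D representation parameters, x = g(\theta) a rendered image, y the text prompt, c the skeleton condition, \hat{\epsilon}_\phi the diffusion model's noise prediction, and w(t) a timestep weighting. The exact conditioning and weighting used in DreamWaltz-G may differ.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y, c, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, I)
```

Because the update follows the residual between predicted and injected noise, artifacts tied to the guidance model or to this objective, such as over-saturation, tend to carry over to the 3D result.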

2. Expression control is not accurate, especially for fictional characters.

Even with a 2D diffusion model that supports face landmark control, learning accurate 3D expression control via score distillation remains challenging. The expression control of DreamWaltz-G benefits largely from SMPL-X. Therefore, when the face of the generated 3D avatar deviates significantly from the SMPL-X template, the expression control will be inaccurate.

3. Related topics and future explorations.

Building on DreamWaltz-G, there are many possible further explorations: relightable 3D avatars; disentangled 3D avatars; physical 3D avatars; image-driven avatar creation; human-object interaction; automatic skeletal rigging; human video generation/reenactment; etc.

4. The world coordinate and camera coordinate systems for DreamWaltz-G.

Please feel free to contact me if you have any questions, thoughts or opportunities for academic collaboration.

👏 Acknowledgement

This repository is based on many amazing research works and open-source projects: gaussian-splatting, diffusers, stable-dreamfusion, latent-nerf, threestudio, Deformable-3D-Gaussians, diff-gaussian-rasterization, gaussian-mesh-splatting, SuGaR, smplx, etc. Thanks to all the authors for their selfless contributions to the community!

😉 Citation

If you find this repository helpful for your work, please consider citing it as follows:

@article{huang2024dreamwaltz-g,
  title={{DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion}},
  author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Zha, Zheng-Jun and Zhang, Lei and Liu, Xihui},
  year={2024},
  eprint={2409.17145},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}

@inproceedings{huang2024dreamwaltz,
  title={{DreamWaltz: Make a Scene with Complex 3D Animatable Avatars}},
  author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Cao, He and Qi, Xianbiao and Shi, Yukai and Zha, Zheng-Jun and Zhang, Lei},
  booktitle={Advances in Neural Information Processing Systems},
  pages={4566--4584},
  year={2023}
}
