awesome-avatar

This is a repository for organizing papers, codes and other resources related to the topic of Avatar (talking-face and talking-body).

🔆 This project is still ongoing; pull requests are welcome!

If you have any suggestions (missing papers, new papers, key researchers, or typos), please feel free to open a pull request.

News

  • 2024.09.07: added ASR and TTS tools
  • 2024.08.24: added background material on image/video generation
  • 2024.08.24: re-organized the paper lists into tables
  • 2024.08.24: added works on full-body avatar synthesis

TO DO LIST

  • Main paper list
  • Researchers list
  • Toolbox for avatar
  • Add paper links
  • Add paper notes
  • Add code links where available
  • Add project pages where available
  • Datasets and metrics
  • Related links

Researchers and labs

  1. NVIDIA Research
  2. Aliaksandr Siarohin @ Snap Research
  3. Ziwei Liu @ Nanyang Technological University
  4. Xiaodong Cun @ Tencent AI Lab
  5. Max Planck Institute for Informatics

Papers

Image and video generation

| Model | Paper | Blog | Codebase | Note |
|---|---|---|---|---|
| StyleGANv3 | Alias-Free Generative Adversarial Networks, NVIDIA, NeurIPS 2021 | The Evolution of StyleGAN: Introduction | Code | high-fidelity face generation |
| EG3D | EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks, NVIDIA, CVPR 2022 | | Code | 3D-aware GAN |
| Stable Diffusion | High-Resolution Image Synthesis with Latent Diffusion Models, Heidelberg University, CVPR 2022 | What are Diffusion Models? | Code | diverse and high-quality images |
| Stable Video Diffusion | Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets, Stability AI, arXiv 2023 | Diffusion Models for Video Generation | Code | |
| DiT | Scalable Diffusion Models with Transformers, Meta, ICCV 2023 | Diffusion Transformed | Code | magic behind OpenAI Sora |
| VQ-VAE | Neural Discrete Representation Learning, DeepMind, NIPS 2017 | OpenAI's DALL-E 2 and DALL-E 1 Explained | | magic behind OpenAI DALL-E |
| NeRF | NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, UC Berkeley, ECCV 2020 | NeRF Explosion 2020 | Code | 3D synthesis via volume rendering |
| 3DGS | 3D Gaussian Splatting for Real-Time Radiance Field Rendering, Inria, SIGGRAPH 2023 | A Comprehensive Overview of Gaussian Splatting | Code | real-time 3D rendering |

3D Avatar (face+body)

| Conference | Paper | Affiliation | Codebase | Notes |
|---|---|---|---|---|
| CVPR 2021 | Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors | Tsinghua University | Dataset | |
| ECCV 2022 | HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling | Shanghai Artificial Intelligence Laboratory | Dataset | |
| SIGGRAPH 2023 | AvatarReX: Real-time Expressive Full-body Avatars | Tsinghua University | Dataset | |
| arXiv 2024 | A Survey on 3D Human Avatar Modeling - From Reconstruction to Generation | The University of Hong Kong | | |
| arXiv 2024 | From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations | Meta Reality Labs Research | Code | conversational avatar |
| CVPR 2024 | Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling | Tsinghua University | Code | |
| CVPR 2024 | 4K4D: Real-Time 4D View Synthesis at 4K Resolution | Zhejiang University | Code | real-time synthesis with 3DGS |

2D talking-face synthesis

| Conference | Paper | Affiliation | Codebase | Training Code | Notes |
|---|---|---|---|---|---|
| MM 2020 | Wav2Lip: Accurately Lip-sync Videos to Any Speech | International Institute of Information Technology (IIIT) Hyderabad, India | Code | | most accurate lip-sync model, low video quality (96x96), pre-trained on ~180 hours of video from LRS2 |
| MM 2021 | Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis | Tsinghua University | Code | | |
| CVPR 2021 | Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | The Chinese University of Hong Kong | Code | | contrastive learning on audio-lip |
| ICCV 2021 | PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering | Peking University | Code | | |
| ECCV 2022 | StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN | Tsinghua University | Code | | high-fidelity synthesis via StyleGAN |
| SIGGRAPH Asia 2022 | VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild | Xidian University | Code | | |
| AAAI 2023 | DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video | Virtual Human Group, Netease Fuxi AI Lab | Code | | accurate lip-sync and high-quality synthesis (256x256) |
| CVPR 2023 | SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation | Xi'an Jiaotong University | Code | | Note |
| arXiv 2023 | DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models | Tsinghua University | Code | | diffusion |
| Github repo | MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting | Tencent TMElyralab | Code | | |
| arXiv 2024 | LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control | Kuaishou Technology | Code | | face reenactment with micro-expressions |
| arXiv 2024 | EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions | Ant Group | Code | | accurate lip-sync on Chinese speakers, diffusion, pre-trained on 540 hours of cleaned video data (collected from the internet) |
| arXiv 2024 | Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation | Fudan University | Code | | accurate lip-sync, diffusion, pre-trained on 264 hours of cleaned video data (155 hours from the internet and 9 hours from HDTF) |
| arXiv 2024 | Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency | Zhejiang University and ByteDance | | | expressive animation driven by audio only, pre-trained on 160 hours of cleaned video data (collected from the internet) |

3D talking-face synthesis

| Conference | Paper | Affiliation | Codebase | Notes |
|---|---|---|---|---|
| ICCV 2021 | AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | University of Science and Technology of China | Code | |
| ECCV 2022 | Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis | Tsinghua University | Code | |
| ICLR 2023 | GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | Zhejiang University | Code | |
| ICCV 2023 | Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis | Beihang University | Code | |
| arXiv 2023 | GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation | Zhejiang University | Code | |
| CVPR 2024 | SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis | Renmin University of China | Code | |
| ECCV 2024 | TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting | Beihang University | Code | |

Talking-body synthesis

Pose2video

| Conference | Paper | Affiliation | Codebase | Notes |
|---|---|---|---|---|
| NeurIPS 2018 | Video-to-Video Synthesis | NVIDIA | Code | |
| ICCV 2019 | Everybody Dance Now | UC Berkeley | Code | |
| arXiv 2023 | Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | Alibaba Group | Code | |
| CVPR 2024 | MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model | National University of Singapore | Code | |
| arXiv 2024 | Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance | Nanjing University | Code | |
| Github repo | MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising | Tencent TMElyralab | Code | |
| Github repo | MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation | Tencent | Code | |
| arXiv 2024 | ControlNeXt: Powerful and Efficient Control for Image and Video Generation | The Chinese University of Hong Kong | Code | stable video diffusion |
| arXiv 2024 | CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention | Zhejiang University and ByteDance | | pre-trained on 200 hours of video data covering more than 10k unique identities |

Datasets

Talking-face

Audio-Visual Datasets for English Speakers

| Dataset | Environment | Year | Resolution | Subjects | Duration | Sentences |
|---|---|---|---|---|---|---|
| VoxCeleb1 | Wild | 2017 | 360p~720p | 1251 | 352 hours | 100k |
| VoxCeleb2 | Wild | 2018 | 360p~720p | 6112 | 2442 hours | 1128k |
| HDTF | Wild | 2020 | 720p~1080p | 300+ | 15.8 hours | |
| LSP | Wild | 2021 | 720p~1080p | 4 | 18 minutes | 100k |
Audio-Visual Datasets for Chinese Speakers

| Dataset | Environment | Year | Resolution | Subjects | Duration | Sentences |
|---|---|---|---|---|---|---|
| CMLR | Lab | 2019 | | 11 | | 102k |
| MAVD | Lab | 2023 | 1920x1080 | 64 | 24 hours | 12k |
| CN-Celeb | Wild | 2020 | | 3000 | 1200 hours | |
| CN-Celeb-AV | Wild | 2023 | | 1136 | 660 hours | |
| CN-CVS | Wild | 2023 | | 2500+ | 300+ hours | |

Metrics

Talking-face

Lip-Sync

| Metric | Description | Code/Paper |
|---|---|---|
| LMD↓ | Mouth landmark distance | |
| MA↑ | Intersection-over-Union (IoU) between the predicted mouth area and the ground-truth area | |
| Sync↑ | Confidence score from SyncNet | wav2lip |
| LSE-C↑ | Lip Sync Error - Confidence | wav2lip |
| LSE-D↓ | Lip Sync Error - Distance | wav2lip |
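For intuition, LMD is just the mean Euclidean distance between corresponding mouth landmarks in a generated frame and its ground-truth frame. A minimal pure-Python sketch (the landmark coordinates below are made-up toy values; real pipelines extract them with a face-landmark detector):

```python
import math

def mouth_landmark_distance(pred_landmarks, gt_landmarks):
    """Mean Euclidean distance between corresponding (x, y) mouth landmarks.

    pred_landmarks / gt_landmarks: lists of (x, y) tuples for one frame.
    Lower is better (LMD↓).
    """
    assert len(pred_landmarks) == len(gt_landmarks)
    total = 0.0
    for (px, py), (gx, gy) in zip(pred_landmarks, gt_landmarks):
        total += math.hypot(px - gx, py - gy)
    return total / len(pred_landmarks)

# toy example with three landmarks
pred = [(10.0, 20.0), (12.0, 22.0), (14.0, 24.0)]
gt = [(10.0, 21.0), (12.0, 22.0), (14.0, 20.0)]
print(mouth_landmark_distance(pred, gt))  # mean of distances 1, 0, 4 -> 5/3
```

In practice the per-frame values are averaged over the whole generated video.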
Image Quality (identity preservation)

| Metric | Description | Code/Paper |
|---|---|---|
| MAE↓ | Mean Absolute Error | mmagic |
| MSE↓ | Mean Squared Error | mmagic |
| PSNR↑ | Peak Signal-to-Noise Ratio | mmagic |
| SSIM↑ | Structural Similarity | mmagic |
| FID↓ | Fréchet Inception Distance | mmagic |
| IS↑ | Inception Score | mmagic |
| NIQE↓ | Natural Image Quality Evaluator | mmagic |
| CSIM↑ | Cosine similarity of identity embeddings | InsightFace |
| CPBD↑ | Cumulative probability blur detection | python-cpbd |
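Several of these metrics follow directly from their definitions. As a rough illustration, here is a minimal pure-Python sketch of MSE, PSNR, and CSIM over flat pixel/embedding lists (toy values only; mmagic and InsightFace provide the reference implementations):

```python
import math

def mse(img_a, img_b):
    """Mean Squared Error between two equal-length pixel sequences (MSE↓)."""
    assert len(img_a) == len(img_b)
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer images (PSNR↑)."""
    err = mse(img_a, img_b)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / err)

def csim(emb_a, emb_b):
    """Cosine similarity between two identity embeddings (CSIM↑)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

# toy 4-pixel "images" and 2-dim "embeddings"
a, b = [0, 128, 255, 64], [0, 120, 255, 80]
print(mse(a, b))                      # (0 + 64 + 0 + 256) / 4 = 80.0
print(psnr(a, b))                     # ~29.1 dB
print(csim([1.0, 2.0], [2.0, 4.0]))   # parallel vectors -> 1.0
```

Real evaluations compute these per frame over full H×W×C arrays and average across the video.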
Diversity

| Metric | Description | Code/Paper |
|---|---|---|
| Diversity of head motions↑ | Standard deviation of the head-motion feature embeddings extracted from the generated frames with Hopenet (Ruiz et al., 2018) | SadTalker |
| Beat Align Score↑ | Alignment between the audio and the generated head motions, computed as in Bailando (Siyao et al., 2022) | SadTalker |
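The head-motion diversity score boils down to a standard deviation over per-frame motion embeddings. A minimal sketch under the assumption that the embeddings are plain feature vectors (real evaluations extract them with Hopenet; the frames below are hypothetical stand-ins):

```python
import math

def motion_diversity(embeddings):
    """Per-dimension standard deviation of frame embeddings, averaged over dims.

    embeddings: list of equal-length feature vectors, one per generated frame.
    A larger value indicates more varied head motion (Diversity↑).
    """
    n_frames, n_dims = len(embeddings), len(embeddings[0])
    diversity = 0.0
    for d in range(n_dims):
        col = [frame[d] for frame in embeddings]
        mean = sum(col) / n_frames
        var = sum((v - mean) ** 2 for v in col) / n_frames
        diversity += math.sqrt(var)
    return diversity / n_dims

# hypothetical 3-frame, 2-dim embeddings: dim 0 varies, dim 1 is constant
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
print(motion_diversity(frames))
```

A video whose head pose barely changes yields embeddings that are nearly identical across frames, so the score approaches zero.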

Toolbox

  1. A general toolbox for AIGC, including common metrics and models https://github.com/open-mmlab/mmagic
  2. face3d: Python tools for processing 3D face https://github.com/yfeng95/face3d
  3. 3DMM model fitting using Pytorch https://github.com/ascust/3DMM-Fitting-Pytorch
  4. OpenFace: a facial behavior analysis toolkit https://github.com/TadasBaltrusaitis/OpenFace
  5. autocrop: Automatically detects and crops faces from batches of pictures https://github.com/leblancfg/autocrop
  6. OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation https://github.com/CMU-Perceptual-Computing-Lab/openpose
  7. GFPGAN: Practical Algorithm for Real-world Face Restoration https://github.com/TencentARC/GFPGAN
  8. CodeFormer: Robust Blind Face Restoration https://github.com/sczhou/CodeFormer
  9. metahuman-stream: Real-time interactive streaming digital human https://github.com/lipku/metahuman-stream
  10. EasyVolcap: a PyTorch library for accelerating neural volumetric video research https://github.com/zju3dv/EasyVolcap
  11. 3D Model in gradio https://www.gradio.app/guides/how-to-use-3D-model-component

Automatic Speech Recognition (ASR)

  1. BELLE-2/Belle-whisper-large-v3-zh https://huggingface.co/BELLE-2/Belle-whisper-large-v3-zh
  2. SenseVoice (multilingual) https://github.com/FunAudioLLM/SenseVoice 👍👍

Text to Speech (TTS)

  1. CosyVoice, Alibaba Tongyi SpeechTeam https://github.com/FunAudioLLM/CosyVoice 👍👍
  2. FireRedTTS, FireRedTeam https://github.com/FireRedTeam/FireRedTTS
  3. GPT-SoVITS https://github.com/RVC-Boss/GPT-SoVITS?tab=readme-ov-file

Speech to Speech (GPT-4o-style)

  1. Mini-Omni, Tsinghua University https://github.com/gpt-omni/mini-omni
  2. Speech To Speech, HuggingFace https://github.com/huggingface/speech-to-speech

Related Links

If you are interested in avatars and digital humans, we also recommend checking out these related collections: