Facial expressions and hand motions are essential for expressing our emotions and interacting with the world. Nevertheless, most 3D human avatars modeled from a casually captured video support only body motions, without facial expressions or hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and 3D Gaussian Splatting (3DGS). The main challenges are 1) the limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity in the video makes animation with novel facial expressions and poses non-trivial. In addition, the absence of 3D observations can cause significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these challenges, we introduce a hybrid representation of the mesh and 3D Gaussians. Our hybrid representation treats each 3D Gaussian as a vertex on the surface, with pre-defined connectivity information (i.e., triangle faces) between the Gaussians following the mesh topology of SMPL-X. This makes ExAvatar animatable with novel facial expressions, driven by the facial expression space of SMPL-X. In addition, by using connectivity-based regularizers, we significantly reduce artifacts in novel facial expressions and poses.
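To make the connectivity-based idea concrete, below is a minimal PyTorch sketch (not the authors' code) of one plausible regularizer: edges are derived from the SMPL-X triangle faces, and a per-Gaussian attribute (here, hypothetical learned offsets from the template surface) is encouraged to vary smoothly across those edges. The function names, the offset parameterization, and the `smplx_model.faces_tensor` accessor in the usage note are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def edges_from_faces(faces: torch.LongTensor) -> torch.LongTensor:
    """Collect unique undirected edges from (F, 3) triangle faces."""
    e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    e = torch.sort(e, dim=1).values   # canonical vertex order per edge
    return torch.unique(e, dim=0)     # drop edges shared by adjacent faces

def connectivity_loss(attr: torch.Tensor, edges: torch.LongTensor) -> torch.Tensor:
    """Penalize differences of a per-Gaussian attribute (e.g., learned
    3D offsets from the SMPL-X surface) across mesh edges, so that
    neighboring Gaussians deform together under novel poses."""
    diff = attr[edges[:, 0]] - attr[edges[:, 1]]
    return (diff ** 2).sum(dim=-1).mean()

# Usage (shapes are hypothetical): `offsets` is an (N, 3) learnable tensor
# holding one 3D offset per Gaussian/vertex of the SMPL-X template.
# faces = smplx_model.faces_tensor              # (F, 3) SMPL-X topology
# loss = connectivity_loss(offsets, edges_from_faces(faces))
```

Because every Gaussian is tied to a template vertex, the same edge list can regularize any per-Gaussian quantity (offsets, scales, colors), which is one way such a hybrid mesh-plus-Gaussian representation can suppress artifacts on body parts unseen in the training video.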