Reconstructing photo-realistic animatable human avatars from monocular videos remains a challenging problem in computer vision and graphics. Methods that represent the human body with 3D Gaussians have recently emerged, offering faster optimization and real-time rendering. However, because they ignore human-body semantic information, which encodes the intrinsic structure of and connections among body parts, they fail to reconstruct fine details of dynamic human avatars. To address this issue, we propose SG-GS, which combines semantics-embedded 3D Gaussians, skeleton-driven rigid deformation, and non-rigid cloth-dynamics deformation to create photo-realistic animatable human avatars from monocular videos. We design a Semantic Human-Body Annotator (SHA) that exploits SMPL's semantic prior for efficient body-part semantic labeling; the resulting labels guide the optimization of the Gaussians' semantic attributes. To overcome the limited receptive field of point-level MLPs for capturing local features, we further propose a 3D network that integrates geometric and semantic associations for human avatar deformation. We additionally introduce three strategies to improve the semantic accuracy of the 3D Gaussians and the rendering quality: semantic projection with 2D regularization, semantic-guided density regularization, and semantic-aware regularization with neighborhood consistency. Extensive experiments demonstrate that SG-GS achieves state-of-the-art geometry and appearance reconstruction performance.
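To make two of the ideas above concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it labels each Gaussian with the body part of its nearest SMPL vertex (a simple stand-in for the SHA's use of SMPL's semantic prior) and applies a neighborhood-consistency regularizer that pulls each Gaussian's semantic distribution toward the average of its k nearest spatial neighbors. All function names, the value of k, and the brute-force k-NN are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

NUM_PARTS = 24  # SMPL's standard body-part count

def label_gaussians_by_smpl(gaussian_xyz, smpl_verts, smpl_vert_parts):
    """Assign each Gaussian the body-part label of its nearest SMPL vertex
    (hypothetical stand-in for the Semantic Human-Body Annotator).

    gaussian_xyz:    (N, 3) Gaussian centers
    smpl_verts:      (V, 3) posed SMPL vertex positions
    smpl_vert_parts: (V,)   integer part label per SMPL vertex
    """
    dists = torch.cdist(gaussian_xyz, smpl_verts)   # (N, V) pairwise distances
    nearest = dists.argmin(dim=1)                   # (N,) index of closest vertex
    return smpl_vert_parts[nearest]                 # (N,) part label per Gaussian

def neighborhood_consistency_loss(gaussian_xyz, semantic_logits, k=5):
    """Semantic-aware regularization with neighborhood consistency (sketch):
    penalize divergence between each Gaussian's semantic distribution and the
    mean distribution of its k nearest spatial neighbors."""
    dists = torch.cdist(gaussian_xyz, gaussian_xyz)        # (N, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop self (col 0)
    probs = F.softmax(semantic_logits, dim=-1)             # (N, P)
    neighbor_mean = probs[knn].mean(dim=1)                 # (N, P)
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(neighbor_mean.clamp_min(1e-8).log(), probs,
                    reduction="batchmean")
```

In training, the hard labels from `label_gaussians_by_smpl` could supervise `semantic_logits` (e.g., via cross-entropy), with the consistency loss acting as an auxiliary term that smooths semantic attributes across nearby Gaussians.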