EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting to segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods on challenging in-the-wild videos, and we also qualitatively demonstrate the high quality of the reconstructed models.
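The abstract describes separating Gaussians into a static background set and a dynamic object set, then tracking the object's rigid motion over time. The snippet below is only a minimal sketch of that idea under simplifying assumptions: a bare point-based stand-in for 3D Gaussians and a known per-clip SE(3) pose. The class and function names here are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSet:
    """Minimal stand-in for a set of 3D Gaussians (centers and rotations only)."""
    means: np.ndarray      # (N, 3) Gaussian centers
    rotations: np.ndarray  # (N, 3, 3) per-Gaussian rotation matrices

def split_by_mask(gaussians: GaussianSet, dynamic_mask: np.ndarray):
    """Split Gaussians into static background and dynamic object sets,
    given a boolean mask marking which Gaussians belong to the moving object."""
    obj = GaussianSet(gaussians.means[dynamic_mask], gaussians.rotations[dynamic_mask])
    bg = GaussianSet(gaussians.means[~dynamic_mask], gaussians.rotations[~dynamic_mask])
    return bg, obj

def apply_rigid_motion(obj: GaussianSet, R: np.ndarray, t: np.ndarray) -> GaussianSet:
    """Move the object Gaussians by one rigid transform (R, t), as one would
    per clip/timestep when tracking rigid object motion."""
    return GaussianSet(means=obj.means @ R.T + t,
                       rotations=R[None] @ obj.rotations)

# Toy usage: 5 Gaussians, the last 2 belong to a moving object (hypothetical data).
gauss = GaussianSet(np.random.rand(5, 3), np.tile(np.eye(3), (5, 1, 1)))
mask = np.array([False, False, False, True, True])
background, obj = split_by_mask(gauss, mask)

# Hypothetical per-clip pose: rotate the object 90 degrees about z and translate it.
R = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
t = np.array([0.1, 0.0, 0.0])
obj_next = apply_rigid_motion(obj, R, t)  # object at the next step; background stays fixed
```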
