Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods fail dramatically in this setup because the monocular setting is underconstrained. Building on this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian "marbles", reducing the degrees of freedom of each Gaussian and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality and is on par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.
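To make the first two modifications concrete, the following is a minimal sketch and not the authors' implementation. All function names are hypothetical. The first part contrasts a general 3D Gaussian, whose covariance has six shape degrees of freedom (three scales and a rotation), with an isotropic "marble" described by a single radius. The second part shows one plausible reading of a hierarchical divide-and-conquer schedule: start from single-frame groups and repeatedly merge adjacent groups. The pairwise merging is an assumption; the abstract does not specify how subsequences are combined.

```python
import numpy as np


def anisotropic_covariance(scales, quat):
    """Covariance R S S^T R^T of a general 3D Gaussian.

    `scales` holds three axis scales; `quat` is a (w, x, y, z) quaternion.
    """
    q = np.asarray(quat, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    # Standard rotation matrix of a unit quaternion.
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
    S = np.diag(np.asarray(scales, dtype=float))
    return R @ S @ S.T @ R.T


def isotropic_covariance(radius):
    """Covariance of an isotropic 'marble': one scalar radius, no rotation."""
    return (radius ** 2) * np.eye(3)


def merge_schedule(num_frames):
    """Assumed pairwise schedule: merge adjacent frame groups level by level
    until a single group spans the whole video."""
    groups = [[t] for t in range(num_frames)]
    levels = [groups]
    while len(groups) > 1:
        # Concatenate each pair of neighbouring groups; an odd group is carried over.
        groups = [sum(groups[i:i + 2], []) for i in range(0, len(groups), 2)]
        levels.append(groups)
    return levels


if __name__ == "__main__":
    print(isotropic_covariance(2.0))
    for level in merge_schedule(5):
        print(level)
```

With equal scales, the rotation has no effect on the covariance, which is why an isotropic Gaussian needs no orientation parameters and leaves the optimizer free to spend its capacity on motion and appearance.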