Dynamic scene reconstruction is a long-standing challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints such as motion flow to guide the deformation. However, they learn motion changes at individual timestamps independently, making it difficult to reconstruct complex scenes, particularly those with violent movement, extreme-shaped geometries, or reflective surfaces. To address this issue, we design a plug-and-play module called TimeFormer that equips existing deformable 3D Gaussian reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned by TimeFormer to the base stream during training. This allows us to remove TimeFormer at inference, thereby preserving the original rendering speed. Extensive experiments on multi-view and monocular dynamic scenes validate the qualitative and quantitative improvements brought by TimeFormer.
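To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) a cross-temporal transformer encoder that attends over per-Gaussian features across timestamps, and (b) a two-stream setup in which a shared deformation head is trained both with and without the temporal encoder, so the encoder can be dropped at inference. All module names, feature dimensions, and output layouts here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CrossTemporalEncoder(nn.Module):
    """Hypothetical sketch of a cross-temporal transformer encoder.

    Input:  feats of shape (N, T, D) -- features of N Gaussians at T timestamps.
    Output: same shape, with features mixed across the temporal axis.
    """
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats):          # (N, T, D)
        return self.encoder(feats)     # self-attention runs over the T axis

class TwoStreamDeformation(nn.Module):
    """Two-stream sketch: both streams share one deformation head, so motion
    knowledge learned with temporal attention is transferred into weights
    that remain at inference time."""
    def __init__(self, dim=64, out_dim=10):
        super().__init__()
        self.timeformer = CrossTemporalEncoder(dim)
        # out_dim=10 assumes e.g. 3 (position) + 4 (rotation) + 3 (scale) offsets
        self.deform_head = nn.Linear(dim, out_dim)

    def forward(self, feats, use_timeformer=True):
        if use_timeformer:             # TimeFormer stream, training only
            feats = self.timeformer(feats)
        return self.deform_head(feats) # base stream path, kept at inference
```

Under this sketch, training would run both `forward(feats, True)` and `forward(feats, False)` and supervise each stream's rendering, while inference calls only the base path, which matches the abstract's claim that removing TimeFormer preserves the original rendering speed.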