This paper aims to tackle the problem of modeling dynamic urban street scenes from monocular videos. Recent methods extend NeRF by incorporating tracked vehicle poses to animate vehicles, enabling photo-realistic view synthesis of dynamic urban street scenes. However, these methods suffer from slow training and rendering, and they critically depend on highly accurate tracked vehicle poses. We introduce Street Gaussians, a new explicit scene representation that tackles these limitations. Specifically, the dynamic urban street is represented as a set of point clouds equipped with semantic logits and 3D Gaussians, each associated with either a foreground vehicle or the background. To model the dynamics of foreground vehicles, each object point cloud is optimized together with a set of optimizable tracked poses, along with a dynamic spherical harmonics model for time-varying appearance. The explicit representation allows easy composition of object vehicles and background, which in turn enables scene editing operations and rendering at 133 FPS (1066×1600 resolution) within half an hour of training. The proposed method is evaluated on multiple challenging benchmarks, including the KITTI and Waymo Open datasets. Experiments show that it consistently outperforms state-of-the-art methods across all datasets. Furthermore, the proposed representation delivers performance on par with that achieved using precise ground-truth poses, despite relying only on poses from an off-the-shelf tracker.
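To make the representation concrete, below is a minimal sketch, assuming a PyTorch implementation, of how foreground object Gaussians could be mapped to world space by optimizable tracked poses and composed with the background. All names here (ObjectGaussians, compose_scene, rot, trans) are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch (not the authors' code) of the composed scene representation
# described above: background Gaussians live in world coordinates, while each
# foreground vehicle's Gaussians live in an object frame and are mapped to
# world space by an optimizable tracked pose.
import torch
import torch.nn as nn


def skew(v):
    # Skew-symmetric matrix so that skew(v) @ x == cross(v, x).
    zero = torch.zeros((), dtype=v.dtype, device=v.device)
    return torch.stack([
        torch.stack([zero, -v[2], v[1]]),
        torch.stack([v[2], zero, -v[0]]),
        torch.stack([-v[1], v[0], zero]),
    ])


class ObjectGaussians(nn.Module):
    """Gaussians of one tracked vehicle, posed per frame (illustrative names)."""

    def __init__(self, num_points, num_frames):
        super().__init__()
        # Per-point Gaussian means in the vehicle's local frame. Scales and
        # per-Gaussian rotations are omitted for brevity; a full renderer
        # would also rotate the per-Gaussian covariances into world space.
        self.means = nn.Parameter(torch.randn(num_points, 3))
        # Tracked pose per frame (axis-angle rotation + translation),
        # initialized from an off-the-shelf tracker and refined by gradients.
        self.rot = nn.Parameter(torch.zeros(num_frames, 3))
        self.trans = nn.Parameter(torch.zeros(num_frames, 3))

    def world_means(self, frame_idx):
        # Rodrigues' formula: axis-angle vector -> rotation matrix.
        r = self.rot[frame_idx]
        theta = r.norm().clamp(min=1e-8)
        K = skew(r / theta)
        R = (torch.eye(3, dtype=r.dtype, device=r.device)
             + torch.sin(theta) * K
             + (1.0 - torch.cos(theta)) * (K @ K))
        return self.means @ R.T + self.trans[frame_idx]


def compose_scene(background_means, objects, frame_idx):
    # Concatenate background and pose-transformed object Gaussians so the
    # whole scene can be handed to a standard 3DGS rasterizer.
    parts = [background_means] + [o.world_means(frame_idx) for o in objects]
    return torch.cat(parts, dim=0)
```

Because the per-frame poses stay trainable rather than fixed, gradients from the rendering loss can correct errors in the tracker's output, which is consistent with the abstract's claim of matching ground-truth-pose quality from off-the-shelf tracking.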