We propose MVSplat, an efficient feed-forward 3D Gaussian Splatting model learned from sparse multi-view images. To accurately localize the Gaussian centers, we propose building a cost volume representation via plane sweeping in 3D space, where the cross-view feature similarities stored in the cost volume provide valuable geometric cues for depth estimation. We learn the Gaussian primitives' opacities, covariances, and spherical harmonics coefficients jointly with the Gaussian centers, relying only on photometric supervision. We demonstrate the importance of the cost volume representation in learning feed-forward Gaussian Splatting models via extensive experimental evaluations. On the large-scale RealEstate10K and ACID benchmarks, our model achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). Compared to the latest state-of-the-art method pixelSplat, our model uses 10× fewer parameters and infers more than 2× faster, while providing higher appearance and geometry quality as well as better cross-dataset generalization.
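To illustrate the plane-sweep cost volume idea central to the abstract, here is a minimal NumPy sketch under simplifying assumptions: views are rectified so that fronto-parallel depth planes correspond to horizontal disparities, and cross-view similarity is a channel-wise dot product of features. The function names are illustrative and not taken from the MVSplat implementation.

```python
import numpy as np

def plane_sweep_cost_volume(ref_feat, src_feat, num_planes=8, max_disp=8):
    """Build a cost volume by sweeping disparity planes.

    For rectified views, sweeping fronto-parallel depth planes reduces to
    shifting the source features horizontally by one disparity per plane
    and correlating with the reference features (dot product over channels).
    Shapes: ref_feat and src_feat are (C, H, W); returns (num_planes, H, W).
    """
    C, H, W = ref_feat.shape
    cost = np.zeros((num_planes, H, W), dtype=ref_feat.dtype)
    disparities = np.linspace(0, max_disp, num_planes)
    for i, d in enumerate(disparities):
        shift = int(round(d))
        # "Warp" the source view: shift features by this plane's disparity.
        warped = np.zeros_like(src_feat)
        if shift < W:
            warped[:, :, shift:] = src_feat[:, :, : W - shift]
        # Cross-view feature similarity, averaged over channels.
        cost[i] = (ref_feat * warped).sum(axis=0) / C
    return cost

def soft_argmax_depth(cost, depths):
    """Turn per-plane similarities into a depth map via a softmax over planes."""
    w = np.exp(cost - cost.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    return (w * depths[:, None, None]).sum(axis=0)
```

In the full method the warp would be a homography induced by each depth plane and the camera poses, and the resulting volume is refined by a network; the sketch only shows why cross-view similarities encode depth: the plane whose warp aligns the two views best yields the highest correlation.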