Recent advances in sparse multi-view scene reconstruction, such as DUSt3R and MASt3R, no longer require camera calibration or pose estimation. However, they process only a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error-prone pairwise reconstructions are usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose MV-DUSt3R, a fast single-stage feed-forward network. At its core are multi-view decoder blocks that exchange information across any number of views while attending to one reference view. To make our method robust to the choice of reference view, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To additionally enable novel view synthesis, we extend both models by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art.
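To make the multi-view decoder block concrete, below is a minimal PyTorch sketch of one plausible realization. The class name `MultiViewDecoderBlock`, the dimensions, and the exact attention layout (joint self-attention across all views' tokens, followed by cross-attention from every view to the reference view) are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a multi-view decoder block; names and layout are
# assumptions for illustration only, not the authors' actual architecture.
import torch
import torch.nn as nn

class MultiViewDecoderBlock(nn.Module):
    """Joint self-attention over all views' tokens, plus cross-attention
    from every view to one designated reference view."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor, ref_idx: int = 0) -> torch.Tensor:
        # tokens: (batch, n_views, n_tokens, d_model); works for any n_views.
        B, V, T, D = tokens.shape

        # 1) Self-attention over the tokens of all views jointly, so every
        #    view can exchange information with every other view.
        h = self.norm1(tokens).reshape(B, V * T, D)
        x = tokens + self.self_attn(h, h, h)[0].reshape(B, V, T, D)

        # 2) Cross-attention: each view queries the reference view's tokens.
        q = self.norm2(x).reshape(B * V, T, D)
        ref = self.norm2(x[:, ref_idx])                       # (B, T, D)
        kv = ref.unsqueeze(1).expand(B, V, T, D).reshape(B * V, T, D)
        x = x + self.cross_attn(q, kv, kv)[0].reshape(B, V, T, D)

        # 3) Position-wise feed-forward network.
        return x + self.mlp(self.norm3(x))

# Usage: a batch of 2 scenes with 8 views of 196 patch tokens each.
block = MultiViewDecoderBlock()
views = torch.randn(2, 8, 196, 256)
fused = block(views, ref_idx=0)   # same shape, information mixed across views
```

Because attention here operates over a flattened set of view tokens, the same weights handle any number of input views in a single feed-forward pass, which is what allows this design to replace combinatorial pairwise reconstruction followed by global optimization.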