Existing sparse-view reconstruction models heavily rely on accurate known camera poses. However, deriving camera extrinsics and intrinsics from sparse-view images presents significant challenges. In this work, we present FreeSplatter, a highly scalable, feed-forward reconstruction framework capable of generating high-quality 3D Gaussians from uncalibrated sparse-view images and recovering their camera parameters in mere seconds. FreeSplatter is built upon a streamlined transformer architecture, comprising sequential self-attention blocks that facilitate information exchange among multi-view image tokens and decode them into pixel-wise 3D Gaussian primitives. The predicted Gaussian primitives are situated in a unified reference frame, allowing for high-fidelity 3D modeling and instant camera parameter estimation using off-the-shelf solvers. To cater to both object-centric and scene-level reconstruction, we train two model variants of FreeSplatter on extensive datasets. In both scenarios, FreeSplatter outperforms state-of-the-art baselines in terms of reconstruction quality and pose estimation accuracy. Furthermore, we showcase FreeSplatter's potential in enhancing the productivity of downstream applications, such as text/image-to-3D content creation.
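The token-to-Gaussian pipeline described above can be sketched numerically. This is a minimal illustration, not the actual FreeSplatter implementation: all dimensions, the two-block depth, and the 14-parameter Gaussian layout (position, scale, rotation quaternion, opacity, RGB) are illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes; the real FreeSplatter configuration is not specified here.
N_VIEWS, H, W, D = 4, 8, 8, 32   # views, patch grid height/width, token dim
GAUSS_DIM = 14                   # xyz(3) + scale(3) + quaternion(4) + opacity(1) + RGB(3)

rng = np.random.default_rng(0)

def self_attention(tokens, w_qkv, w_out):
    """Single-head self-attention over the concatenated multi-view token sequence."""
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ w_out

# Tokens from all views are flattened into one sequence, so every
# self-attention block exchanges information across views.
tokens = rng.standard_normal((N_VIEWS * H * W, D))

for _ in range(2):  # two blocks stand in for the full sequential stack
    w_qkv = rng.standard_normal((D, 3 * D)) * 0.1
    w_out = rng.standard_normal((D, D)) * 0.1
    tokens = tokens + self_attention(tokens, w_qkv, w_out)  # residual connection

# A linear head decodes each image token into pixel-aligned Gaussian
# parameters, all expressed in a single shared reference frame; camera
# poses can then be recovered from the predicted 3D positions with an
# off-the-shelf solver (e.g. a PnP solver).
w_head = rng.standard_normal((D, GAUSS_DIM)) * 0.1
gaussians = tokens @ w_head
print(gaussians.shape)  # one Gaussian primitive per image token
```

Because each output Gaussian is tied to a pixel, the 2D-to-3D correspondences needed for pose estimation come for free, which is what makes the instant camera recovery step possible.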