In this work, we present UniG, a view-consistent 3D reconstruction and novel view synthesis model that generates a high-fidelity representation of 3D Gaussians from sparse images. Existing 3D Gaussian-based methods usually regress Gaussians per pixel of each view, create 3D Gaussians for each view separately, and merge them through point concatenation. Such a view-independent reconstruction approach often leads to view inconsistency, where the positions of the same 3D point predicted from different views may disagree. To address this problem, we develop a DETR (DEtection TRansformer)-like framework that treats 3D Gaussians as decoder queries and updates their parameters layer by layer by performing multi-view cross-attention (MVDFA) over multiple input images. In this way, multiple views naturally contribute to modeling a unitary representation of 3D Gaussians, making the reconstruction more view-consistent. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our method supports an arbitrary number of input images without causing memory explosion. Extensive experiments validate the advantages of our approach, showing superior performance over existing methods both quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively.
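To make the decoder-query idea concrete, below is a minimal PyTorch sketch. It is an illustrative assumption, not the released UniG code: standard multi-head attention stands in for the MVDFA module, and the class names, query count, and Gaussian parameter layout are hypothetical. The point it demonstrates is that a single shared set of Gaussian queries is refined layer by layer against features from all views, so the number of Gaussians does not grow with the number of input images.

```python
# Minimal sketch (assumption, not the authors' implementation) of a DETR-like
# decoder where learnable 3D Gaussian queries cross-attend to multi-view features.
import torch
import torch.nn as nn


class GaussianDecoderLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries: torch.Tensor, view_feats: torch.Tensor) -> torch.Tensor:
        # queries:    (B, N_gaussians, dim) -- one shared set, independent of the view count
        # view_feats: (B, V * tokens_per_view, dim) -- encoder features from all input views
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, view_feats, view_feats)[0])  # multi-view cross-attention
        return self.norm3(q + self.ffn(q))


class UniGSketch(nn.Module):
    def __init__(self, num_gaussians: int = 2048, dim: int = 256, layers: int = 6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_gaussians, dim))  # learnable Gaussian queries
        self.layers = nn.ModuleList(GaussianDecoderLayer(dim) for _ in range(layers))
        # Hypothetical parameter head: xyz (3) + scale (3) + rotation quaternion (4) + opacity (1) + RGB (3)
        self.head = nn.Linear(dim, 14)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats come from any image encoder; the number of views V can vary freely
        q = self.queries.unsqueeze(0).expand(view_feats.shape[0], -1, -1)
        for layer in self.layers:
            q = layer(q, view_feats)      # refine the unitary Gaussian set layer by layer
        return self.head(q)               # (B, num_gaussians, 14) Gaussian parameters


# Usage: memory for the Gaussian set is fixed regardless of how many views are fed in.
feats = torch.randn(1, 4 * 196, 256)      # e.g. 4 input views, 196 tokens each
gaussians = UniGSketch()(feats)            # -> torch.Size([1, 2048, 14])
```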