Monocular object pose estimation, a pivotal task in computer vision and robotics, heavily depends on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advancements in 3D Gaussian Splatting (3DGS) show compelling potential. Yet 3DGS performance degrades and tends to overfit when only a few input views are available. To address this challenge, we introduce SGPose, a novel Gaussian-based framework for sparse-view object pose estimation. Given as few as ten views, SGPose generates a geometry-aware representation starting from a random cuboid initialization, eschewing the Structure-from-Motion (SfM)-derived geometry required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the model reconstructed from sparse input and random initialization, with geometry-consistent depth supervision and online synthetic view warping being key to its success. Experiments on standard benchmarks, particularly the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse-view constraints, underscoring its potential in real-world applications.
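To make the random cuboid initialization mentioned above concrete, the following is a minimal sketch (not the authors' implementation) of one plausible way to seed Gaussian centers from a random cuboid instead of SfM points; the function name, cuboid extents, and point count are illustrative assumptions.

```python
# Minimal sketch: seed 3D Gaussian centers uniformly inside an axis-aligned
# cuboid that bounds the object, replacing SfM-derived points.
# All names and default values here are assumptions for illustration only.
import numpy as np

def random_cuboid_init(num_points: int,
                       half_extents=(0.1, 0.1, 0.1),
                       center=(0.0, 0.0, 0.0),
                       seed: int = 0) -> np.ndarray:
    """Sample candidate Gaussian centers uniformly inside an axis-aligned cuboid."""
    rng = np.random.default_rng(seed)
    half = np.asarray(half_extents, dtype=np.float64)
    ctr = np.asarray(center, dtype=np.float64)
    # Uniform samples in [-1, 1]^3, scaled by the half-extents and shifted to the center.
    pts = rng.uniform(-1.0, 1.0, size=(num_points, 3)) * half + ctr
    return pts

if __name__ == "__main__":
    pts = random_cuboid_init(5000)
    print(pts.shape)  # (5000, 3)
```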