2409.12518.md

Hi-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting

We propose Hi-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. Parameter usage in semantic SLAM systems grows significantly with the complexity of the environment, which makes scene understanding particularly challenging and costly. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. In addition, we enhance the whole SLAM system, improving both tracking and mapping performance. Hi-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy while running 2x faster. It also achieves competitive semantic segmentation rendering in small synthetic scenes, with significantly reduced storage and training time requirements. Rendering reaches 2,000 FPS with semantic information and 3,000 FPS without it. Most notably, Hi-SLAM handles complex real-world scenes with more than 500 semantic classes, which highlights its scaling-up capability.
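The parameter saving behind a hierarchical categorical code can be illustrated with a small sketch. It is not the authors' implementation: it assumes a hypothetical uniform tree with a fixed branching factor per level (the abstract describes a hierarchy built with the help of LLMs, which need not be uniform), and the function names are made up for illustration. A flat encoding needs one channel per leaf class, while a per-level encoding needs only one block of channels per level.

```python
from math import prod


def flat_dim(branching):
    """Channels for a flat encoding: one per leaf class."""
    return prod(branching)


def hierarchical_dim(branching):
    """Channels when each tree level gets its own categorical block."""
    return sum(branching)


def encode(leaf_index, branching):
    """Split a leaf index into per-level indices, coarse to fine.

    Assumes a uniform tree (hypothetical), so this is a mixed-radix expansion.
    """
    digits = []
    for b in reversed(branching):
        leaf_index, d = divmod(leaf_index, b)
        digits.append(d)
    return digits[::-1]


def decode(digits, branching):
    """Recover the leaf index from its per-level indices."""
    index = 0
    for d, b in zip(digits, branching):
        index = index * b + d
    return index


# Hypothetical 3-level tree with 8 children per node: 512 leaf classes.
branching = (8, 8, 8)
print(flat_dim(branching), hierarchical_dim(branching))
print(encode(300, branching), decode(encode(300, branching), branching))
```

Under these assumptions, 512 leaf classes need 24 per-Gaussian channels instead of 512, which is the kind of saving that makes label sets of more than 500 classes tractable.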
