LIMITATIONS

We assume a static environment where objects do not move during traversal, which limits the applicability of this work because many applications involve dynamic scenes with moving objects; in future work, we will adapt our method to handle such scenes. The motion of the Fetch mobile base can also significantly degrade LEGS reconstruction quality: high stiction between the robot’s caster wheels and the floor introduces jolts, causing camera pose inaccuracies and image blur. In the future, we hope to correct this with a new mobile base whose trajectory is determined autonomously by a frontier-based exploration algorithm [56].

Although autonomous navigation and obstacle avoidance have been extensively studied [57], obstacles pose a problem for the 3D Gaussian map when they are visible in only a few of the ground-truth images. 3D Gaussians are initialized at points deprojected from these few images, but there are not enough views to refine and properly train them; the result is oddly colored floaters that occlude parts of the static scene. When performing natural language queries, LEGS inherits the limitations of LERF and of CLIP distillation into 3D described in similar works [1]. In our experiments, we find that large-scale environments bring additional querying challenges, particularly with 1) small or far-field objects in the training views, and 2) objects whose color matches the background, such as white objects against a white surface. Language-embedded Gaussian splats can also produce false positives when queried for an object that is not in the scene: visually or semantically similar objects may be incorrectly matched to the query.
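One way to mitigate the floater artifacts described above is to prune Gaussians that project into too few training views. The sketch below is illustrative only and not part of LEGS: the `view_counts` bookkeeping and the `min_views` threshold are assumptions for the example.

```python
import numpy as np

def prune_underobserved(means, view_counts, min_views=5):
    """Keep only Gaussians observed in at least `min_views` training images.

    means:       (N, 3) array of Gaussian centers
    view_counts: (N,) number of training views in which each Gaussian
                 projects inside the image frustum (tracked during training)
    Returns the surviving centers and the boolean keep-mask.
    """
    keep = view_counts >= min_views
    return means[keep], keep

# Toy example: three Gaussians, one seen in only 2 views gets pruned.
means = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 2.0], [0.5, 0.5, 1.5]])
view_counts = np.array([12, 2, 8])
kept_means, mask = prune_underobserved(means, view_counts, min_views=5)
```

In practice the same mask would also be applied to the Gaussians' covariances, opacities, and embedded features so the splat remains consistent.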

CONCLUSION

In this work, we introduce Language-Embedded Gaussian Splats (LEGS), a system that trains Gaussian Splats online with CLIP embeddings for large-scale indoor scenes. Because pose error accumulates in large scenes, we use incremental bundle adjustment to improve pose fidelity for Gaussian Splat training. Results suggest LEGS trains 3.5x faster than LERF with comparable object recall.
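The natural language queries evaluated above follow the LERF-style relevancy score [1], which compares the query embedding against canonical negative phrases. A minimal sketch with toy 2-D vectors is shown below; the placeholder embeddings stand in for actual CLIP features and are assumptions of this example, not LEGS internals.

```python
import numpy as np

def relevancy(embed, query, canonicals):
    """Minimum over canonical negatives of the pairwise softmax between
    the query similarity and each negative's similarity (LERF-style [1])."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    q = cos(embed, query)
    return min(np.exp(q) / (np.exp(q) + np.exp(cos(embed, c)))
               for c in canonicals)

# Toy "embeddings": a rendered embedding aligned with the query should
# score higher than one aligned with a canonical negative phrase.
query = np.array([1.0, 0.0])        # stand-in for CLIP("a coffee mug")
negatives = [np.array([0.0, 1.0])]  # stand-in for CLIP("stuff")
match = relevancy(np.array([1.0, 0.0]), query, negatives)
miss = relevancy(np.array([0.0, 1.0]), query, negatives)
```

Thresholding this score is one way to reject the out-of-scene false positives discussed in the limitations, at the cost of also rejecting some low-confidence true matches.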

REFERENCES

[1] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “Lerf: Language embedded radiance fields,” in IEEE/CVF ICCV, 2023.

[2] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, 2021.

[3] Y. Ze et al., “Gnfactor: Multi-task real robot learning with generalizable neural feature fields,” in CoRL, PMLR, 2023, pp. 284–301.

[4] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in 7th Annual Conference on Robot Learning, 2023.

[5] A. Rashid et al., “Language embedded radiance fields for zero-shot task-oriented grasping,” in 7th Annual CoRL, 2023.

[6] K. Jatavallabhula et al., “Conceptfusion: Open-set multimodal 3d mapping,” Robotics: Science and Systems (RSS), 2023.

[7] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, Clip-fields: Weakly supervised semantic fields for robotic memory, 2023.

[8] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.

[9] S. Kobayashi, E. Matsumoto, and V. Sitzmann, “Decomposing nerf for editing via feature field distillation,” NeurIPS, vol. 35, pp. 23311–23330, 2022.

[10] V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi, “Neural feature fusion fields: 3d distillation of self-supervised 2d image representations,” in 2022 3DV, IEEE, 2022.

[11] A. Meuleman et al., “Progressively optimized local radiance fields for robust view synthesis,” in Proceedings of the IEEE/CVF CVPR, 2023, pp. 16539–16548.

[12] P. Wang et al., “F2-nerf: Fast neural radiance field training with free camera trajectories,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4150–4159.

[13] M. Tancik et al., “Block-nerf: Scalable large scene neural view synthesis,” in CVPR, 2022.

[14] S. Peng, K. Genova, C. Jiang, A. Tagliasacchi, M. Pollefeys, and T. Funkhouser, Openscene: 3d scene understanding with open vocabularies, 2023.

[15] M. Bajracharya et al., “Demonstrating mobile manipulation in the wild: A metrics-driven approach,” RSS, 2023.

[16] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.

[17] X. Zuo, P. Samangouei, Y. Zhou, Y. Di, and M. Li, Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding, 2024.

[18] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, Langsplat: 3d language gaussian splatting, 2024.

[19] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, 102:1–102:15, Jul. 2022.

[20] K. O. Arras, “Feature-based robot navigation in known and unknown environments,” 2003.

[21] R. Chatila and J. Laumond, “Position referencing and consistent world modeling for mobile robots,” in Proceedings. 1985 IEEE International Conference on Robotics and Automation, IEEE, vol. 2, 1985, pp. 138–145.

[22] G. Jiang, L. Yin, S. Jin, C. Tian, X. Ma, and Y. Ou, “A simultaneous localization and mapping (slam) framework for 2.5 d map building based on low-cost lidar and vision fusion,” Applied Sciences, vol. 9, no. 10, p. 2105, 2019.

[23] H. Choset and K. Nagatani, “Topological simultaneous localization and mapping (slam): Toward exact localization without explicit localization,” IEEE Transactions on robotics and automation, vol. 17, no. 2, pp. 125–137, 2001.

[24] A. Tapus, “Topological slam: Simultaneous localization and mapping with fingerprints of places,” 2005.

[25] B. Alsadik and S. Karam, “The simultaneous localization and mapping (slam)-an overview,” Journal of Applied Science and Technology Trends, vol. 2, no. 02, pp. 147–158, 2021.

[26] S. Kohlbrecher, O. Von Stryk, J. Meyer, and U. Klingauf, “A flexible and scalable slam system with full 3d motion estimation,” in 2011 IEEE international symposium on safety, security, and rescue robotics, IEEE, 2011, pp. 155–160.

[27] W. Hess, D. Kohler, H. Rapp, and D. Andor, “Real-time loop closure in 2d lidar slam,” in 2016 ICRA, 2016.

[28] L. Huang, “Review on lidar-based slam techniques,” in 2021 International Conference on Signal Processing and Machine Learning (CONF-SPML), IEEE, 2021, pp. 163–168.

[29] M. T. Lázaro, R. Capobianco, and G. Grisetti, “Efficient long-term mapping in dynamic environments,” in 2018 IROS, IEEE, 2018.

[30] Z. Zhu et al., “Nice-slam: Neural implicit scalable encoding for slam,” in Proceedings of the IEEE/CVF CVPR, 2022.

[31] A. Rosinol, J. J. Leonard, and L. Carlone, “Nerf-slam: Real-time dense monocular slam with neural radiance fields,” in 2023 IROS, IEEE, 2023.

[32] L. Roldao, R. De Charette, and A. Verroust-Blondet, “3d semantic scene completion: A survey,” International Journal of Computer Vision, vol. 130, no. 8, pp. 1978–2005, 2022.

[33] A. Nüchter and J. Hertzberg, “Towards semantic maps for mobile robots,” Robotics and Autonomous Systems, vol. 56, no. 11, 2008.

[34] H. A. Kestler et al., “Concurrent object identification and localization for a mobile robot,” Künstliche Intelligenz, vol. 14, no. 4, pp. 23–29, 2000.

[35] K. Genova et al., “Learning 3d semantic segmentation with only 2d image supervision,” in 2021 International Conference on 3D Vision (3DV), IEEE, 2021, pp. 361–372.

[36] V. Vineet et al., “Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction,” in 2015 ICRA, IEEE, 2015.

[37] A. Brohan et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in Conference on robot learning, PMLR, 2023.

[38] A. Brohan et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.

[39] B. Zitkovich et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in CoRL, PMLR, 2023, pp. 2165–2183.

[40] K. M. Jatavallabhula et al., “Conceptfusion: Open-set multimodal 3d mapping,” RSS, 2023.

[41] A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, PMLR, 2021, pp. 8748–8763.

[42] Q. Gu et al., “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv, 2023.

[43] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models,” 2023.

[44] N. Keetha et al., “Splatam: Splat, track & map 3d gaussians for dense rgb-d slam,” CVPR, 2023.

[45] M. Li, S. Liu, H. Zhou, G. Zhu, N. Cheng, and H. Wang, Sgs-slam: Semantic gaussian splatting for neural dense slam, 2024.

[46] T. Chen, O. Shorinwa, W. Zeng, J. Bruno, P. Dames, and M. Schwager, Splat-nav: Safe real-time robot navigation in gaussian splatting maps, 2024.

[47] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in (CVPR), IEEE, 2017.

[48] C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, Scannet++: A high-fidelity dataset of 3d indoor scenes, 2023.

[49] T. Schöps, T. Sattler, and M. Pollefeys, “BAD SLAM: Bundle adjusted direct RGB-D SLAM,” in CVPR, 2019.

[50] X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M.-H. Yang, Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes, 2024.

[51] S. Agarwal et al., “Building rome in a day,” Communications of the ACM, vol. 54, no. 10, 2011.

[52] M. Tancik et al., “Nerfstudio: A modular framework for neural radiance field development,” in ACM SIGGRAPH 2023, 2023, pp. 1– 12.

[53] Z. Teed and J. Deng, Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras, 2021.

[54] K. Shankar, M. Tjersland, J. Ma, K. Stone, and M. Bajracharya, “A learned stereo depth system for robotic manipulation in homes,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.

[55] J.-C. Shi, M. Wang, H.-B. Duan, and S.-H. Guan, “Language embedded 3d gaussians for open-vocabulary scene understanding,” arXiv preprint arXiv:2311.18482, 2023.

[56] A. Topiwala, P. Inani, and A. Kathpal, “Frontier based exploration for autonomous robot,” arXiv preprint arXiv:1806.03581, 2018.

[57] A. Pandey, S. Pandey, and D. Parhi, “Mobile robot navigation and obstacle avoidance techniques: A review,” Int Rob Auto J, vol. 2, no. 3, p. 00022, 2017.

Authors:

  1. Justin Yu
  2. Kush Hari
  3. Kishore Srinivas
  4. Karim El-Refai
  5. Adam Rashid
  6. Chung Min Kim
  7. Justin Kerr
  8. Richard Cheng
  9. Muhammad Zubair Irshad
  10. Ashwin Balakrishna
  11. Thomas Kollar
  12. Ken Goldberg

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.