3. Methodology
This section describes the pipeline of our Vision-Language Navigation (VLN) method, which employs O3D-SIM. We begin with an overview of the pipeline and then examine each of its steps in detail. We first outline data collection, which yields a set of RGB-D images together with the extrinsic and intrinsic camera parameters. We then describe the creation of the Open-set 3D Semantic Instance Map (O3D-SIM), which proceeds in two stages: first, we extract open-set semantic instance information from the images; second, we use this information to organize the 3D point cloud into an open-set 3D semantic instance map. Finally, we discuss the implementation and functionality of the VLN module.
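To make the role of the camera parameters concrete, the following is a minimal sketch of how an RGB-D frame can be lifted into a world-frame point cloud with the standard pinhole model. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, the use of zero depth to mark missing measurements, and the convention that the extrinsics are given as a 4x4 camera-to-world matrix `T_wc` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Point3, T_wc: Sequence[Sequence[float]]) -> Point3:
    """Apply a 4x4 camera-to-world pose (assumed convention) to a camera-frame point."""
    x, y, z = p_cam
    return tuple(
        T_wc[r][0] * x + T_wc[r][1] * y + T_wc[r][2] * z + T_wc[r][3]
        for r in range(3)
    )


def depth_to_world_points(depth_image: Sequence[Sequence[float]],
                          intrinsics: Tuple[float, float, float, float],
                          T_wc: Sequence[Sequence[float]]) -> List[Point3]:
    """Back-project every valid pixel of a depth image into world coordinates."""
    fx, fy, cx, cy = intrinsics
    points: List[Point3] = []
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            if d > 0:  # assumption: zero depth marks a missing measurement
                p_cam = backproject_pixel(u, v, d, fx, fy, cx, cy)
                points.append(camera_to_world(p_cam, T_wc))
    return points
```

Restricting the loop to the pixels inside an instance mask would yield the 3D points belonging to a single object, which is the kind of per-instance geometry the later mapping stage operates on.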
The O3D-SIM creation pipeline is depicted in Fig. 2. The first step, presented in Section 3.2, extracts open-set semantic instance information from the input RGB image sequence. For each object instance, this information comprises the mask and the semantic features, represented by CLIP [9] and DINO [10] embeddings. The second step, presented in Section 3.3, uses this information to cluster the input 3D point cloud into an open-set 3D semantic object map (see Figs. 2 and 3). The map is refined incrementally as the RGB-D images in the sequence are processed over time.
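As a rough illustration of what incremental refinement over a sequence can look like, the sketch below keeps one record per object instance and folds each new detection into the map. It is not the procedure of Section 3.3: the `ObjectInstance` record, the `integrate` function, the equal weighting of CLIP and DINO cosine similarity, the 0.8 threshold, and the running-average feature update are all assumptions made for this example, and geometric overlap between point sets is ignored entirely.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class ObjectInstance:
    """Hypothetical per-object record: embeddings plus accumulated 3D points."""
    clip_feat: List[float]   # CLIP embedding of the instance
    dino_feat: List[float]   # DINO embedding of the instance
    points: List[Tuple[float, float, float]] = field(default_factory=list)
    num_views: int = 1       # number of detections merged into this record


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def running_mean(old: Sequence[float], new: Sequence[float], n: int) -> List[float]:
    """Update a mean over n samples with one additional sample."""
    return [(n * o + x) / (n + 1) for o, x in zip(old, new)]


def integrate(instances: List[ObjectInstance], detection: ObjectInstance,
              threshold: float = 0.8) -> int:
    """Merge a detection into its best-matching instance, or add it as a new one.

    Returns the index of the instance that now holds the detection.
    """
    best_idx, best_sim = None, threshold
    for i, inst in enumerate(instances):
        # Assumed score: equal-weight average of CLIP and DINO similarity.
        sim = 0.5 * (cosine(inst.clip_feat, detection.clip_feat)
                     + cosine(inst.dino_feat, detection.dino_feat))
        if sim >= best_sim:
            best_idx, best_sim = i, sim
    if best_idx is None:
        instances.append(detection)  # no match above threshold: new object
        return len(instances) - 1
    target = instances[best_idx]
    n = target.num_views
    target.clip_feat = running_mean(target.clip_feat, detection.clip_feat, n)
    target.dino_feat = running_mean(target.dino_feat, detection.dino_feat, n)
    target.points.extend(detection.points)
    target.num_views = n + 1
    return best_idx
```

Calling `integrate` once per detected instance in each frame grows the list of objects while repeated observations of the same object accumulate points and smooth its features, which is the general sense in which a map of this kind can improve as more frames arrive.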
Authors:
(1) Laksh Nanwani, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;
(2) Kumaraditya Gupta, International Institute of Information Technology, Hyderabad, India;
(3) Aditya Mathur, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;
(4) Swayam Agrawal, International Institute of Information Technology, Hyderabad, India;
(5) A.H. Abdul Hafez, Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey;
(6) K. Madhava Krishna, International Institute of Information Technology, Hyderabad, India.
This paper is