Authors:

  1. Justin Yu
  2. Kush Hari
  3. Kishore Srinivas
  4. Karim El-Refai
  5. Adam Rashid
  6. Chung Min Kim
  7. Justin Kerr
  8. Richard Cheng
  9. Muhammad Zubair Irshad
  10. Ashwin Balakrishna
  11. Thomas Kollar
  12. Ken Goldberg


INTRODUCTION

Consider open-vocabulary robot requests such as “Where are the gluten-free crackers?” or “Get a stain remover spray”: a robot must parse such queries, localize the relevant objects, and navigate to them. A large body of recent work distills the outputs of large vision-language models into 3D representations such as point clouds or NeRFs [2]. These semantic representations have been applied to both manipulation [3], [4], [5] and large-scale scene understanding [6], [7], [8], showing the promise of using large models zero-shot for open-vocabulary task specification. One key challenge in scaling these methods to large environments is the underlying 3D representation, which should be flexible across a variety of scales, able to update with new observations, and fast. Although NeRFs are commonly used as the 3D representation for distilling 2D semantic features [9], [10], [1], scaling NeRFs to large scenes can be cumbersome because they typically rely on a fixed spatial resolution [11], [12], [13], are difficult to modify, and are slower to render. A popular alternative is point clouds [14], [7], [8], [6], which work seamlessly with many SLAM algorithms. However, a given point is assigned a single color and semantic feature, whether by fusing CLIP into the point cloud or through a contrastively supervised field, whereas a multi-scale model of the world can simultaneously reason about objects and their parts, similar to LERF [1].
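The open-vocabulary querying described above reduces, at its core, to a cosine-similarity search between a text embedding and per-point fused language features. The minimal sketch below uses random stand-in vectors in place of real CLIP embeddings (dimension 8 rather than CLIP's 512); `localize_query` and the toy data are illustrative assumptions, not part of any cited system.

```python
import numpy as np

def localize_query(point_features: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the point whose fused feature best matches the query.

    point_features: (N, D) per-point language embeddings (e.g., CLIP features
    fused into a point cloud). text_embedding: (D,) embedding of the query.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    pf = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    te = text_embedding / np.linalg.norm(text_embedding)
    return int(np.argmax(pf @ te))

# Toy example with random stand-in embeddings.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))
query = features[42] + 0.01 * rng.normal(size=8)  # query nearly identical to point 42
print(localize_query(features, query))  # → 42
```

Note that this single-feature-per-point lookup is exactly what limits point-cloud fusion: the same crackers box cannot simultaneously score highly for “crackers” and “shelf of snacks” without a multi-scale representation.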

3D Gaussian Splatting (3DGS) [16] models a 3D scene with a large set of 3D Gaussians. Recent works [17], [18] successfully assign semantic features to every Gaussian in the scene. However, existing techniques that combine semantic features with 3DGS scene reconstruction require offline computation of keyframe transforms and 3D Gaussian initialization points. In this paper, we focus on linking language understanding to Gaussian Splats in large-scale scenes while incrementally training on a stream of RGBD images of the scene from a mobile robot. This incremental training offers substantial benefits, notably enabling the robot to autonomously determine its position within the environment and subsequently use the map data for enhanced operational efficiency. LEGS combines geometry and appearance information from 3DGS with semantic knowledge from CLIP by grounding language embeddings into the 3DGS, similar to the method described in [17]. LEGS incrementally registers images and simultaneously optimizes both the 3D Gaussians and a dense language field. This allows robots to build maps that contain rich representations of their surroundings and that can be queried with natural language.

This paper makes three contributions:

• An online multi-camera 3DGS reconstruction system for large-scale scenes. The system takes as input three video streams from a mobile robot and incrementally builds the 3D scene.

• Language-Embedded Gaussian Splatting (LEGS), a hybrid 3D semantic representation that uses explicit 3D Gaussians for geometry and an implicit scale-conditioned hashgrid [19] for semantics.

• Results from physical experiments suggesting LEGS can produce high-quality Gaussian Splats in room-scale scenes with training time 3.5x faster than a LERF baseline.
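To illustrate how a scale-conditioned semantic field can be queried, the sketch below scores features with a LERF-style relevancy (pairwise softmax of the query embedding against canonical negative embeddings, taking the minimum over negatives) and selects the scale with the highest relevancy. The field lookup `toy_field`, the embedding dimension, and the scale list are hypothetical stand-ins under these assumptions, not the actual LEGS hashgrid implementation.

```python
import numpy as np

def relevancy(f, q, canonicals):
    """LERF-style relevancy: pairwise softmax of the feature's affinity to the
    query q vs. each canonical negative; the min is a conservative score."""
    eq = np.exp(f @ q)
    return min(eq / (eq + np.exp(f @ c)) for c in canonicals)

def best_scale(field, x, q, canonicals, scales):
    """Evaluate a scale-conditioned field at position x across several scales
    and return (scale, score) for the scale most relevant to the query."""
    scores = [relevancy(field(x, s), q, canonicals) for s in scales]
    i = int(np.argmax(scores))
    return scales[i], scores[i]

# Toy setup: unit vectors standing in for CLIP embeddings (D=8, not 512).
rng = np.random.default_rng(1)
q = rng.normal(size=8); q /= np.linalg.norm(q)
canonicals = rng.normal(size=(3, 8))
canonicals /= np.linalg.norm(canonicals, axis=1, keepdims=True)

def toy_field(x, s):
    # Stand-in hashgrid lookup: matches the query only at scale 0.5,
    # and returns an unrelated (canonical) embedding at other scales.
    return q if s == 0.5 else canonicals[0]

scale, score = best_scale(toy_field, None, q, canonicals, [0.25, 0.5, 1.0, 2.0])
print(scale)  # → 0.5
```

Sweeping the scale input is what lets one field answer queries at both object level (“cracker box”) and part or region level, rather than committing each point to a single feature.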

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.