Authors:

(1) Samson Yu, Dept. of Computer Science, National University of Singapore ([email protected]);

(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;

(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;

(4) Jiafei Duan, University of Washington;

(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute ([email protected]).

V. EXPERIMENTAL SETUP

In this section, we evaluate the physical property prediction and reasoning capabilities of our proposed method. We design several experiments to answer the following questions:

  1. Are our physical property predictions useful for OCTOPI to reason about everyday scenarios?

  2. Can OCTOPI be used on real robots to help them accomplish tasks that require physical reasoning?

  3. Can OCTOPI’s understanding of physical properties generalize to unseen everyday objects?

A. Data Processing

The tactile videos were processed into frames. To focus on a few salient frames for better efficiency, we kept the frames whose total pixel intensity difference from the preceding frame was within the top 30%. During training, we randomly sampled 5 frames from these salient frames; during evaluation, we selected 5 frames at uniform intervals starting from the first salient frame. Data augmentation was performed during training in the form of random horizontal and vertical flips, each applied with 50% probability.
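The sketch below illustrates this frame selection, sampling, and augmentation pipeline under stated assumptions: frames are held as a NumPy array of shape (T, H, W[, C]), and the function names (`select_salient_frames`, `sample_frames`, `augment`) are illustrative rather than the authors' implementation.

```python
import numpy as np


def select_salient_frames(frames: np.ndarray, top_frac: float = 0.3) -> np.ndarray:
    """Keep frames whose total pixel-intensity difference from the preceding
    frame is within the top `top_frac` fraction (illustrative sketch)."""
    # Absolute intensity differences between consecutive frames.
    diffs = np.abs(frames[1:].astype(np.int64) - frames[:-1].astype(np.int64))
    scores = diffs.reshape(len(diffs), -1).sum(axis=1)
    k = max(1, int(np.ceil(top_frac * len(scores))))
    # Indices (into the original video) of the k most-changed frames, in temporal order.
    keep = np.sort(np.argsort(scores)[-k:]) + 1
    return frames[keep]


def sample_frames(salient: np.ndarray, n: int = 5, training: bool = True,
                  rng: np.random.Generator | None = None) -> np.ndarray:
    """Training: randomly sample n salient frames.
    Evaluation: take n frames at uniform intervals from the first salient frame."""
    rng = rng or np.random.default_rng()
    n = min(n, len(salient))
    if training:
        idx = np.sort(rng.choice(len(salient), size=n, replace=False))
    else:
        idx = np.linspace(0, len(salient) - 1, num=n).round().astype(int)
    return salient[idx]


def augment(frames: np.ndarray, rng: np.random.Generator | None = None) -> np.ndarray:
    """Training-time augmentation: horizontal and vertical flips, each with a 50%
    chance, applied consistently across all sampled frames."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        frames = frames[:, :, ::-1]   # horizontal flip (width axis)
    if rng.random() < 0.5:
        frames = frames[:, ::-1, :]   # vertical flip (height axis)
    return frames
```

A clip would then be processed as `augment(sample_frames(select_salient_frames(video)))` at training time and `sample_frames(select_salient_frames(video), training=False)` at evaluation time.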

B. Training Hyperparameters

C. Training Requirements

Encoder fine-tuning took 6 hours and required less than 5GB of GPU VRAM. Tactile feature alignment together with end-to-end fine-tuning took 5 hours for OCTOPI-7b and 6.5 hours for OCTOPI-13b. We used one NVIDIA RTX A6000 GPU for OCTOPI-7b and two NVIDIA RTX A6000 GPUs for OCTOPI-13b.

This paper is available on arxiv under CC BY 4.0 DEED license.