Authors:
(1) Samson Yu, Dept. of Computer Science, National University of Singapore ([email protected]);
(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;
(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;
(4) Jiafei Duan, University of Washington;
(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute ([email protected]).
Table of Links
- Abstract and I. Introduction
- II. Related Work
- III. PhysiClear - Tactile and Physical Understanding Training & Evaluation Suite
- IV. Octopi - Vision-Language Property-Guided Physical Reasoning
- V. Experimental Setup
- VI. Experimental Results
- VII. Ablations
- VIII. Conclusion and Discussion, Acknowledgements, and References
- Appendix for Octopi: Object Property Reasoning with Large Tactile-Language Models
- APPENDIX A: ANNOTATION DETAILS
- APPENDIX B: OBJECT DETAILS
- APPENDIX C: PROPERTY STATISTICS
- APPENDIX D: SAMPLE VIDEO STATISTICS
- APPENDIX E: ENCODER ANALYSIS
- APPENDIX F: PG-INSTRUCTBLIP AVOCADO PROPERTY PREDICTION
V. EXPERIMENTAL SETUP
In this section, we evaluate the physical property prediction and reasoning capabilities of our proposed method. We design several experiments to answer the following questions:
- Are our physical property predictions useful for OCTOPI to reason about everyday scenarios?
- Can OCTOPI be used on real robots to help them accomplish tasks using physical reasoning?
- Can OCTOPI's understanding of physical properties generalize to unseen everyday objects?
A. Data Processing
The tactile videos were processed into frames. To focus on a few salient frames for better efficiency, we kept frames whose total pixel intensity difference from the preceding frame was in the top 30%. During training we randomly sampled 5 of these salient frames; during evaluation we selected 5 frames at uniform intervals starting from the first salient frame. Data augmentation was applied during training in the form of random horizontal and vertical flips, each with a 50% chance.
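A minimal Python sketch of this frame selection, sampling, and augmentation pipeline is shown below; the function names, array shapes, and the handling of videos with fewer than five salient frames are illustrative assumptions, not the authors' implementation.

```python
# Sketch of salient-frame selection, sampling, and flip augmentation.
# Names and details (e.g. quantile thresholding, replacement sampling) are assumptions.
import numpy as np

NUM_FRAMES = 5  # frames kept per tactile video

def select_salient_frames(video: np.ndarray) -> np.ndarray:
    """Keep frames whose total pixel-intensity difference from the
    preceding frame is in the top 30% for the video.
    video: (T, H, W, C) array of tactile frames."""
    diffs = np.abs(video[1:].astype(np.float32) - video[:-1].astype(np.float32))
    scores = diffs.reshape(diffs.shape[0], -1).sum(axis=1)  # per-frame total change
    cutoff = np.quantile(scores, 0.7)                       # top 30% threshold
    salient_idx = np.where(scores >= cutoff)[0] + 1         # +1: diffs start at frame 1
    return video[salient_idx]

def sample_frames(salient: np.ndarray, training: bool,
                  rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Training: randomly sample 5 salient frames.
    Evaluation: take 5 frames at uniform intervals from the first salient frame."""
    if training:
        # Sample with replacement only if there are fewer than 5 salient frames (assumption).
        idx = rng.choice(len(salient), size=NUM_FRAMES, replace=len(salient) < NUM_FRAMES)
        idx.sort()
    else:
        idx = np.linspace(0, len(salient) - 1, NUM_FRAMES).round().astype(int)
    return salient[idx]

def augment(frames: np.ndarray,
            rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Training-time augmentation: horizontal and vertical flips, each with 50% chance."""
    if rng.random() < 0.5:
        frames = frames[:, :, ::-1]  # horizontal flip (width axis)
    if rng.random() < 0.5:
        frames = frames[:, ::-1, :]  # vertical flip (height axis)
    return frames
```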
B. Training Hyperparameters
C. Training Requirements
Encoder fine-tuning took 6 hours and required less than 5 GB of GPU VRAM. Tactile feature alignment together with end-to-end fine-tuning took 5 hours for OCTOPI-7b and 6.5 hours for OCTOPI-13b. We used one NVIDIA RTX A6000 GPU for OCTOPI-7b and two NVIDIA RTX A6000 GPUs for OCTOPI-13b.
This paper is available on arxiv under CC BY 4.0 DEED license.