Authors:

(1) Samson Yu, Dept. of Computer Science, National University of Singapore ([email protected]);

(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;

(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;

(4) Jiafei Duan, University of Washington;

(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute ([email protected]).

IV. OCTOPI - VISION-LANGUAGE PROPERTY-GUIDED PHYSICAL REASONING

The OCTOPI framework comprises three trained components: 1) tactile input encoder, 2) projection module, and 3) LLM, similar to prior LVLM models [34, 36, 63]. A summary of our overall framework is shown in Fig. 3.

We leverage the capabilities of pre-trained vision models, notably the CLIP [39] visual encoder ViT-L/14, as the foundation for our tactile encoder to derive meaningful feature representations. The encoder’s output is then mapped to the LLM’s word embedding space using a projection module, typically consisting of one or two trainable layers. Our projection module, inspired by LLaVA [34, 33], employs two linear layers with an intermediate GELU activation [21]. Lastly, the LLM serves as the language understanding component in OCTOPI. The performance of the LLM is largely influenced by its pretraining datasets. We utilize the open-source LLaMA-based LLM, Vicuna [11], recognized for its dialogue capabilities.
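For illustration, a minimal PyTorch sketch of such a LLaVA-style projection module is given below; the class name and the feature dimensions (1024 for the CLIP ViT-L/14 output and 4096 for Vicuna's hidden size) are assumptions for exposition rather than the exact implementation.

```python
import torch.nn as nn

class TactileProjector(nn.Module):
    """Maps tactile encoder features into the LLM's word-embedding space."""
    def __init__(self, encoder_dim=1024, llm_dim=4096):  # assumed dimensions
        super().__init__()
        # Two linear layers with an intermediate GELU activation,
        # following the LLaVA-style projection design described above.
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tactile_features):
        # tactile_features: (batch, num_frames, encoder_dim)
        # returns:          (batch, num_frames, llm_dim)
        return self.proj(tactile_features)
```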

The inference process is illustrated in Fig. 3. OCTOPI receives an instruction to evaluate the physical properties of uncooked rice. The text is tokenized and fed into the LLM's word embedding layer to produce word [W] embeddings. A sequence of five tactile images is processed through the tactile encoder, and the output embeddings are passed to the projection module to obtain the final tactile [T] embeddings. Two newly trained word embeddings mark the beginning and end of the tactile data, respectively. The tactile embeddings are then merged with the word embeddings at designated positions to form the final instruction embeddings for the LLM.
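A minimal sketch of how the tactile [T] embeddings could be spliced into the word [W] embeddings is shown below; batching and the handling of the tactile marker tokens are simplified, and the function name and shapes are illustrative assumptions.

```python
import torch

def build_instruction_embeddings(word_embeds, tactile_embeds, insert_pos):
    """Splice projected tactile [T] embeddings into the word [W] embeddings
    at a designated position in the instruction.

    word_embeds:    (seq_len, llm_dim) embeddings of the tokenized instruction
    tactile_embeds: (num_frames, llm_dim) output of the projection module
    insert_pos:     index in the instruction where the tactile span belongs
    """
    return torch.cat(
        [word_embeds[:insert_pos], tactile_embeds, word_embeds[insert_pos:]],
        dim=0,
    )
```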

We follow a three-step training methodology: (i) encoder fine-tuning, (ii) tactile feature alignment, and (iii) end-to-end fine-tuning. In the following, we describe each of these steps in greater detail.

A. Encoder Fine-tuning

Existing LVLM models take natural videos as input and can use CLIP’s visual encoder without modification. However, our work involves vision-based tactile inputs, which marks a significant distribution shift from natural images, necessitating additional fine-tuning to derive useful representations from these inputs.

We fine-tune our visual encoder on multitask physical property classification to obtain useful representations from tactile inputs. We adopt the architecture of ViFi-CLIP [40] so that our visual encoder can be trained on video inputs. In ViFi-CLIP, frame-level embeddings from CLIP's visual encoder are average-pooled to obtain a video-level representation.
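The pooling step can be sketched as follows; `clip_visual_encoder` is a placeholder for the fine-tuned ViT-L/14, and the shapes are illustrative.

```python
import torch

def video_representation(clip_visual_encoder, frames):
    """ViFi-CLIP-style pooling: encode each frame with CLIP's visual encoder,
    then average-pool over time to obtain a video-level representation.

    frames: (batch, num_frames, 3, H, W) tactile image sequence
    """
    b, t = frames.shape[:2]
    frame_feats = clip_visual_encoder(frames.flatten(0, 1))  # (b * t, feat_dim)
    frame_feats = frame_feats.view(b, t, -1)
    return frame_feats.mean(dim=1)                           # (b, feat_dim)
```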

We then append learnable prompts to the pre-trained CLIP visual encoder ViT-L/14 following Visual Prompt Tuning (VPT) [25] and use the resulting prompted encoder to initialize ViFi-CLIP's visual encoder. Specifically, we attach 8 task-specific learnable prompts and a shared linear layer to the input sequence of each Transformer [51] layer in the visual encoder and freeze the entire pre-trained Transformer backbone.
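A simplified sketch of deep visual prompt tuning is shown below, assuming a Transformer layer that maps a (batch, sequence, embedding) tensor to a tensor of the same shape; the shared linear layer mentioned above is omitted for brevity, and the initialization scale is an assumption.

```python
import torch
import torch.nn as nn

class DeepPromptedLayer(nn.Module):
    """Wraps a frozen Transformer layer with learnable prompt tokens (VPT-style)."""
    def __init__(self, frozen_layer, embed_dim, num_prompts=8):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False  # pre-trained backbone stays frozen
        # 8 task-specific learnable prompts per layer (initialization scale assumed)
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, tokens):
        # tokens: (batch, seq_len, embed_dim); prepend the layer-specific prompts,
        # run the frozen layer, then drop the prompt positions from the output.
        n = self.prompts.shape[0]
        prompted = torch.cat(
            [self.prompts.expand(tokens.size(0), -1, -1), tokens], dim=1
        )
        out = self.layer(prompted)
        return out[:, n:, :]
```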

Finally, we add three separate classification heads to ViFi-CLIP, each of which predicts a label for one property (i.e., hardness, roughness, or bumpiness), and train all three heads simultaneously using the cross-entropy loss. The model achieving the highest combined validation accuracy, i.e., correctly predicting all three properties for an object, is selected.
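The multitask heads and their joint cross-entropy objective can be sketched as follows; the number of classes per property is an assumption here and depends on the label granularity of the dataset.

```python
import torch.nn as nn

class PropertyHeads(nn.Module):
    """Three independent classification heads over the pooled tactile embedding,
    one per physical property (hardness, roughness, bumpiness)."""
    def __init__(self, feat_dim, num_classes=3):  # num_classes is assumed
        super().__init__()
        self.hardness = nn.Linear(feat_dim, num_classes)
        self.roughness = nn.Linear(feat_dim, num_classes)
        self.bumpiness = nn.Linear(feat_dim, num_classes)

def multitask_loss(heads, video_feat, labels):
    """Sum of per-property cross-entropy losses; trains all heads simultaneously.

    labels: dict of LongTensors with keys "hardness", "roughness", "bumpiness".
    """
    ce = nn.CrossEntropyLoss()
    return (ce(heads.hardness(video_feat), labels["hardness"])
            + ce(heads.roughness(video_feat), labels["roughness"])
            + ce(heads.bumpiness(video_feat), labels["bumpiness"]))
```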

B. Tactile Feature Alignment

We discard the fine-tuned model's classification heads and use the outputs of its visual encoder as the tactile embeddings. To align these embeddings with the LLM, the projection module is trained on language annotations while the encoder and the LLM are frozen. We also fine-tune the LLM's word embedding layer to account for the two newly added tokens that mark the start and end of the tactile data.
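A sketch of the trainability settings in this stage is given below, assuming a Hugging Face-style LLM that exposes get_input_embeddings(); function and variable names are illustrative.

```python
def set_alignment_trainability(encoder, projector, llm):
    """Stage (ii): train only the projection module and the LLM's word-embedding
    layer (for the two new tactile marker tokens); freeze everything else."""
    for p in encoder.parameters():
        p.requires_grad = False          # fine-tuned tactile encoder stays frozen
    for p in llm.parameters():
        p.requires_grad = False          # LLM weights stay frozen
    for p in projector.parameters():
        p.requires_grad = True           # projection module is trained
    for p in llm.get_input_embeddings().parameters():
        p.requires_grad = True           # embedding layer updated for the new tokens
```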

C. End-to-end Fine-tuning

Finally, we use end-to-end fine-tuning to improve the coherence of the LLM's responses and increase their similarity to the language annotations. In this stage, only the visual encoder is frozen; the word embedding layer, projection module, and LLM are fine-tuned. The LLM is fine-tuned with low-rank adaptation (LoRA) [23] for parameter efficiency.
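An illustrative LoRA configuration using the Hugging Face PEFT library is sketched below; the rank, scaling, dropout, and target modules shown are assumptions, not the values used in OCTOPI.

```python
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT library

# Illustrative LoRA setup for stage (iii); hyperparameters are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA/Vicuna
    task_type="CAUSAL_LM",
)
# llm = get_peft_model(llm, lora_config)  # wraps the LLM with trainable LoRA adapters
```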

This paper is available on arxiv under CC BY 4.0 DEED license.