Authors:
(1) Samson Yu, Dept. of Computer Science, National University of Singapore ([email protected]);
(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;
(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;
(4) Jiafei Duan, University of Washington;
(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute ([email protected]).
Table of Links
- Abstract and I. Introduction
- II. Related Work
- III. PhysiClear - Tactile and Physical Understanding Training & Evaluation Suite
- IV. Octopi - Vision-Language Property-Guided Physical Reasoning
- V. Experimental Setup
- VI. Experimental Results
- VII. Ablations
- VIII. Conclusion and Discussion, Acknowledgements, and References
- Appendix for Octopi: Object Property Reasoning with Large Tactile-Language Models
- APPENDIX A: ANNOTATION DETAILS
- APPENDIX B: OBJECT DETAILS
- APPENDIX C: PROPERTY STATISTICS
- APPENDIX D: SAMPLE VIDEO STATISTICS
- APPENDIX E: ENCODER ANALYSIS
- APPENDIX F: PG-INSTRUCTBLIP AVOCADO PROPERTY PREDICTION
VIII. CONCLUSION AND DISCUSSION
In this work, we extended large vision-language models (LVLMs) to process and describe tactile inputs using physical properties. We proposed a tactile dataset called PHYSICLEAR, comprising vision (camera) and tactile (GelSight) sensor data collected from everyday objects, along with physical property annotations. We also presented OCTOPI, a large tactile-language model trained on datasets like PHYSICLEAR to perform physical property reasoning from tactile inputs.
Our experiments show that OCTOPI can describe tactile signals from novel, unseen objects and that the inferred physical properties can be used for physical reasoning and robot task completion in scenarios with visual ambiguity. We studied the impact of OCTOPI's components and found that a task-specific visual encoder fine-tuned on our labels improves performance significantly across all tasks, which suggests that further improvements to the visual encoder would yield additional gains. In addition, parameter-efficient LLM fine-tuning consistently improved performance.
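For readers unfamiliar with parameter-efficient fine-tuning such as LoRA [23], the sketch below shows how low-rank adapters could be attached to a Vicuna-style base LLM with the Hugging Face peft library. The rank, scaling factor, target module names and base checkpoint are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of parameter-efficient fine-tuning with LoRA [23].
# Hyperparameters and module names below are illustrative, not the paper's setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # example base model

lora_cfg = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,                         # scaling factor for the adapters
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```

The frozen base weights stay untouched during training, so only a small fraction of parameters is updated, which is what makes this form of LLM fine-tuning practical alongside a tactile encoder.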
Our work opens up several directions for future research in tactile robotics. We are currently working on improving the tactile encoder and on more diverse exploratory procedures to obtain additional physical properties. It would also be interesting to combine different datasets (e.g., those collected with other tactile sensors [44, 45]), along with other modalities such as robot proprioception. We also plan to perform physical understanding alignment with object images and LLM fine-tuning with additional physical understanding data [52, 31].
ACKNOWLEDGEMENTS
This research is supported by the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation.
REFERENCES
[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 23716–23736, 2022.
[2] Stéphane Aroca-Ouellette, Cory Paik, Alessandro Roncone, and Katharina Kann. Prost: Physical reasoning of objects through space and time. arXiv preprint arXiv:2106.03634, 2021. URL https://arxiv.org/pdf/2106.03634.pdf.
[3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. URL https://arxiv.org/pdf/2308.12966.pdf.
[4] Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. PHYRE: A New Benchmark for Physical Reasoning. 2019. URL https://arxiv.org/pdf/1908.05656.pdf.
[5] Wouter M. Bergmann Tiest. Tactual perception of material properties. Vision Research, 50(24):2775–2782, 2010. ISSN 0042-6989. doi: https://doi.org/10.1016/j.visres.2010.10.005. URL https://www.sciencedirect.com/science/article/pii/S0042698910004967. Perception and Action: Part I.
[6] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020. URL https://arxiv.org/pdf/1911.11641.pdf.
[7] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. URL https://arxiv.org/pdf/2307.15818.pdf.
[8] Guanqun Cao, Jiaqi Jiang, Danushka Bollegala, and Shan Luo. Learn from Incomplete Tactile Data: Tactile Representation Learning with Masked Autoencoders. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10800–10805. IEEE, 2023. URL https://arxiv.org/pdf/2307.07358.pdf.
[9] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. URL https://arxiv.org/pdf/2310.09478.pdf.
[10] X. Chen, Fei Shao, Cathy Barnes, Tom Childs, and Brian Henson. Exploring Relationships between Touch Perception and Surface Physical Properties. International Journal of Design, 3:67–76, 08 2009. URL https://arxiv.org/pdf/1704.03822.pdf.
[11] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
[12] Jiafei Duan, Samson Yu, and Cheston Tan. Space: A simulator for physical interactions and causal learning in 3d environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2058–2063, 2021.
[13] Jiafei Duan, Samson Yu, Soujanya Poria, Bihan Wen, and Cheston Tan. PIP: Physical Interaction Prediction via Mental Simulation with Span Selection. In European Conference on Computer Vision, pages 405–421. Springer, 2022.
[14] Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022.
[15] Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, and Ranjay Krishna. Ar2-d2: Training a robot without a robot. arXiv preprint arXiv:2306.13818, 2023.
[16] Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232, 2024.
[17] Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, and Dorsa Sadigh. Physically grounded vision-language models for robotic manipulation. arXiv preprint arXiv:2309.02561, 2023. URL https://arxiv.org/pdf/2309.02561.pdf.
[18] Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, and Jiajun Wu. Objectfolder 2.0: A multisensory object dataset for sim2real transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10598–10608, 2022. URL https://arxiv.org/pdf/2204.02389.pdf.
[19] Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, and Jiajun Wu. The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17276–17286, June 2023. URL https://arxiv.org/pdf/2306.00956.pdf.
[20] Yang Gao, Lisa Anne Hendricks, Katherine J Kuchenbecker, and Trevor Darrell. Deep learning for tactile understanding from visual and haptic data. In 2016 IEEE international conference on robotics and automation (ICRA), pages 536–543. IEEE, 2016. URL https://arxiv.org/pdf/1511.06065.pdf.
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[22] Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, and Chuang Gan. Multiply: A multisensory object-centric embodied large language model in 3d world. arXiv preprint arXiv:2401.08577, 2024.
[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. URL https://arxiv.org/pdf/2106.09685.pdf.
[24] Hung-Jui Huang, Xiaofeng Guo, and Wenzhen Yuan. Understanding dynamic tactile sensing for liquid property estimation. arXiv preprint arXiv:2205.08771, 2022. URL https://arxiv.org/pdf/2205.08771.pdf.
[25] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[26] Jiaqi Jiang and Shan Luo. Robotic perception of object properties using tactile sensing. In Tactile Sensing, Skill Learning, and Robotic Dexterous Manipulation, pages 23–44. Elsevier, 2022.
[27] Roberta Klatzky and Catherine L Reed. Haptic exploration. Scholarpedia of Touch, pages 177–183, 2016.
[28] Mark H. Lee. Tactile sensing: New directions, new challenges. The International Journal of Robotics Research, 19(7):636–643, 2000. doi: 10.1177/027836490001900702. URL https://doi.org/10.1177/027836490001900702.
[29] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. URL https://dl.acm.org/doi/10.5555/3618408.3619222.
[30] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. URL https://arxiv.org/pdf/2305.06355.pdf.
[31] Lei Li, Jingjing Xu, Qingxiu Dong, Ce Zheng, Xu Sun, Lingpeng Kong, and Qi Liu. Can language models understand physical concepts? In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11843–11861, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.726. URL https://aclanthology.org/2023.emnlp-main.726.
[32] Rui Li and Edward H. Adelson. Sensing and recognizing surface textures using a gelsight sensor. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1241–1247, 2013. doi: 10.1109/CVPR.2013.164.
[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. URL https://arxiv.org/pdf/2304.08485.pdf.
[35] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[36] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv preprint arXiv:2306.05424, 2023. URL https://arxiv.org/pdf/2306.05424.pdf.
[37] Andrew Melnik, Robin Schiewer, Moritz Lange, Andrei Muresanu, Mozhgan Saeidi, Animesh Garg, and Helge Ritter. Benchmarks for Physical Reasoning AI. arXiv preprint arXiv:2312.10728, 2023. URL https://arxiv.org/pdf/2312.10728.pdf.
[38] Matthew Purri and Kristin Dana. Teaching cameras to feel: Estimating tactile physical properties of surfaces from images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pages 1–20. Springer, 2020. URL https://arxiv.org/pdf/2004.14487.pdf.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. URL https://arxiv.org/pdf/2103.00020.pdf.
[40] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023.
[41] Maki Sakamoto and Junji Watanabe. Exploring Tactile Perceptual Dimensions Using Materials Associated with Sensory Vocabulary. Frontiers in Psychology, 8, 2017. URL https://api.semanticscholar.org/CorpusID:14038261.
[42] Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos. arXiv preprint arXiv:2312.04746, 2023. URL https://arxiv.org/pdf/2312.04746.pdf.
[43] Kuniyuki Takahashi and Jethro Tan. Deep visuo-tactile learning: Estimation of tactile properties from images. In 2019 International Conference on Robotics and Automation (ICRA), pages 8951–8957. IEEE, 2019. URL https://arxiv.org/pdf/1803.03435.pdf.
[44] Tasbolat Taunyazov, Weicong Sng, Hian Hian See, Brian Lim, Jethro Kuan, Abdul Fatir Ansari, Benjamin CK Tee, and Harold Soh. Event-driven visual-tactile sensing and learning for robots. arXiv preprint arXiv:2009.07083, 2020. URL https://arxiv.org/pdf/2009.07083.pdf.
[45] Tasbolat Taunyazov, Luar Shui Song, Eugene Lim, Hian Hian See, David Lee, Benjamin CK Tee, and Harold Soh. Extended tactile perception: Vibration sensing through tools and grasped objects. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1755–1762. IEEE, 2021.
[46] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. URL https://arxiv.org/pdf/2312.11805.pdf.
[47] Johan Tegin and Jan Wikander. Tactile sensing in intelligent robotic manipulation—a review. Industrial Robot: An International Journal, 32, 02 2005. doi: 10.1108/01439910510573318.
[48] Stephen Tian, Frederik Ebert, Dinesh Jayaraman, Mayur Mudigonda, Chelsea Finn, Roberto Calandra, and Sergey Levine. Manipulation by feel: Touch-based control with deep predictive models. In 2019 International Conference on Robotics and Automation (ICRA), pages 818–824. IEEE, 2019. URL https://arxiv.org/pdf/1903.04128.pdf.
[49] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[50] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. URL https://arxiv.org/pdf/2307.09288.pdf.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[52] Yi Ru Wang, Jiafei Duan, Dieter Fox, and Siddhartha Srinivasa. NEWTON: Are Large Language Models Capable of Physical Reasoning?
[53] Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models. arXiv preprint arXiv:2312.06109, 2023. URL https://arxiv.org/pdf/2312.06109.pdf.
[54] Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning Physical Object Properties from Unlabeled Videos. In BMVC, volume 2, page 7, 2016. URL http://phys101.csail.mit.edu/papers/phys101_bmvc.pdf.
[55] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023. URL https://arxiv.org/pdf/2309.05519.pdf.
[56] Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, and Andrew Owens. Touch and go: Learning from human-collected vision and touch. arXiv preprint arXiv:2211.12498, 2022. URL https://arxiv.org/pdf/2211.12498.pdf.
[57] Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, et al. Binding touch to everything: Learning unified multimodal tactile representations. arXiv preprint arXiv:2401.18084, 2024.
[58] Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019. URL https://arxiv.org/pdf/1910.01442.pdf.
[59] Wenzhen Yuan, Mandayam A. Srinivasan, and Edward H. Adelson. Estimating object hardness with a gelsight touch sensor. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 208–215, 2016. doi: 10.1109/IROS.2016.7759057.
[60] Wenzhen Yuan, Siyuan Dong, and Edward H. Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12), 2017. ISSN 1424-8220. doi: 10.3390/s17122762. URL https://www.mdpi.com/1424-8220/17/12/2762.
[61] Wenzhen Yuan, Yuchen Mo, Shaoxiong Wang, and Edward H Adelson. Active clothing material perception using tactile sensing and deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4842–4849. IEEE, 2018. URL https://arxiv.org/pdf/1711.00574.pdf.
[62] Ben Zandonati, Ruohan Wang, Ruihan Gao, and Yan Wu. Investigating Vision Foundational Models for Tactile Representation Learning. arXiv preprint arXiv:2305.00596, 2023. URL https://arxiv.org/pdf/2305.00596.pdf.
[63] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
[64] Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, and Xiangyu Yue. Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023. URL https://arxiv.org/pdf/2307.10802.pdf.
[65] Chenchen Zhu, Fanyi Xiao, Andres Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Chang, Roshan Sumbaly, and Zhicheng Yan. Egoobjects: A large-scale egocentric dataset for fine-grained object understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[66] Yixin Zhu, Tao Gao, Lifeng Fan, Siyuan Huang, Mark Edmonds, Hangxin Liu, Feng Gao, Chi Zhang, Siyuan Qi, Ying Nian Wu, et al. Dark, beyond deep: A paradigm shift to cognitive ai with humanlike common sense. Engineering, 6(3):310–345, 2020.
Appendix for Octopi: Object Property Reasoning with Large Tactile-Language Models
In the appendices below, we provide more details for our PHYSICLEAR dataset, specifically for the annotations, objects, properties, sample videos and prompts. We also provide details on the OCTOPI encoder and avocado experiments.
APPENDIX A: ANNOTATION DETAILS
We developed a general set of guidelines to align the property annotations and reduce annotator variance; these guidelines are shown in Table A1. While the hardness and roughness properties are generally annotated based on haptic feedback from human exploratory procedures, the bumpiness property is generally annotated visually from the GelSight tactile images.
APPENDIX B: OBJECT DETAILS
In Tables A3 and A4, we provide the list of objects in PHYSICLEAR, their associated video names, and their splits (i.e., train/val/test). We use the object names in the left column for the Property-object Matching task.
In Table A5, we list objects that have semantically meaningful parts for part-based physical reasoning tasks. Specifically, we note that their semantic parts can be used for more precise Property-object Matching. Surface materials of PHYSICLEAR objects are listed in Table A6. PHYSICLEAR contains nine different common materials: fabric, food, leather, metal, paper, plastic, rubber, silicone and wood. While we do not use these materials for any task currently, we catalog them for future extensions (e.g. material classification using properties).
APPENDIX C: PROPERTY STATISTICS
We show the object count and percentage of each property category in Table A7. Every category accounts for at least 18.92% of its property, so the dataset is reasonably balanced.
Table A8 shows the count for each combination of the three properties: hardness, roughness and bumpiness. The joint distribution of these properties across everyday objects is not uniform: some combinations are much more common (e.g., [Hard, smooth, no bumps] for human-made objects) than others (e.g., [Soft, rough, no bumps]). Furthermore, the curse of dimensionality means that it becomes harder to represent all combinations well as the number of properties grows. Our future work will incorporate better representation learning methods to handle this imbalance.
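For illustration, the per-property counts (cf. Table A7) and joint combination counts (cf. Table A8) can be tallied with a few lines of Python. The annotation dictionary and category strings below are a minimal sketch and do not reflect the actual PHYSICLEAR file format.

```python
# Sketch: tally per-property category counts and joint property combinations.
# The annotations dict is illustrative, not the real PHYSICLEAR annotation file.
from collections import Counter

annotations = {
    "sponge":    {"hardness": "soft", "roughness": "rough", "bumpiness": "no bumps"},
    "spoon":     {"hardness": "hard", "roughness": "smooth", "bumpiness": "no bumps"},
    "corkboard": {"hardness": "moderately hard", "roughness": "rough", "bumpiness": "small bumps"},
}

# Per-property category counts and percentages (cf. Table A7).
for prop in ("hardness", "roughness", "bumpiness"):
    counts = Counter(a[prop] for a in annotations.values())
    total = sum(counts.values())
    for category, n in counts.items():
        print(f"{prop:10s} {category:16s} {n:3d} ({100 * n / total:.2f}%)")

# Joint combination counts (cf. Table A8).
combos = Counter((a["hardness"], a["roughness"], a["bumpiness"]) for a in annotations.values())
print(combos)
```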
APPENDIX D: SAMPLE VIDEO STATISTICS
There are five to seven tactile videos for each object in PHYSICLEAR. Table A9 reports the average, minimum and maximum numbers of frames in our tactile videos: 112.30, 50 and 126 respectively.
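As a rough illustration, frame statistics like those in Table A9 could be computed directly from the raw tactile videos with OpenCV; the directory path and file extension below are hypothetical.

```python
# Sketch: compute average/min/max frame counts over a folder of tactile videos.
import glob
import cv2

frame_counts = []
for path in glob.glob("physiclear_videos/*.mp4"):  # hypothetical video directory
    cap = cv2.VideoCapture(path)
    frame_counts.append(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)))
    cap.release()

if frame_counts:
    print("avg:", sum(frame_counts) / len(frame_counts),
          "min:", min(frame_counts),
          "max:", max(frame_counts))
```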
A. Language Prompt Details
We list the questions for each task in this section.
APPENDIX E: ENCODER ANALYSIS
Fig. 6 visualizes the confusion matrices for the fine-tuned CLIP classifier's predictions, and Fig. 7 plots its visual encoder's output embeddings (dimensionality-reduced) for all unseen objects. The CLIP classifier performs well on Hard objects but less well on Soft and Moderately hard objects. This is also evident in the encoder embeddings in the leftmost part of Fig. 7, where the embeddings for Hard objects are well-separated while those for Moderately hard objects are especially poorly separated. For the roughness property, the CLIP classifier performs well on Smooth and Slightly rough objects but less well on Rough objects; in Fig. 7, some of the embeddings for Rough objects are interspersed with those for Smooth objects. Lastly, for the bumpiness property, the classifier discriminates No bumps and Small bumps well but struggles with Big bumps. In the encoder embeddings on the right side of Fig. 7, the embeddings for objects with Big bumps are interspersed with those for objects with No bumps and Small bumps.
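The analysis above combines a per-property confusion matrix with a 2-D projection of encoder embeddings. The sketch below shows one way to reproduce it with scikit-learn, using placeholder arrays and a t-SNE projection; the actual labels, embeddings and dimensionality-reduction method used for Figs. 6 and 7 may differ.

```python
# Sketch: confusion matrix for one property classifier and a 2-D t-SNE view of
# encoder embeddings. Arrays are random placeholders; in practice they would come
# from the fine-tuned CLIP encoder and its classification head on unseen objects.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder labels/predictions for hardness (0 = soft, 1 = moderately hard, 2 = hard).
y_true = rng.integers(0, 3, size=200)
y_pred = rng.integers(0, 3, size=200)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print("Hardness confusion matrix:\n", cm)

# Placeholder encoder embeddings (one 512-d vector per tactile sample).
embeddings = rng.normal(size=(200, 512)).astype(np.float32)
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print("2-D embedding coordinates:", coords_2d.shape)
```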
APPENDIX F: PG-INSTRUCTBLIP AVOCADO PROPERTY PREDICTION
For the avocado experiments with PG-InstructBLIP, we take 20 RGB images of each avocado from a top-down view on a flat gray surface. We vary the position and orientation of each avocado for each image and use cropped images (example shown in Fig. 8); cropping brings the avocado images closer to the dataset PG-InstructBLIP was trained on. We use the original prompt provided in PG-InstructBLIP's example script and prompt the model for each physical property:
• Hardness - “Classify the object as hard, soft or medium? Respond unknown if you are not sure. Short answer:”
• Roughness - “Classify the object as rough, smooth or medium? Respond unknown if you are not sure. Short answer:”
• Bumpiness - “Classify the object as having big bumps, small bumps or no bumps? Respond unknown if you are not sure. Short answer:”
PG-InstructBLIP is not trained on our three physical properties, and we find that it never chooses “moderately hard” or “slightly rough”. Hence, we replace those categories with “medium” for hardness and roughness respectively.
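For reference, querying an InstructBLIP-style model with one of the prompts above might look like the sketch below, using the generic InstructBLIP classes in Hugging Face transformers. The checkpoint name and image path are placeholders; the actual PG-InstructBLIP weights and inference script may differ.

```python
# Sketch: prompt an InstructBLIP-style model with a cropped avocado image and the
# hardness prompt listed above. Checkpoint and image path are placeholders.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_name = "Salesforce/instructblip-flan-t5-xl"  # placeholder; PG-InstructBLIP weights would go here
processor = InstructBlipProcessor.from_pretrained(model_name)
model = InstructBlipForConditionalGeneration.from_pretrained(model_name)

image = Image.open("avocado_crop.jpg")  # hypothetical cropped avocado image
prompt = ("Classify the object as hard, soft or medium? "
          "Respond unknown if you are not sure. Short answer:")

inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```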
This paper is available on arxiv under CC BY 4.0 DEED license.