In recent years, natural language processing (NLP) has seen immense progress through the development of large language models (LLMs) like GPT-3, which can generate human-like text and engage in natural conversations. However, a major limitation of these models is that they operate solely in the text domain - they cannot perceive or reason about visual inputs like images. Enabling LLMs to handle multimodal visual tasks alongside language would be a major leap toward more capable artificial intelligence.

A new paper from researchers at Tsinghua University, Tencent AI Lab, and the Chinese University of Hong Kong introduces an exciting new method that can efficiently teach existing LLMs to utilize visual tools and models for comprehending and generating images. Their approach, called GPT4Tools, demonstrates that we may not need proprietary models with inaccessible architectures and datasets to impart visual capabilities to language models. This could expand the horizons of what is possible with LLMs using available resources.

Why Visual Grounding Matters for Language Models

Up until now, advanced proprietary models like ChatGPT have shown impressive performance on language tasks through self-supervision on vast amounts of text data. However, their understanding is entirely based on statistical patterns in language corpora - they have no way to ground symbols and words to real-world visual concepts.

This limits their reasoning ability and makes them unreliable for tasks that require connecting language to vision. For instance, they cannot determine if a statement like "the balloon is green" is true or not without seeing the actual image of the balloon. Nor can they generate new reasonable images from textual descriptions.

Grounding language in vision gives much-needed common sense and semantics to models. It also unlocks countless new applications at the intersection of computer vision and NLP, like visual question answering, image captioning, multimodal search, assistive technology for the visually impaired, and more.

That's why techniques to impart visual capabilities to LLMs could be transformative. The GPT4Tools approach demonstrates an efficient way to do this with available resources.

Overview of Technical Approach

The researchers' key insight is to leverage advanced LLMs like ChatGPT as powerful "teacher models" to provide visual grounding data for training other LLMs.

Here are the main steps of their technical approach:

  1. They first prompt ChatGPT with image captions and tool definitions to generate a dataset of 41k instruction-response pairs that use 23 defined visual tools (for tasks like segmentation, depth prediction, etc.).

  2. Next, they augment this data by adding negative samples (to avoid overfitting on tool usage) and context samples (for complete conversations).

  3. They use the collected 3-part dataset to fine-tune existing smaller LLMs like Vicuna and OPT by keeping the base model frozen and only updating the rank-decomposition components using a technique called Low-Rank Adaptation. This allows efficient specialization for tool usage without forgetting language capabilities.

  4. They propose metrics to evaluate whether models can accurately determine when tools should be used, choose the right tools, and provide proper arguments. Test sets are constructed with seen and unseen tools.

This approach allows repurposing the knowledge of advanced LLMs to rapidly impart visual grounding abilities to other models without extensive training resources.

Key Results

The experiments validate that GPT4Tools can successfully teach existing LLMs to handle visual tasks in a zero-shot manner:

Some limitations are that the success rate is not yet 100%, and verbose prompting for tools reduces efficiency. But overall, these results are very promising.

Implications

The GPT4Tools method provides an exciting direction for imbuing visual capabilities in existing LLMs without requiring expensive training of massive models on inaccessible proprietary data.

Some key implications are:

Of course, there are still challenges to address. The success rate needs to improve, prompting efficiency should be enhanced, and advances must translate to benefits for real users. But overall, enabling LLMs to dynamically invoke tools and skills could be a crucial milestone towards more general artificial intelligence.

Conclusion

The GPT4Tools approach demonstrates that self-instruction from advanced teacher LLMs provides an efficient way to impart visual capabilities to existing language models without extensive proprietary resources. The techniques could open up new possibilities at the intersection of vision and language AI. Through instructional learning and tool invocation, language models may learn to dynamically acquire skills as needed for multimodal reasoning and generation.


Also published here.