Table of Links
4 Results and 4.1 Increasing number of demonstrating examples
4.2 Impact of batching queries
A. Prompts used for ICL experiments
C. GPT4(V)-Turbo performance under many-shot ICL
D. Performance of many-shot ICL on medical QA tasks
Acknowledgments and Disclosure of Funding
3 Methods
We conduct several experiments to test the effect of increasing the number of demonstrating examples on the performance of two state-of-the-art multimodal foundation models: GPT-4o and Gemini 1.5 Pro (Section 3.1). We benchmark their performance using standard metrics as well as an ICL data efficiency metric (Section 3.3) on 10 datasets spanning several vision domains and image classification tasks (Section 3.2). We conduct ablation studies to test the impact of batching queries on model performance and to explain the substantial improvement observed in the zero-shot setting (Section 4.2). We refer to the many-shot in-context learning framework as many-shot ICL. Figure 1 provides an illustrative summary of many-shot ICL and batched many-shot ICL compared to zero-shot and few-shot ICL.
3.1 Models
We use three state-of-the-art multimodal foundation models with public API access, namely GPT-4o, GPT4(V)-Turbo [4], and Gemini 1.5 Pro [16]. Because GPT-4o performs substantially better than GPT4(V)-Turbo, we focus on the results of GPT-4o and Gemini 1.5 Pro in the main text and include GPT4(V)-Turbo results in the Appendix. We do not use Claude 3 Opus in our experiments, as it accepts at most 20 images per request at the time of writing. The specific endpoint for GPT-4o is “gpt-4o-2024-05-13”, for GPT4(V)-Turbo is “gpt-4-turbo-2024-04-09”, and for Gemini 1.5 Pro is “gemini-1.5-pro-preview-0409”. We use the API service provided by OpenAI for GPT-4o and GPT4(V)-Turbo, and the API service provided by Google Cloud on Vertex AI for Gemini 1.5 Pro. We set the temperature to zero for all models and fix the random seed for GPT-4o and GPT4(V)-Turbo to obtain more deterministic responses. In the rare cases where a model abstains from answering, we rerun the query until an answer is provided.
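To make the querying setup concrete, the following is a minimal sketch of how a deterministic GPT-4o request with interleaved images could be issued through the OpenAI Python client; the encode_image and query_gpt4o helpers and the message layout are illustrative assumptions, not the exact pipeline used in the experiments.

```python
# Minimal sketch of a deterministic GPT-4o query with images (assumed helpers).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def query_gpt4o(prompt: str, image_paths: list[str]) -> str:
    """Send a text prompt plus images; rerun if the model returns no answer."""
    content = [{"type": "text", "text": prompt}] + [
        {"type": "image_url", "image_url": {"url": encode_image(p)}}
        for p in image_paths
    ]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o-2024-05-13",
            messages=[{"role": "user", "content": content}],
            temperature=0,  # reduce randomness in decoding
            seed=42,        # fixed seed to further reduce run-to-run variation
        )
        answer = response.choices[0].message.content
        if answer:  # abstentions/empty responses are rare; retry if they occur
            return answer
```

An analogous request can be sent to the Gemini 1.5 Pro endpoint on Vertex AI with the same temperature setting.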
3.2 Datasets
We benchmark the model performance on 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We focus on image classification because other tasks, such as region captioning, would require substantially more tokens, thereby limiting the total number of demonstrating examples, and because most LMMs are not yet capable of accurately producing the localizations (e.g., bounding boxes and segmentation masks) required for such tasks [17, 18]. Table 1 provides a summary of the datasets used in this study.
For all datasets, we construct a set of demonstration (demo) examples, used for in-context learning, from the original training and validation splits, and a test set, used to evaluate model performance, from the original test split (if one exists). We randomly sample the demo and test sets from the original dataset without replacement. For the multi-class and fine-grained classification datasets, we perform class-stratified sampling, ensuring an equal number of examples per class in both the demo and test sets. For the multi-label classification dataset (CheXpert), we sample an equal number of positive and negative examples per class in both the demo and test sets; we note that, since the task is multi-label, this procedure does not yield an exactly equal number of examples per class. The per-dataset sizes of the full demo and test sets are shown in Table 1, and for the scaling experiments we increase the number of demonstration examples up to these sizes while maintaining class balance.
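As an illustration of the class-stratified sampling described above, the following sketch draws balanced demo and test sets without replacement; the examples_by_class input and the stratified_split function are hypothetical names, and the CheXpert multi-label case would require a separate per-class positive/negative sampling step.

```python
# Illustrative sketch of class-stratified sampling into demo and test sets.
# examples_by_class maps each class label to its list of examples (assumed input).
import random

def stratified_split(examples_by_class, n_demo_per_class, n_test_per_class, seed=0):
    """Sample equal numbers of demo and test examples per class, without replacement."""
    rng = random.Random(seed)
    demo, test = [], []
    for label, examples in examples_by_class.items():
        sampled = rng.sample(examples, n_demo_per_class + n_test_per_class)
        demo += [(x, label) for x in sampled[:n_demo_per_class]]
        test += [(x, label) for x in sampled[n_demo_per_class:]]
    rng.shuffle(demo)  # avoid grouping demonstrations by class in the prompt
    rng.shuffle(test)
    return demo, test
```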
3.3 Evaluation Metrics
We use standard metrics to evaluate model performance on each dataset. Specifically, we measure performance using accuracy for all multi-class classification datasets, as they are sampled to have a balanced class distribution. For multi-label classification on CheXpert, we use the macro-averaged F1 score. In the rare case of a parsing error, we consider the response incorrect. To estimate the variability of the evaluation metrics, we compute standard deviations using bootstrapping with 1,000 bootstrap replicates.
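As a sketch of the bootstrap procedure (assuming the test-set labels and parsed model predictions are already available), the standard deviation of a metric over 1,000 resamples could be computed as follows; bootstrap_std and metric_fn are illustrative names.

```python
# Sketch of bootstrap variability estimation for an evaluation metric.
import numpy as np

def bootstrap_std(y_true, y_pred, metric_fn, n_boot=1000, seed=0):
    """Standard deviation of metric_fn over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        scores.append(metric_fn(y_true[idx], y_pred[idx]))
    return float(np.std(scores))
```

For example, metric_fn could be scikit-learn's accuracy_score for the multi-class datasets, or a macro-averaged f1_score (average="macro") for CheXpert.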
Authors:
(1) Yixing Jiang, Stanford University;
(2) Jeremy Irvin, Stanford University;
(3) Ji Hun Wang, Stanford University;
(4) Muhammad Ahmed Chaudhry, Stanford University;
(5) Jonathan H. Chen, Stanford University;
(6) Andrew Y. Ng, Stanford University.