Abstract and 1 Introduction

  1. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  2. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  3. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  4. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

4. Experiments

To assess the effectiveness of our proposed approach, we conduct evaluations across various scenarios:

  1. where each sample comprises a single image, and patches are naturally considered as instances;

  2. where each sample includes multiple instances, but we use a general embedding for each image;

  3. where each sample contains multiple images, with each image containing multiple patches.

4.1. General Setup

We initialize our model using BLIP2 [22] with FLAN-T5- XL. MIVPG is initialized with weights from QFormer. The model consists of a frozen language model and a frozen visual model. During training, we only update the MIVPG. The visual encoder, ViT-G, is employed to encode images into patches of embeddings, and the images are resized to dimensions of 224 × 224. In our experiments, we observed that unfreezing the visual encoder does not lead to additional improvements in datasets with small sizes. Further details can be found in the supplementary C.1.

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.