Table of Links
- Related Work
- Methodology
3.1. Preliminaries and Notations
3.2. Relations between Attention-based VPG and MIL
3.3. MIVPG for Multiple Visual Inputs
3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios
- Experiments and 4.1. General Setup
4.2. Scenario 1: Samples with Single Image
4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
Supplementary Material
A. Detailed Architecture of QFormer
4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
Next, we evaluate our method in scenarios involving multiple images, where each image contributes only one embedding as its representation. Specifically, we use the PatchGastricADC22 [36] dataset, a Whole Slide Image (WSI) dataset. It includes 991 WSIs of H&E-stained gastric adenocarcinoma specimens, accompanied by diagnostic captions extracted directly from existing medical reports. The dataset contains 262,777 medical patches in total, with each WSI comprising up to 1,860 patches. Each patch is 300 × 300 pixels and is encoded by the visual encoder after resizing. The dataset is partitioned into training, validation, and test subsets following the methodology in [36], with a split ratio of 0.7, 0.1, and 0.2, respectively. We compare the proposed method against the baselines in [36], which combine a visual model (DenseNet121 [15] or EfficientNetB3 [35]) with an LSTM [12] as the language model. To ensure a fair comparison, we run three experiments with different random seeds and apply the same data augmentation as in [36]. In a medical patch, the focus is typically on global information rather than local details. Additionally, since a WSI can comprise a large number of patches, we aim to reduce computational overhead. We therefore use only the [CLS] token output by the ViT as the representation of each medical patch, so that P = 1.
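As a concrete illustration of this setup, the sketch below shows how one [CLS] embedding per patch can be collected into a bag of instance representations for a WSI with P = 1. This is a minimal PyTorch-style sketch under assumed names: `vision_encoder` is a hypothetical stand-in for the frozen ViT of the MLLM, not the released implementation.

```python
import torch

# Hypothetical frozen ViT wrapper: maps a batch of patches to token embeddings
# of shape (B, 1 + num_patches, D), with the [CLS] token at index 0.
# Assumed interface for illustration only.
def vision_encoder(pixel_values: torch.Tensor) -> torch.Tensor:
    B = pixel_values.shape[0]
    return torch.randn(B, 197, 768)  # placeholder output for a 224x224 ViT-B/16

def encode_wsi(patches: torch.Tensor, batch_size: int = 64) -> torch.Tensor:
    """Encode all patches of one WSI, keeping only the [CLS] token per patch.

    patches: (N, 3, 224, 224) tensor of resized 300x300 patches.
    returns: (N, 1, D) bag of instance embeddings, i.e. P = 1 per instance.
    """
    cls_tokens = []
    with torch.no_grad():
        for start in range(0, patches.shape[0], batch_size):
            tokens = vision_encoder(patches[start:start + batch_size])
            cls_tokens.append(tokens[:, 0])   # keep [CLS] only, drop local patch tokens
    bag = torch.cat(cls_tokens, dim=0)        # (N, D)
    return bag.unsqueeze(1)                   # (N, P=1, D) instances for the MIVPG
```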
As shown in Table 1, our method outperforms the baselines significantly, highlighting the effectiveness of employing large-scale models in downstream tasks. Moreover, the experiments indicate that the model performs even better when correlations among instances are considered, underscoring the effectiveness of our CSA module. We are also interested in how the captions generated by the LLM evolve as the number of training epochs increases. Given the substantial domain gap between medical and natural images, existing MLLMs have rarely been trained on medical images and thus have little domain-specific knowledge for medical analysis. As depicted in Figure 5, under the zero-shot setting BLIP2 struggles to generate detailed captions for the provided WSIs. With more training epochs, however, the model acquires domain-specific knowledge and produces more relevant captions. Much like human learning, a clear trend emerges: the model initially generates very general captions and gradually incorporates more detail as training progresses.
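The CSA module itself is defined in Section 3.4; as a rough sketch of the underlying idea (letting the patch-level instance embeddings attend to one another before query-based aggregation), one possible form is a single residual self-attention layer over the bag, as below. The class name `CorrelatedSelfAttention` and its dimensions and head count are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CorrelatedSelfAttention(nn.Module):
    """Illustrative sketch: let WSI patch embeddings exchange information
    before they are aggregated by the visual prompt generator (e.g., QFormer).
    Not the exact CSA of Section 3.4; hyperparameters are assumptions."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (1, N, D) -- N patch-level [CLS] embeddings of one WSI
        correlated, _ = self.attn(bag, bag, bag)  # instances attend to each other
        return self.norm(bag + correlated)        # residual keeps the original content

# usage: a bag of shape (1, N, 768) maps to correlated instances of the same shape,
# which are then passed on for aggregation into visual prompts.
```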
Authors:
(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);
(2) Wenyi Wu, Amazon ([email protected]);
(3) Qi Li, Amazon ([email protected]);
(4) Rob Barton, Amazon ([email protected]);
(5) Boxin Du, Amazon ([email protected]);
(6) Shioulin Sam, Amazon ([email protected]);
(7) Karim Bouyarmane, Amazon ([email protected]);
(8) Ismail Tutar, Amazon ([email protected]);
(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).
[1] For consistency, we opted for metrics implemented in https://github.com/salaniz/pycocoevalcap.
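For reference, a minimal way to compute the caption metrics with that package follows the repository's documented workflow; the JSON file names below are placeholders, and the ground truth and predictions are assumed to be in COCO caption format.

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder file names; both files must follow the COCO caption format.
coco = COCO("captions_gt.json")                    # ground-truth diagnostic captions
coco_result = coco.loadRes("captions_pred.json")   # model-generated captions

coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.params["image_id"] = coco_result.getImgIds()  # score only captioned images
coco_eval.evaluate()                               # BLEU, METEOR, ROUGE-L, CIDEr, ...

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")
```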