Abstract and 1. Introduction

  2. Related Work

    2.1 Open-world Video Instance Segmentation

    2.2 Dense Video Object Captioning and 2.3 Contrastive Loss for Object Queries

    2.4 Generalized Video Understanding and 2.5 Closed-World Video Instance Segmentation

  3. Approach

    3.1 Overview

    3.2 Open-World Object Queries

    3.3 Captioning Head

    3.4 Inter-Query Contrastive Loss and 3.5 Training

  4. Experiments and 4.1 Datasets and Evaluation Metrics

    4.2 Main Results

    4.3 Ablation Studies and 4.4 Qualitative Results

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. Additional Analysis

B. Implementation Details

C. Limitations

2.1 Open-world Video Instance Segmentation

Prompt-less methods. Several works have explored the open-world video instance segmentation problem without requiring initial prompts [2, 10, 28, 39, 44]. New objects are discovered based on an objectness score. Some works [2, 28] rely on classic region-based object proposals. However, recent query-based methods [10, 39] have been shown to outperform region-proposal-based methods for object detection and segmentation. OW-VISFormer [39] proposes a feature enrichment mechanism and a spatio-temporal objectness module to distinguish foreground objects from the background. Different from our work, it operates on a single type of object query. Also orthogonal to our work is DEVA [10], which develops a class-agnostic temporal propagation approach to track objects detected by a segmentation network. UVO [44] and BURST [2] provide novel datasets and region-proposal-based baselines for open-world video instance segmentation. In this work, we evaluate our approach on the BURST [2] dataset, as it is more diverse and allows common and uncommon classes to be evaluated separately.

Prompt-based methods. Prompt-based methods rely on prompts, i.e., prior knowledge, to segment objects in videos. Video object segmentation (VOS) [11, 17, 19, 20, 35, 40, 43, 51, 56] is a task where the ground-truth segmentations of objects (belonging to any category) in the first frame are available and act as prompts to track the objects throughout the video. CLIPSeg [31] uses words as prompts: following CLIP [38], it projects the words into the image feature space and computes a similarity matrix to obtain a segmentation mask for the given word prompt. Prompt-based methods work well when one has prior knowledge about what or where to look. However, in a true open-world setting, such prior knowledge may not be available. We operate in such a setting.
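
To make the word-prompt mechanism concrete, the sketch below scores per-pixel image features against a text embedding and thresholds the similarity map into a mask. It is an illustration of the general idea, not CLIPSeg's [31] actual implementation; the tensor shapes, the `threshold` value, and the assumption that `pixel_feats` and `text_emb` already live in a joint vision-language space are ours.

```python
import torch
import torch.nn.functional as F

def word_prompt_mask(pixel_feats: torch.Tensor,
                     text_emb: torch.Tensor,
                     threshold: float = 0.2) -> torch.Tensor:
    """Illustrative word-prompted segmentation.

    pixel_feats: (C, H, W) image features projected into a joint
                 vision-language embedding space (e.g., CLIP-style).
    text_emb:    (C,) embedding of the word prompt.
    Returns a binary mask of shape (H, W).
    """
    # Cosine similarity between the prompt and every pixel feature.
    pixel_feats = F.normalize(pixel_feats, dim=0)
    text_emb = F.normalize(text_emb, dim=0)
    similarity = torch.einsum('chw,c->hw', pixel_feats, text_emb)

    # Threshold the similarity map into a segmentation mask.
    return (similarity > threshold).float()
```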

In this work, we combine the advantages of both prompt-based and prompt-less methods. Notably, we do not require additional prompts from the ground truth or from separate networks. Instead, we use equally spaced points distributed over the video frames and encode them as prompts to discover new objects. Also note that all the aforementioned methods, prompt-less or prompt-based, assign only a one-word label to a detected object. In contrast, in this work we are interested in generating a rich object-centric caption for each detected object.
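
As a rough sketch of the point-prompt idea mentioned above, the snippet below generates a uniform grid of points over a frame and encodes each point with a sinusoidal positional encoding to serve as a prompt embedding. The grid spacing, the encoding scheme, and `embed_dim` are illustrative assumptions, not the paper's exact configuration.

```python
import math
import torch

def grid_point_prompts(points_per_side: int = 16,
                       embed_dim: int = 256) -> torch.Tensor:
    """Encode equally spaced points over a frame as prompt embeddings.

    Returns a tensor of shape (points_per_side**2, embed_dim).
    """
    # Equally spaced normalized coordinates in (0, 1), avoiding the borders.
    steps = (torch.arange(points_per_side) + 0.5) / points_per_side
    ys, xs = torch.meshgrid(steps, steps, indexing='ij')
    points = torch.stack([xs.flatten(), ys.flatten()], dim=-1)   # (N, 2)

    # Sinusoidal positional encoding of the (x, y) coordinates.
    half = embed_dim // 4                       # frequencies per coordinate
    freqs = torch.exp(torch.arange(half) * (-math.log(10000.0) / half))
    angles = points.unsqueeze(-1) * freqs       # (N, 2, half)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)        # (N, 2, 2*half)
    return enc.flatten(1)                       # (N, embed_dim)

# Example: 16 x 16 = 256 prompt embeddings of dimension 256 per frame.
prompts = grid_point_prompts()
```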

2.2 Dense Video Object Captioning

Dense video object captioning involves detecting, tracking, and captioning trajectories of objects in a video. For this task, DVOC-DS [58] extends the image-based GRiT [47] and trains it with a mixture of disjoint tasks. In addition, the existing video grounding datasets VidSTG [57] and VLN [41] are re-purposed for this task. However, DVOC-DS [58] cannot caption multiple action segments within a single object trajectory, because the method produces a single caption for the entire trajectory. In addition, similar to many other video models [1, 45, 48], DVOC-DS [58] struggles with very long videos and only processes up to 200 frames. The method is further constrained to known object categories, and it is unclear how it extends to an open-world setting.

Unlike DVOC-DS [58], we use open-world object queries to tackle open-world video instance segmentation. In addition, we process video frames sequentially using a short temporal context. Hence, our method is able to process long videos as well as handle multiple action segments within a single object trajectory. Finally, we leverage masked attention for dense video object captioning, which has not been explored before.
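
The sketch below illustrates masked cross-attention of the kind referred to above: attention logits for spatial locations outside an object's predicted mask are suppressed, so a query only attends to features of that object. The single-head formulation, tensor shapes, and the empty-mask fallback are simplifying assumptions for illustration, not the paper's exact captioning head.

```python
import torch

def masked_cross_attention(queries: torch.Tensor,
                           features: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Single-head masked cross-attention (illustrative).

    queries:  (Q, C)  object/caption queries.
    features: (HW, C) flattened per-frame image features.
    mask:     (Q, HW) binary masks, 1 inside the predicted object region.
    Returns attended features of shape (Q, C).
    """
    scale = queries.shape[-1] ** -0.5
    logits = (queries @ features.t()) * scale            # (Q, HW)

    # Suppress locations outside each object's mask before the softmax.
    masked_logits = logits.masked_fill(mask == 0, float('-inf'))

    # If a mask is empty, fall back to attending over the whole frame.
    valid = mask.sum(dim=-1, keepdim=True) > 0
    logits = torch.where(valid, masked_logits, logits)

    attn = logits.softmax(dim=-1)                         # (Q, HW)
    return attn @ features                                # (Q, C)
```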

2.3 Contrastive Loss for Object Queries

Many recent approaches use a contrastive loss to aid video instance segmentation. OW-VISFormer [39] uses a contrastive loss in the open-world setting: it ensures that the assigned foreground objects are similar to each other while being different from the background objects. IDOL [50] works in the closed-world setting and uses an inter-frame contrastive loss to ensure that object queries belonging to the same object are similar across frames, while object queries of different instances differ. In contrast, in this work, for both the closed- and open-world settings, we use a contrastive loss to ensure that no two foreground object queries are similar to each other, even within the same frame.
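
A minimal sketch of such an intra-frame inter-query objective is given below: it penalizes high pairwise cosine similarity among the foreground object queries of one frame, pushing them apart. The exact loss form, the softplus penalty, and the temperature are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def inter_query_contrastive_loss(fg_queries: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Push foreground object queries of one frame away from each other.

    fg_queries: (N, C) embeddings of the queries matched to foreground objects.
    Returns a scalar loss that is small when all pairwise similarities are low.
    """
    if fg_queries.shape[0] < 2:
        return fg_queries.new_zeros(())

    # Pairwise cosine similarities between all foreground queries.
    q = F.normalize(fg_queries, dim=-1)
    sim = (q @ q.t()) / temperature                     # (N, N)

    # Drop self-similarity and penalize the remaining pairs.
    off_diag = ~torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    return F.softplus(sim[off_diag]).mean()
```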

2.4 Generalized Video Understanding

Recently, there has been progress in unifying different video-related tasks. TubeFormer [23], Unicorn [53], and CAROQ [13] unify different video segmentation tasks in the closed world. DVOC-DS [58] unifies the tasks of detecting closed-world objects in videos and captioning those objects. In this work, we explore the task of detecting or segmenting both closed- and open-world objects in videos, and captioning these objects.

Video understanding builds on strong, generalized image understanding. Several works generalize multiple image-related tasks. Some methods [8, 9, 21] combine different image segmentation methods and provide a baseline for many different video understanding tasks [7, 13, 18, 50]. X-Decoder [59] unifies different image segmentation tasks along with the task of referring image segmentation. SAM [24] introduces a vision foundation model that primarily performs prompt-based open-world image segmentation and can be used for many downstream tasks. Different from these works, we develop a generalized method for videos that tackles segmentation and object-centric captioning for both open- and closed-world objects.

2.5 Closed-World Video Instance Segmentation

Closed-world video instance segmentation involves simultaneously segmenting and tracking objects from a fixed category set in a video. Some works [3–5, 12, 16, 32, 36, 42, 52, 54, 55] rely on classical region-based proposals. Recent works [7, 13, 18, 33, 46, 49, 50] rely on query-based proposals and perform significantly better at discovering closed-world objects. In contrast, in this work we explore query-based proposals for both the closed- and open-world settings.

Authors:

(1) Anwesa Choudhuri, University of Illinois at Urbana-Champaign ([email protected]);

(2) Girish Chowdhary, University of Illinois at Urbana-Champaign ([email protected]);

(3) Alexander G. Schwing, University of Illinois at Urbana-Champaign ([email protected]).


This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.