This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: oQF6WnrqlJzfEC4AjVDMIWSTBuNl_WbSd1KQuQGaaJE

Video Instance Matting: Comparing Temporal Consistency and Detail Preservation

Written by @instancing | Published on 2025/12/23

TL;DR
MaGGIe balances temporal consistency and detail preservation, outperforming SparseMat in accuracy and matching InstMatt's high-fidelity output

Abstract and 1. Introduction

  2. Related Works

  3. MaGGIe

    3.1. Efficient Masked Guided Instance Matting

    3.2. Feature-Matte Temporal Consistency

  4. Instance Matting Datasets

    4.1. Image Instance Matting and 4.2. Video Instance Matting

  5. Experiments

    5.1. Pre-training on image data

    5.2. Training on video data

  6. Discussion and References

Supplementary Material

  7. Architecture details

  8. Image matting

    8.1. Dataset generation and preparation

    8.2. Training details

    8.3. Quantitative details

    8.4. More qualitative results on natural images

  9. Video matting

    9.1. Dataset generation

    9.2. Training details

    9.3. Quantitative details

    9.4. More qualitative results

9.4. More qualitative results

For a more detailed look at our model's performance, we recommend the examples on our website, which include comprehensive results and comparisons with previous methods. We also highlight outputs from specific frames in Fig. 19.

Regarding temporal consistency, SparseMat and our framework produce comparable results, but our model is more accurate. Notably, our output preserves detail on par with InstMatt while keeping alpha values consistent across the video, particularly in background and foreground regions. This balance between detail preservation and temporal consistency highlights our model's ability to handle the complexities of video instance matting.
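The temporal consistency discussed above is typically quantified with a dtSSD-style metric, which compares frame-to-frame changes in the predicted alpha against those of the ground truth. A minimal NumPy sketch is below; the function name and normalization are illustrative and may differ from the paper's exact evaluation code:

```python
import numpy as np

def dtssd(pred, gt):
    """dtSSD-style temporal error: RMS difference between the
    frame-to-frame changes of predicted and ground-truth alphas.
    pred, gt: arrays of shape (T, H, W) with values in [0, 1].
    Lower is better; identical temporal dynamics score 0."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    dp = np.diff(pred, axis=0)  # alpha change between consecutive frames
    dg = np.diff(gt, axis=0)
    return float(np.sqrt(np.mean((dp - dg) ** 2)))
```

A matte that flickers (large frame-to-frame alpha swings absent from the ground truth) is penalized even when each individual frame is accurate, which is why such a metric complements per-frame error measures.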

For each example, the first-frame human masks are generated by the r101 fpn 400e detector and propagated through the rest of the video by XMem.

Table 15. Our framework also reduces the errors of trimap-propagation baselines: replacing those models' matte decoders with ours reduces all error metrics by a large margin. Gray rows denote modules using public weights without retraining on our V-HIM2K5 dataset.

Authors:

(1) Chuong Huynh, University of Maryland, College Park (chuonghm@cs.umd.edu);

(2) Seoung Wug Oh, Adobe Research (seoh@adobe.com);

(3) Abhinav Shrivastava, University of Maryland, College Park (abhinav@cs.umd.edu);

(4) Joon-Young Lee, Adobe Research (jolee@adobe.com).


This paper is available on arXiv under a CC BY 4.0 DEED (Attribution 4.0 International) license.



