Abstract and 1. Introduction

  2. Related Works

  3. MaGGIe

    3.1. Efficient Masked Guided Instance Matting

    3.2. Feature-Matte Temporal Consistency

  4. Instance Matting Datasets

    4.1. Image Instance Matting and 4.2. Video Instance Matting

  5. Experiments

    5.1. Pre-training on image data

    5.2. Training on video data

  6. Discussion and References

Supplementary Material

  7. Architecture details

  8. Image matting

    8.1. Dataset generation and preparation

    8.2. Training details

    8.3. Quantitative details

    8.4. More qualitative results on natural images

  9. Video matting

    9.1. Dataset generation

    9.2. Training details

    9.3. Quantitative details

    9.4. More qualitative results

6. Discussion

Limitations and future work. MaGGIe demonstrates strong performance on human video instance matting with binary mask guidance, but several directions remain open. One notable limitation is the reliance on a one-hot representation at each location of the guidance mask, which requires every pixel to be assigned to exactly one instance. This requirement can pose challenges when integrating instance masks from varied sources, whose predictions may overlap or disagree, leading to misaligned guidance in some regions. Additionally, training on composited datasets may limit how well the model generalizes to natural, real-world footage. While building a comprehensive natural dataset remains a valuable goal, an interim solution is to combine segmentation datasets with self-supervised or weakly supervised learning, which could improve the model's adaptability and performance in more diverse and realistic settings.
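To make the one-hot constraint concrete, below is a minimal NumPy sketch (our illustration, not the paper's preprocessing code; `to_one_hot_guidance` is a hypothetical helper) of how overlapping binary instance masks must be collapsed so that each pixel belongs to at most one instance:

```python
import numpy as np

def to_one_hot_guidance(masks: np.ndarray) -> np.ndarray:
    """Collapse (N, H, W) possibly-overlapping binary masks into a
    one-hot guidance tensor of the same shape."""
    _, h, w = masks.shape
    owner = masks.argmax(axis=0)   # (H, W); overlap ties go to the lowest instance index
    any_fg = masks.any(axis=0)     # (H, W); True wherever some instance claims the pixel
    rows = np.arange(h)[:, None]   # (H, 1) row indices for advanced indexing
    cols = np.arange(w)[None, :]   # (1, W) column indices
    one_hot = np.zeros_like(masks)
    one_hot[owner, rows, cols] = 1  # assign every pixel to exactly one instance
    one_hot *= any_fg               # keep background pixels all-zero
    return one_hot

# Two instances whose masks overlap in a two-column band (columns 2-3):
masks = np.zeros((2, 4, 6), dtype=np.int64)
masks[0, :, :4] = 1
masks[1, :, 2:] = 1

guidance = to_one_hot_guidance(masks)
assert (guidance.sum(axis=0) <= 1).all()  # one-hot: at most one instance per pixel
print((masks.sum(axis=0) > 1).sum(), "overlapping pixels were forcibly reassigned")
```

Because the tie-break is arbitrary, whichever instance loses the overlap has its guidance silently erased there, which is exactly the kind of misalignment described above.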

Conclusion. Our study contributes to the evolving field of instance matting, with a focus that extends beyond human subjects. By integrating techniques such as transformer attention and sparse convolution, MaGGIe improves over previous methods in detail accuracy, temporal consistency, and computational efficiency for both image and video inputs. Additionally, our approach to synthesizing training data and our comprehensive benchmarking schema offer a new way to evaluate the robustness and effectiveness of instance matting models. This work represents a step forward in video instance matting and provides a foundation for future research in this area.

Acknowledgement. We sincerely thank Markus Woodson for the invaluable initial discussions. The first author is also deeply grateful to his wife, Quynh Phung, for her meticulous proofreading and feedback.

References

[1] Adobe. Adobe premiere. https://www.adobe.com/products/premiere.html, 2023. 1

[2] Apple. Cutouts object ios 16. https://support.apple.com/en-hk/102460, 2023. 1

[3] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015. 4

[4] Arie Berman, Arpag Dadourian, and Paul Vlahos. Method for removing from an image the background surrounding a selected object, 2000. US Patent 6,134,346. 2

[5] Guowei Chen, Yi Liu, Jian Wang, Juncai Peng, Yuying Hao, Lutao Chu, Shiyu Tang, Zewu Wu, Zeyu Chen, Zhiliang Yu, et al. Pp-matting: high-accuracy natural image matting. arXiv preprint arXiv:2204.09433, 2022. 2

[6] Xiangguang Chen, Ye Zhu, Yu Li, Bingtao Fu, Lei Sun, Ying Shan, and Shan Liu. Robust human matting via semantic guidance. In ACCV, 2022. 2

[7] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022. 2

[8] Ho Kei Cheng and Alexander G Schwing. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In ECCV, 2022. 1, 5

[9] Donghyeon Cho, Yu-Wing Tai, and Inso Kweon. Natural image matting using deep convolutional neural networks. In ECCV, 2016. 2

[10] Spconv Contributors. Spconv: Spatially sparse convolution library. https://github.com/traveller59/spconv, 2022. 5

[11] Marco Forte and François Pitié. F, B, Alpha matting. arXiv preprint arXiv:2003.07711, 2020. 1, 2

[12] Google. Magic editor in google pixel 8. https://pixel.withgoogle.com/Pixel_8_Pro/use-magic-editor, 2023. 1

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 11

[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017. 13

[15] Anna Katharina Hebborn, Nils Hohner, and Stefan Müller. Occlusion matting: realistic occlusion handling for augmented reality applications. In 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2017. 1

[16] Qiqi Hou and Feng Liu. Context-aware image matting for simultaneous foreground and alpha estimation. In ICCV, 2019. 1

[17] Wei-Lun Huang and Ming-Sui Lee. End-to-end video matting with trimap propagation. In CVPR, 2023. 1, 2, 3, 7, 23

[18] Chuong Huynh, Anh Tuan Tran, Khoa Luu, and Minh Hoai. Progressive semantic segmentation. In CVPR, 2021. 2

[19] Chuong Huynh, Yuqian Zhou, Zhe Lin, Connelly Barnes, Eli Shechtman, Sohrab Amirghodsi, and Abhinav Shrivastava. Simpson: Simplifying photo cleanup with single-click distracting object segmentation network. In CVPR, 2023. 2

[20] Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. Pytorch. Programming with TensorFlow: Solution for Edge Computing Applications, 2021. 5

[21] Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Video mask transfiner for high-quality video instance segmentation. In ECCV, 2022. 2

[22] Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Rynson WH Lau. Modnet: Real-time trimap-free portrait matting via objective decomposition. In AAAI, 2022. 2

[23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In ICCV, 2023. 2, 3

[24] Philip Lee and Ying Wu. Nonlocal matting. In CVPR, 2011. 2

[25] Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. IEEE TPAMI, 30(2), 2007. 2

[26] Jizhizi Li, Sihan Ma, Jing Zhang, and Dacheng Tao. Privacy-preserving portrait matting. In ACM MM, 2021. 2

[27] Jizhizi Li, Jing Zhang, and Dacheng Tao. Deep automatic natural image matting. In IJCAI, 2021. 2

[28] Jiachen Li, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Yunchao Wei, and Humphrey Shi. Vmformer: End-to-end video matting with transformer. arXiv preprint arXiv:2208.12801, 2022. 3

[29] Jizhizi Li, Jing Zhang, Stephen J Maybank, and Dacheng Tao. Bridging composite and real: towards end-to-end deep image matting. IJCV, 2022. 2, 13

[30] Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, and Humphrey Shi. Video instance matting. In WACV, 2024. 2

[31] Yaoyi Li and Hongtao Lu. Natural image matting via guided contextual attention. In AAAI, 2020. 1, 2

[32] Chung-Ching Lin, Jiang Wang, Kun Luo, Kevin Lin, Linjie Li, Lijuan Wang, and Zicheng Liu. Adaptive human matting for dynamic videos. In CVPR, 2023. 2, 3

[33] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In CVPR, 2021. 2, 3, 5

[34] Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with temporal guidance. In WACV, 2022. 2, 3

[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2

[36] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In CVPR, 2015. 2

[37] Hao Lu, Yutong Dai, Chunhua Shen, and Songcen Xu. Indices matter: Learning to index for deep image matting. In CVPR, 2019. 1, 2

[38] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019. 1

[39] Kwanyong Park, Sanghyun Woo, Seoung Wug Oh, In So Kweon, and Joon-Young Lee. Mask-guided matting in the wild. In CVPR, 2023. 1, 2, 3, 6, 19

[40] Khoi Pham, Kushal Kafle, Zhe Lin, Zhihong Ding, Scott Cohen, Quan Tran, and Abhinav Shrivastava. Improving closed and open-vocabulary attribute prediction using transformers. In ECCV, 2022. 2

[41] Khoi Pham, Chuong Huynh, and Abhinav Shrivastava. Composing object relations and attributes for image-text matching. In CVPR, 2024.

[42] Quynh Phung, Songwei Ge, and Jia-Bin Huang. Grounded text-to-image synthesis with attention refocusing. In CVPR, 2024. 2

[43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015. 13

[44] Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Background matting: The world is your green screen. In CVPR, 2020. 1

[45] Hongje Seong, Seoung Wug Oh, Brian Price, Euntai Kim, and Joon-Young Lee. One-trimap video matting. In ECCV, 2022. 1, 2, 3, 5, 6, 7, 23

[46] Xiaoyong Shen, Xin Tao, Hongyun Gao, Chao Zhou, and Jiaya Jia. Deep automatic portrait matting. In ECCV, 2016. 2

[47] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Semantic image matting. In CVPR, 2021. 2

[48] Yanan Sun, Guanzhi Wang, Qiao Gu, Chi-Keung Tang, and Yu-Wing Tai. Deep video matting via spatio-temporal alignment and aggregation. In CVPR, 2021. 3, 6

[49] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Human instance matting via mutual guidance and multi-instance refinement. In CVPR, 2022. 1, 2, 3, 5, 6, 7, 11, 13, 14, 16, 17, 18, 20

[50] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Ultrahigh resolution image/video matting with spatio-temporal sparsity. In CVPR, 2023. 2, 3, 4, 5, 6, 7, 12, 13, 16, 17, 18, 20

[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 3

[52] Tiantian Wang, Sifei Liu, Yapeng Tian, Kai Li, and Ming-Hsuan Yang. Video matting via consistency-regularized graph neural networks. In ICCV, 2021. 3, 5

[53] Yumeng Wang, Bo Xu, Ziwen Li, Han Huang, Cheng Lu, and Yandong Guo. Video object matting via hierarchical space-time semantic guidance. In WACV, 2023. 2, 3

[54] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In CVPR, 2017. 2

[55] Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. NeurIPS, 2021. 2, 3, 11

[56] Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, and Alan Yuille. Mask guided matting via progressive refinement network. In CVPR, 2021. 1, 2, 3, 5, 6, 7, 11, 13, 16, 17, 18, 19

[57] Yunke Zhang, Chi Wang, Miaomiao Cui, Peiran Ren, Xuansong Xie, Xian-Sheng Hua, Hujun Bao, Qixing Huang, and Weiwei Xu. Attention-guided temporally coherent video object matting. In ACM MM, 2021. 3, 5, 6, 7

Authors:

(1) Chuong Huynh, University of Maryland, College Park ([email protected]);

(2) Seoung Wug Oh, Adobe Research ([email protected]);

(3) Abhinav Shrivastava, University of Maryland, College Park ([email protected]);

(4) Joon-Young Lee, Adobe Research ([email protected]).


This paper is available on arXiv under the CC BY 4.0 DEED (Attribution 4.0 International) license.