
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DVIS++: Improved Decoupled Framework for Universal Video Segmentation

Tao Zhang, Xingye Tian, Yikang Zhou, Shunping Ji, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu Wu

2023-12-20

Tasks: Denoising, Panoptic Segmentation, Video Panoptic Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Contrastive Learning, Instance Segmentation, Video Semantic Segmentation, Video Instance Segmentation

Abstract

We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupled design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. Furthermore, we evaluate DVIS++ in various settings, including open vocabulary and using a frozen pre-trained backbone. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in both closed- and open-vocabulary settings. Code: https://github.com/zhang-tao-whu/DVIS_Plus
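To make the segmentation, tracking, and refinement cascade concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the real referring tracker and temporal refiner are learned modules, whereas this sketch swaps in cosine-similarity matching for tracking and a per-object temporal mean for refinement. The function names `track` and `refine`, the 2-D embeddings, and the assumption that every frame contains the same number of objects are all hypothetical choices made for illustration. The segmentation stage is assumed to have already produced one embedding per object per frame.

```python
from itertools import permutations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def track(frames):
    """Stage 2 stand-in: reorder each frame's object embeddings so that
    slot i follows the same object as slot i in the previous frame.

    Brute-force matching over permutations; only sensible for a handful
    of objects. Assumes every frame has the same number of objects.
    """
    aligned = [frames[0]]
    for cur in frames[1:]:
        prev = aligned[-1]
        best = max(
            permutations(range(len(cur))),
            key=lambda p: sum(cosine(prev[i], cur[p[i]]) for i in range(len(prev))),
        )
        aligned.append([cur[j] for j in best])
    return aligned


def refine(aligned):
    """Stage 3 stand-in: average each object's embedding over all frames.
    This only works because tracking has pre-aligned the slots."""
    n_frames = len(aligned)
    n_slots = len(aligned[0])
    dim = len(aligned[0][0])
    return [
        [sum(frame[s][d] for frame in aligned) / n_frames for d in range(dim)]
        for s in range(n_slots)
    ]


# Stage 1 stand-in: per-frame object embeddings, in arbitrary order per frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],  # same two objects, listed in swapped order
    [[1.0, 0.2], [0.2, 1.0]],
]

aligned = track(frames)
refined = refine(aligned)
print(aligned[1])
print(refined)
```

The point of the sketch is the ordering of the stages: refinement operates on features that tracking has already aligned across frames, which is the property the abstract attributes to the decoupled design.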

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Scene Parsing | VSPW | mIoU | 63.8 | DVIS++(ViT-L) |
| Semantic Segmentation | VIPSeg | STQ | 56 | DVIS++(ViT-L) |
| Semantic Segmentation | VIPSeg | VPQ | 58 | DVIS++(ViT-L) |
| Video Semantic Segmentation | VSPW | mIoU | 63.8 | DVIS++(ViT-L) |
| Scene Understanding | VSPW | mIoU | 63.8 | DVIS++(ViT-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 86.7 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 71.5 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 48.8 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 69.5 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 63.9 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 82.7 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 70.2 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 49.5 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 68 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 62.3 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS validation | AP50 | 88.8 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS validation | AP75 | 75.3 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS validation | AR1 | 57.9 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS validation | AR10 | 73.7 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | YouTube-VIS validation | mask AP | 67.7 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | AP50 | 78.9 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | OVIS validation | AP75 | 58.5 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | OVIS validation | mask AP | 53.4 | DVIS++(ViT-L, Offline) |
| Video Instance Segmentation | OVIS validation | AP50 | 72.5 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | AP75 | 55 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | APho | 27.1 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | APmo | 56.6 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | APso | 69.9 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | AR1 | 20.8 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | AR10 | 54.6 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | mask AP | 49.6 | DVIS++(ViT-L, Online) |
| Video Instance Segmentation | OVIS validation | AP50 | 68.9 | DVIS++(R50, Offline) |
| Video Instance Segmentation | OVIS validation | AP75 | 40.9 | DVIS++(R50, Offline) |
| Video Instance Segmentation | OVIS validation | AR1 | 16.8 | DVIS++(R50, Offline) |
| Video Instance Segmentation | OVIS validation | AR10 | 47.3 | DVIS++(R50, Offline) |
| Video Instance Segmentation | OVIS validation | mask AP | 41.2 | DVIS++(R50, Offline) |
| Video Instance Segmentation | OVIS validation | AP50 | 62.8 | DVIS++(R50, Online) |
| Video Instance Segmentation | OVIS validation | AP75 | 37.3 | DVIS++(R50, Online) |
| Video Instance Segmentation | OVIS validation | AR1 | 15.8 | DVIS++(R50, Online) |
| Video Instance Segmentation | OVIS validation | AR10 | 42.9 | DVIS++(R50, Online) |
| Video Instance Segmentation | OVIS validation | mask AP | 37.2 | DVIS++(R50, Online) |
| Video Instance Segmentation | Youtube-VIS 2022 Validation | AP50_L | 75.7 | DVIS++(ViT-L) |
| Video Instance Segmentation | Youtube-VIS 2022 Validation | AP75_L | 52.8 | DVIS++(ViT-L) |
| Video Instance Segmentation | Youtube-VIS 2022 Validation | AR10_L | 55.8 | DVIS++(ViT-L) |
| Video Instance Segmentation | Youtube-VIS 2022 Validation | AR1_L | 40.6 | DVIS++(ViT-L) |
| Video Instance Segmentation | Youtube-VIS 2022 Validation | mAP_L | 50.9 | DVIS++(ViT-L) |
| 2D Semantic Segmentation | VSPW | mIoU | 63.8 | DVIS++(ViT-L) |
| 10-shot image generation | VIPSeg | STQ | 56 | DVIS++(ViT-L) |
| 10-shot image generation | VIPSeg | VPQ | 58 | DVIS++(ViT-L) |
| Panoptic Segmentation | VIPSeg | STQ | 56 | DVIS++(ViT-L) |
| Panoptic Segmentation | VIPSeg | VPQ | 58 | DVIS++(ViT-L) |
