Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DVIS: Decoupled Video Instance Segmentation Framework

Tao Zhang, Xingye Tian, Yu Wu, Shunping Ji, Xuebo Wang, Yuan Zhang, Pengfei Wan

2023-06-06 · ICCV 2023

Tasks: Video Editing, Video Panoptic Segmentation, Segmentation, Autonomous Driving, Semantic Segmentation, Instance Segmentation, Video Instance Segmentation

Paper · PDF · Code (official)

Abstract

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex, long real-world videos, primarily for two reasons. First, offline methods are limited by a tightly-coupled modeling paradigm that treats all frames equally and disregards the interdependencies between adjacent frames, introducing excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) effectively exploiting temporal information, predicated on those accurate alignment results, during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter's FLOPs), allowing for efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
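The decoupling strategy in the abstract — per-frame segmentation, frame-by-frame association during tracking, then temporal refinement over the aligned tracks — can be sketched as follows. This is a toy illustration of the three-stage pipeline only; every function, data shape, and the nearest-embedding matching rule here are hypothetical placeholders, not the DVIS implementation.

```python
# Toy sketch of a decoupled VIS pipeline: segment -> track -> refine.
# Instances are represented as single float "embeddings" for brevity.

def segment(frame):
    # Stand-in for a per-frame segmenter: returns one instance
    # embedding per detected object in this frame.
    return sorted(frame)

def track(per_frame_instances):
    # Frame-by-frame association: greedily match each frame's instances
    # to the previous frame's by nearest embedding, building aligned
    # tracks (one list of embeddings per object, ordered by time).
    tracks = [[inst] for inst in per_frame_instances[0]]
    for frame_insts in per_frame_instances[1:]:
        remaining = list(frame_insts)
        for tr in tracks:
            best = min(remaining, key=lambda x: abs(x - tr[-1]))
            tr.append(best)
            remaining.remove(best)
    return tracks

def refine(tracks):
    # Temporal refinement over the aligned tracks: here, simply smooth
    # each track by averaging its embeddings over time.
    return [sum(tr) / len(tr) for tr in tracks]

frames = [[0.9, 5.1], [1.1, 4.8], [1.0, 5.0]]  # toy "video", 2 objects
per_frame = [segment(f) for f in frames]        # stage 1: segmentation
tracks = track(per_frame)                       # stage 2: tracking
smoothed = refine(tracks)                       # stage 3: refinement
```

The point of the decoupling is visible even in this sketch: refinement only sees tracks that are already temporally aligned, so its temporal aggregation cannot mix up different objects — which is why the paper argues accurate frame-by-frame association must come first.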

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Panoptic Segmentation | VIPSeg | STQ | 55.3 | DVIS (Swin-L) |
| Video Panoptic Segmentation | VIPSeg | VPQ | 57.6 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 83 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 68.4 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 47.7 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 65.7 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 60.1 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS validation | AP50 | 88 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | AP75 | 72.7 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | AR1 | 56.5 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | AR10 | 70.3 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | mask AP | 64.9 | DVIS |
| Video Instance Segmentation | OVIS validation | AP50 | 75.9 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AP75 | 53 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AR1 | 19.4 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AR10 | 55.3 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | mask AP | 49.9 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AP50 | 71.9 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | AP75 | 49.2 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | AR1 | 19.4 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | AR10 | 52.5 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | mask AP | 47.1 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AP50_L | 69 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AP75_L | 48.8 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AR10_L | 51.8 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AR1_L | 37.2 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | mAP_L | 45.9 | DVIS (Swin-L) |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
- AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)