Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

Adrià Caelles, Tim Meinhardt, Guillem Brasó, Laura Leal-Taixé

2022-07-22
Tasks: Segmentation, Semantic Segmentation, Instance Segmentation, Video Instance Segmentation, Object Detection

Abstract

Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design, hence missing out on a joint solution. Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods leads to long training times, high memory requirements, and processing of low-resolution, single-scale feature maps. Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method which capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory as well as training time requirements, and achieves state-of-the-art results on YouTube-VIS 2021 as well as the challenging OVIS dataset. Code is available at https://github.com/acaelles97/DeVIS.
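To make the efficiency argument in the abstract concrete, the following is a minimal counting sketch, not the DeVIS implementation. It compares how the number of attention interactions grows for dense self-attention over a clip versus a deformable scheme in which every query samples a fixed number of points per feature level in each frame. The function names, the default of 4 sampling points, and the single-level simplification are assumptions made for illustration only.

```python
def full_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs for dense self-attention over every token of a clip."""
    tokens = frames * height * width
    return tokens * tokens  # quadratic in the number of tokens


def deformable_attention_samples(
    frames: int, height: int, width: int, points: int = 4, levels: int = 1
) -> int:
    """Sampled locations when each query attends to a fixed number of points
    per feature level in every frame (hypothetical, simplified setup)."""
    queries = frames * height * width
    return queries * points * levels * frames  # linear in spatial size


if __name__ == "__main__":
    # Doubling the spatial resolution of a 2-frame clip.
    for side in (10, 20):
        dense = full_attention_pairs(2, side, side)
        sparse = deformable_attention_samples(2, side, side)
        print(f"{side}x{side}: dense={dense}, deformable={sparse}")
```

Under these assumptions, doubling the feature-map side length multiplies the dense count by 16 but the deformable count by only 4, which is one way to see why higher-resolution, multi-scale feature maps become affordable.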

Results

Task                         Dataset                 Metric   Value  Model
Video Instance Segmentation  YouTube-VIS 2021        AP50     77.7   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021        AP75     59.8   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021        AR1      43.8   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021        AR10     57.8   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021        mask AP  54.4   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021        AP50     66.8   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021        AP75     46.6   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021        AR1      38     DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021        AR10     50.1   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021        mask AP  43.1   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AP50     80.8   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AP75     66.3   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AR1      50.8   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AR10     61     DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  mask AP  57.1   DeVIS (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AP50     66.7   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AP75     48.6   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AR1      42.4   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AR10     51.6   DeVIS (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  mask AP  44.4   DeVIS (ResNet-50)
Video Instance Segmentation  OVIS validation         AP50     59.3   DeVIS (Swin-L)
Video Instance Segmentation  OVIS validation         AP75     38.3   DeVIS (Swin-L)
Video Instance Segmentation  OVIS validation         AR1      16.6   DeVIS (Swin-L)
Video Instance Segmentation  OVIS validation         AR10     39.8   DeVIS (Swin-L)
Video Instance Segmentation  OVIS validation         mask AP  35.5   DeVIS (Swin-L)
Video Instance Segmentation  OVIS validation         AP50     47.6   DeVIS (ResNet-50)
Video Instance Segmentation  OVIS validation         AP75     20.8   DeVIS (ResNet-50)
Video Instance Segmentation  OVIS validation         AR1      12     DeVIS (ResNet-50)
Video Instance Segmentation  OVIS validation         AR10     28.9   DeVIS (ResNet-50)
Video Instance Segmentation  OVIS validation         mask AP  23.7   DeVIS (ResNet-50)
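
As a small reading aid for the results above, the sketch below computes the mask AP difference between the Swin-L and ResNet-50 backbones on each dataset. The numbers are copied from the listed results; the dictionary layout and the `backbone_gain` helper are hypothetical names introduced only for this illustration.

```python
# Mask AP values as listed in the results for each dataset and backbone.
MASK_AP = {
    "YouTube-VIS 2021": {"ResNet-50": 43.1, "Swin-L": 54.4},
    "YouTube-VIS validation": {"ResNet-50": 44.4, "Swin-L": 57.1},
    "OVIS validation": {"ResNet-50": 23.7, "Swin-L": 35.5},
}


def backbone_gain(scores: dict) -> float:
    """Mask AP difference (Swin-L minus ResNet-50), rounded to one decimal."""
    return round(scores["Swin-L"] - scores["ResNet-50"], 1)


# Hypothetical summary: per-dataset gain from the larger backbone.
gains = {dataset: backbone_gain(scores) for dataset, scores in MASK_AP.items()}

if __name__ == "__main__":
    for dataset, gain in gains.items():
        print(f"{dataset}: +{gain} mask AP with Swin-L")
```

The gap is between 11 and 13 mask AP points on all three datasets, so the larger backbone helps by a similar margin on the occlusion-heavy OVIS data as on YouTube-VIS.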
