Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

InstanceFormer: An Online Video Instance Segmentation Framework

Rajat Koner, Tanveer Hannan, Suprosanna Shit, Sahand Sharifzadeh, Matthias Schubert, Thomas Seidl, Volker Tresp

2022-08-22 · Semantic Segmentation · Instance Segmentation · Video Instance Segmentation

Abstract

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit their use in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage, transformer-based, efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependencies and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches on challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.
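The abstract does not give the form of the temporal contrastive loss, so the following is only a generic InfoNCE-style sketch of the idea: embeddings of the same instance in different frames are pulled together, and embeddings of other instances are pushed apart. The function name `temporal_contrastive_loss`, the cosine similarity, and the `temperature` value are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def temporal_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss for one instance embedding (illustrative only).

    anchor:    embedding of an instance in the current frame
    positives: embeddings of the same instance in other frames
    negatives: embeddings of different instances
    """
    a = normalize(anchor)
    pos = [math.exp(dot(a, normalize(p)) / temperature) for p in positives]
    neg_sum = sum(math.exp(dot(a, normalize(n)) / temperature) for n in negatives)
    # Average over positives of -log(e_pos / (e_pos + sum of negative terms)).
    return sum(-math.log(p / (p + neg_sum)) for p in pos) / len(pos)


# Toy 2-D embeddings: the same instance seen in another frame, first with a
# nearly unchanged embedding, then with one that drifted toward another instance.
coherent = temporal_contrastive_loss([1.0, 0.0], [[0.9, 0.1]], [[0.0, 1.0]])
drifted = temporal_contrastive_loss([1.0, 0.0], [[0.1, 0.9]], [[0.0, 1.0]])
print(coherent < drifted)
```

In this toy setup the loss is lower when an instance keeps a similar embedding across frames than when its embedding drifts toward a different instance, which is the kind of coherence the abstract describes.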

Results

Task                         Dataset                      Metric   Value  Model
Video Instance Segmentation  YouTube-VIS 2021             AP50     73.7   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021             AP75     56.9   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021             AR1      42.8   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021             AR10     56     InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021             mask AP  51     InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS 2021             AP50     62.4   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021             AP75     43.7   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021             AR1      36.1   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021             AR10     48.1   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2021             mask AP  40.8   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation       AP50     78     InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation       AP75     64.2   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation       AR1      50.9   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation       AR10     61.6   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation       mask AP  56.3   InstanceFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation       AP50     68.6   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation       AP75     49.6   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation       AR1      42.1   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation       AR10     53.5   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation       mask AP  45.6   InstanceFormer (ResNet-50)
Video Instance Segmentation  OVIS validation              AP50     42.5   InstanceFormer (Swin-L)
Video Instance Segmentation  OVIS validation              AP75     21.61  InstanceFormer (Swin-L)
Video Instance Segmentation  OVIS validation              AR1      12.9   InstanceFormer (Swin-L)
Video Instance Segmentation  OVIS validation              AR10     29.3   InstanceFormer (Swin-L)
Video Instance Segmentation  OVIS validation              mask AP  22.8   InstanceFormer (Swin-L)
Video Instance Segmentation  OVIS validation              AP50     40.7   InstanceFormer (ResNet-50)
Video Instance Segmentation  OVIS validation              AP75     18.1   InstanceFormer (ResNet-50)
Video Instance Segmentation  OVIS validation              AR1      12     InstanceFormer (ResNet-50)
Video Instance Segmentation  OVIS validation              AR10     27.1   InstanceFormer (ResNet-50)
Video Instance Segmentation  OVIS validation              mask AP  20     InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2022 validation  AP50_L   44.6   InstanceFormer (Swin)
Video Instance Segmentation  YouTube-VIS 2022 validation  AP75_L   27.3   InstanceFormer (Swin)
Video Instance Segmentation  YouTube-VIS 2022 validation  AR10_L   29.2   InstanceFormer (Swin)
Video Instance Segmentation  YouTube-VIS 2022 validation  AR1_L    25     InstanceFormer (Swin)
Video Instance Segmentation  YouTube-VIS 2022 validation  mAP_L    26.3   InstanceFormer (Swin)
Video Instance Segmentation  YouTube-VIS 2022 validation  AP50_L   49.5   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2022 validation  AP75_L   26.7   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2022 validation  AR10_L   30.1   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2022 validation  AR1_L    23.9   InstanceFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS 2022 validation  mAP_L    24.8   InstanceFormer (ResNet-50)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)