
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SeqFormer: Sequential Transformer for Video Instance Segmentation

Junfeng Wu, Yi Jiang, Song Bai, Wenqing Zhang, Xiang Bai

Published: 2021-12-15
Tasks: Semantic Segmentation, Instance Segmentation, Video Instance Segmentation

Abstract

In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms shall be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, integrated with the recently-proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation, and in the meantime, advances this field with a more robust, accurate, neat model. The code is available at https://github.com/wjf5203/SeqFormer.
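The aggregation idea in the abstract (one instance query per video, attention applied to each frame separately, then temporal pooling into a single video-level representation that drives the per-frame masks) can be illustrated with a minimal sketch. It is not the authors' implementation. The softmax-weighted sum, the dot-product mask head, and the names `aggregate_instance`, `predict_mask_logits`, and `weight_proj` are all assumptions made for illustration. The official code is at the repository linked in the abstract.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def aggregate_instance(frame_embeddings: List[Vector],
                       weight_proj: Vector) -> Tuple[Vector, List[float]]:
    """Pool the per-frame embeddings of ONE instance into a video-level vector.

    frame_embeddings: one vector per frame, each produced independently
        (standing in for per-frame attention, which is not modelled here).
    weight_proj: hypothetical learned vector that scores each frame.
    """
    weights = softmax([dot(e, weight_proj) for e in frame_embeddings])
    dim = len(frame_embeddings[0])
    video_embedding = [
        sum(w * e[d] for w, e in zip(weights, frame_embeddings))
        for d in range(dim)
    ]
    return video_embedding, weights


def predict_mask_logits(video_embedding: Vector,
                        frame_features: List[List[Vector]]) -> List[List[float]]:
    """Apply the same video-level embedding to every frame's pixel features.

    Reusing one embedding for all frames is what ties the masks together
    across time in this sketch, so no separate tracking step is needed.
    """
    return [[dot(video_embedding, pixel) for pixel in frame]
            for frame in frame_features]


# Toy example: 3 frames, 2-dimensional embeddings, 2 pixels per frame.
frame_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight_proj = [0.0, 0.0]  # zero projection gives uniform frame weights
video_embedding, weights = aggregate_instance(frame_embeddings, weight_proj)

frame_features = [
    [[1.0, 0.0], [0.0, 3.0]],
    [[3.0, 3.0], [0.0, 0.0]],
    [[1.5, 1.5], [-3.0, 0.0]],
]
mask_logits = predict_mask_logits(video_embedding, frame_features)
print(weights)
print(video_embedding)
print(mask_logits)
```

With a zero `weight_proj` every frame receives equal weight, so the video-level embedding is the plain average of the per-frame embeddings. A trained projection would instead emphasize frames where the instance is seen more clearly.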

Results

Task                         Dataset                 Metric            Value  Model
Video Instance Segmentation  YouTube-VIS validation  AP50              82.1   SeqFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AP75              66.4   SeqFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AR1               51.7   SeqFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AR10              64.4   SeqFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  mask AP           59.3   SeqFormer (Swin-L)
Video Instance Segmentation  YouTube-VIS validation  AP50              71.1   SeqFormer (ResNet-101)
Video Instance Segmentation  YouTube-VIS validation  AP75              55.7   SeqFormer (ResNet-101)
Video Instance Segmentation  YouTube-VIS validation  AR1               46.8   SeqFormer (ResNet-101)
Video Instance Segmentation  YouTube-VIS validation  AR10              56.9   SeqFormer (ResNet-101)
Video Instance Segmentation  YouTube-VIS validation  mask AP           49.0   SeqFormer (ResNet-101)
Video Instance Segmentation  YouTube-VIS validation  AP50              69.8   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AP75              51.8   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AR1               45.5   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AR10              54.8   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  mask AP           47.4   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AP50              66.9   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AP75              50.5   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AR1               45.6   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  AR10              54.6   SeqFormer (ResNet-50)
Video Instance Segmentation  YouTube-VIS validation  mask AP           45.1   SeqFormer (ResNet-50)
Video Instance Segmentation  HQ-YTVIS                Tube-Boundary AP  43.3   SeqFormer (Swin-L)
