GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation

Tanveer Hannan, Rajat Koner, Maximilian Bernhard, Suprosanna Shit, Bjoern Menze, Volker Tresp, Matthias Schubert, Thomas Seidl

2023-05-26 | Semantic Segmentation | Instance Segmentation | Video Instance Segmentation

Abstract

Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, online methods suffer from representation degradation and noise accumulation, especially during occlusion and abrupt changes, which poses substantial challenges. Transformer-based query propagation offers a promising direction, but at the cost of quadratic memory attention. Such methods are also susceptible to the degradation of instance features caused by the above-mentioned challenges, and they suffer from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. First, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Based on the gate activation, we then rectify degraded features using their past representations. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Second, we propose a novel inter-instance interaction that uses the gate activation as a mask for self-attention. This masking strategy dynamically restricts unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking. We refer to this combination of Gated Residual Connection and Masked Self-Attention as the GRAtt block, which can easily be integrated into existing propagation-based frameworks. Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods. Code is available at https://github.com/Tanveer81/GRAttVIS.
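
The gating idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the official repository for that); it is a toy version under stated assumptions: the gate is a two-class hard Gumbel-Softmax sample, a gate value of 1 means "keep the current-frame query" and 0 means "fall back to the previous-frame query", and gated-off queries are hidden as keys in self-attention while every query may still attend to itself. The function names `gumbel_softmax_gate`, `masked_self_attention`, and `gratt_block` are hypothetical, and no learned projections are included.

```python
import numpy as np


def gumbel_softmax_gate(logits, rng, tau=1.0):
    """Hard two-class Gumbel-Softmax sample (forward pass only).

    logits: (N, 2) array, column 0 = "reuse previous", column 1 = "keep current".
    Returns a float array of shape (N,) with values in {0.0, 1.0}.
    """
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    perturbed = (logits + gumbel_noise) / tau
    return np.argmax(perturbed, axis=-1).astype(np.float64)


def masked_self_attention(queries, gate):
    """Single-head self-attention where gated-off queries are hidden as keys.

    Assumption: each query may always attend to itself, so no row is empty.
    Returns (attended features, attention weights).
    """
    n, d = queries.shape
    scores = queries @ queries.T / np.sqrt(d)
    allowed = np.broadcast_to(gate[None, :] > 0, (n, n)).copy()
    np.fill_diagonal(allowed, True)
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ queries, weights


def gratt_block(q_current, q_previous, gate):
    """Gated residual rectification followed by gate-masked self-attention."""
    g = gate[:, None]
    # Gate = 1 keeps the current query; gate = 0 restores the previous one.
    rectified = g * q_current + (1.0 - g) * q_previous
    attended, weights = masked_self_attention(rectified, gate)
    return rectified, attended, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_cur = rng.normal(size=(3, 4))   # 3 instance queries, 4 channels
    q_prev = rng.normal(size=(3, 4))
    gate = gumbel_softmax_gate(rng.normal(size=(3, 2)), rng)
    rectified, attended, weights = gratt_block(q_cur, q_prev, gate)
    print("gate:", gate)
    print("attention weights:\n", weights.round(3))
```

In this sketch the residual path needs no separate memory bank: a query that the gate rejects simply carries its previous-frame value forward, which is the property the abstract attributes to the residual configuration.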

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 81.3 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 67.1 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 48.8 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 64.5 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 60.3 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 69.2 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 53.1 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 41.8 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 56 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 48.9 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | OVIS validation | AP50 | 69.1 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | OVIS validation | AP75 | 47.8 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | OVIS validation | AR1 | 19.2 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | OVIS validation | AR10 | 49.4 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | OVIS validation | mask AP | 45.7 | GRAtt-VIS (Swin-L) |
| Video Instance Segmentation | OVIS validation | AP50 | 60.8 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | OVIS validation | AP75 | 36.8 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | OVIS validation | AR1 | 16.8 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | OVIS validation | AR10 | 40.1 | GRAtt-VIS (ResNet-50) |
| Video Instance Segmentation | OVIS validation | mask AP | 36.2 | GRAtt-VIS (ResNet-50) |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)