SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi, Graham W. Taylor

2021-01-21 · CVPR 2021
Tasks: Visual Object Tracking, Semi-Supervised Video Object Segmentation, Motion Segmentation, One-shot visual object segmentation, Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation

Abstract

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides suitable inductive bias for performing correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art. Code is available at https://github.com/dukebw/SSTVOS.
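
To make "sparse attention over spatiotemporal features" concrete, the following is a minimal sketch of one possible sparsity pattern: each pixel of the current frame attends only to a small spatial window in every past frame, rather than to all pixels of all frames. It is not the authors' implementation (see the repository linked above for that). The function name, the `radius` parameter, the nested-list feature layout, and the reuse of memory features as both keys and values are all assumptions made for illustration.

```python
import math


def local_spatiotemporal_attention(query, memory, radius=1):
    """Hypothetical local attention over a history of frames.

    query:  H x W x D nested lists, features of the current frame.
    memory: T x H x W x D nested lists, features of T past frames.
    Returns H x W x D nested lists of attended features.

    Assumption: memory features act as both keys and values.
    """
    height, width = len(query), len(query[0])
    dim = len(query[0][0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor

    output = []
    for y in range(height):
        row = []
        for x in range(width):
            q = query[y][x]
            scores, values = [], []
            # Sparse pattern: only a (2*radius+1)^2 window per past frame.
            for frame in memory:
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        k = frame[ny][nx]
                        scores.append(scale * sum(a * b for a, b in zip(q, k)))
                        values.append(k)
            # Numerically stable softmax over the selected positions.
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            row.append([
                sum(w * v[d] for w, v in zip(weights, values)) / total
                for d in range(dim)
            ])
        output.append(row)
    return output
```

With this pattern, each query compares against at most T × (2·radius + 1)² positions instead of T × H × W, which is the kind of saving that lets attention cover a history of multiple frames.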

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | YouTube-VOS 2019 | Jaccard (Seen) | 80.9 | SST |
| Video | YouTube-VOS 2019 | Jaccard (Unseen) | 76.6 | SST |
| Video | YouTube-VOS 2019 | Mean Jaccard & F-Measure | 81.8 | SST |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 80.9 | SST (Local) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 76.6 | SST (Local) |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 81.4 | SSTVOS |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 78.4 | SSTVOS |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 75.4 | SSTVOS |
| Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Seen) | 80.9 | SST |
| Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Unseen) | 76.6 | SST |
| Video Object Segmentation | YouTube-VOS 2019 | Mean Jaccard & F-Measure | 81.8 | SST |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 80.9 | SST (Local) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 76.6 | SST (Local) |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 81.4 | SSTVOS |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 78.4 | SSTVOS |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 75.4 | SSTVOS |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 81.4 | SSTVOS |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 78.4 | SSTVOS |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 75.4 | SSTVOS |
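
As a reading aid for the metrics above, here is a small sketch of how the region similarity J (the Jaccard index, or intersection over union, of predicted and ground-truth masks) can be computed, and how the DAVIS global score G relates to J and F as their mean. The helper names are hypothetical, the masks are toy nested lists of 0/1 values, and returning 1.0 when both masks are empty is an assumed convention. The boundary measure F is taken as a given number, not computed.

```python
def jaccard(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def global_score(j, f):
    """Mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0


# Toy masks: one overlapping pixel out of two covered pixels.
toy_j = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Reported DAVIS 2017 val numbers from the table: J = 75.4, F = 81.4.
davis_g = round(global_score(75.4, 81.4), 1)
```

Applied to the DAVIS 2017 validation row, the mean of J = 75.4 and F = 81.4 gives 78.4, which matches the reported G.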

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)