
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan

Published: 2023-04-02 (CVPR 2023)
Tasks: Visual Object Tracking, Semantic Segmentation, Video Object Segmentation, Object Tracking, Video Semantic Segmentation

Abstract

In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better finetuning results on matching-based tasks than the ImageNet-based MAE with 2X faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git.
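The idea of dropping spatial attention so that reconstruction leans on the other frame can be illustrated with a small sketch. This is not the authors' implementation: the function names (`spatial_attention_dropout`, `temporal_mass`), the `frame_ids` and `drop_ratio` parameters, and the uniform random choice of which logits to drop are all assumptions made for illustration. In particular, the adaptive selection described in the abstract is replaced here by plain random dropping. Same-frame attention logits are set to negative infinity before the softmax, so the remaining attention mass shifts toward tokens from the other frame.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats (-inf entries get weight 0)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_dropout(logits, frame_ids, drop_ratio, rng):
    """Hypothetical sketch: randomly mask same-frame attention logits, then softmax.

    logits     -- square list of lists, logits[i][j] is the score of query i on key j
    frame_ids  -- frame index of each token (tokens of two frames are concatenated)
    drop_ratio -- probability of dropping each same-frame (spatial) logit
    rng        -- random.Random instance, passed in for reproducibility
    """
    n = len(logits)
    attention = []
    for i in range(n):
        row = list(logits[i])
        for j in range(n):
            same_frame = frame_ids[i] == frame_ids[j]
            # Keep the token's attention to itself; cross-frame logits are never dropped.
            if same_frame and i != j and rng.random() < drop_ratio:
                row[j] = float("-inf")
        attention.append(softmax(row))
    return attention


def temporal_mass(attention, frame_ids):
    """Average share of attention that each token places on the other frame."""
    shares = []
    for i, row in enumerate(attention):
        shares.append(sum(w for j, w in enumerate(row) if frame_ids[j] != frame_ids[i]))
    return sum(shares) / len(shares)


# Toy example: two tokens per frame, uniform logits.
toy_logits = [[0.0] * 4 for _ in range(4)]
toy_frames = [0, 0, 1, 1]
no_drop = spatial_attention_dropout(toy_logits, toy_frames, 0.0, random.Random(0))
full_drop = spatial_attention_dropout(toy_logits, toy_frames, 1.0, random.Random(0))
print(temporal_mass(no_drop, toy_frames), temporal_mass(full_drop, toy_frames))
```

In the toy example, dropping every same-frame logit raises the cross-frame share of attention from one half to two thirds, which is the qualitative effect the abstract attributes to spatial-attention dropout.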

Results

Task | Dataset | Metric | Value | Model
Object Tracking | TNL2K | AUC | 56.9 | DropTrack
Object Tracking | TNL2K | Precision | 57.9 | DropTrack
Object Tracking | LaSOT | AUC | 71.8 | DropTrack
Object Tracking | LaSOT | Normalized Precision | 81.8 | DropTrack
Object Tracking | LaSOT | Precision | 78.1 | DropTrack
Object Tracking | GOT-10k | Average Overlap | 75.9 | DropMAE
Object Tracking | GOT-10k | Success Rate 0.5 | 86.8 | DropMAE
Object Tracking | GOT-10k | Success Rate 0.75 | 72 | DropMAE
Object Tracking | LaSOT-ext | AUC | 52.7 | DropTrack
Object Tracking | LaSOT-ext | Precision | 60.2 | DropTrack
Object Tracking | ITB | AUC | 0.65 | DropTrack
Object Tracking | TrackingNet | AUC | 0.841 | DropTrack
Object Tracking | TrackingNet | Normalized Precision | 88.9 | DropTrack
Visual Object Tracking | TNL2K | AUC | 56.9 | DropTrack
Visual Object Tracking | TNL2K | Precision | 57.9 | DropTrack
Visual Object Tracking | LaSOT | AUC | 71.8 | DropTrack
Visual Object Tracking | LaSOT | Normalized Precision | 81.8 | DropTrack
Visual Object Tracking | LaSOT | Precision | 78.1 | DropTrack
Visual Object Tracking | GOT-10k | Average Overlap | 75.9 | DropMAE
Visual Object Tracking | GOT-10k | Success Rate 0.5 | 86.8 | DropMAE
Visual Object Tracking | GOT-10k | Success Rate 0.75 | 72 | DropMAE
Visual Object Tracking | LaSOT-ext | AUC | 52.7 | DropTrack
Visual Object Tracking | LaSOT-ext | Precision | 60.2 | DropTrack
Visual Object Tracking | ITB | AUC | 0.65 | DropTrack
Visual Object Tracking | TrackingNet | AUC | 0.841 | DropTrack
Visual Object Tracking | TrackingNet | Normalized Precision | 88.9 | DropTrack
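
For readers unfamiliar with the GOT-10k metrics listed above, the following sketch shows how Average Overlap and Success Rate at a threshold are commonly computed for a single sequence: Average Overlap is the mean intersection-over-union (IoU) between predicted and ground-truth boxes, and Success Rate is the fraction of frames whose IoU exceeds the threshold. The function names and the `(x, y, width, height)` box format are assumptions for illustration; the official benchmark toolkit may differ in details such as how sequences are aggregated, and the leaderboard values appear to be reported as percentages rather than fractions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(predictions, ground_truths):
    """Mean IoU over all frames of one sequence."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(overlaps) / len(overlaps)


def success_rate(predictions, ground_truths, threshold):
    """Fraction of frames whose IoU is strictly greater than the threshold."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


# Toy sequence: a perfect hit, a half-shifted box, and a complete miss.
gts = [(0, 0, 10, 10)] * 3
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
print(average_overlap(preds, gts))      # mean of IoUs 1, 1/3 and 0
print(success_rate(preds, gts, 0.5))    # only the first frame exceeds 0.5
print(success_rate(preds, gts, 0.75))   # only the first frame exceeds 0.75
```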
