DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan

Published: 2023-04-02
Venue: CVPR 2023
Tasks: Visual Object Tracking, Semantic Segmentation, Video Object Segmentation, Object Tracking, Video Semantic Segmentation

Abstract

In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2× faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git.
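To make the idea of spatial-attention dropout concrete, the following is a minimal sketch and not the authors' implementation. It assumes tokens from two frames are concatenated into one sequence, and it drops within-frame (spatial) attention logits uniformly at random before the softmax, so that the remaining attention mass shifts toward tokens of the other frame. The abstract describes the real dropout as adaptive; the uniform drop_prob used here, the choice to always keep each token's attention to itself, and the names spatial_attention_dropout, frame_ids, and drop_prob are all illustrative assumptions.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax; entries set to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_dropout(logits, frame_ids, drop_prob, rng):
    """Randomly drop within-frame attention logits, then softmax each row.

    logits    : N x N list of attention scores (query i, key j)
    frame_ids : length-N list; frame_ids[i] is the frame of token i
    drop_prob : chance of dropping each within-frame logit
                (assumption: uniform, not adaptive as in the paper)
    rng       : random.Random instance, passed in for reproducibility
    """
    n = len(logits)
    attention = []
    for i in range(n):
        row = list(logits[i])
        for j in range(n):
            same_frame = frame_ids[i] == frame_ids[j]
            # Assumption: keep the diagonal so a row is never fully masked.
            if same_frame and i != j and rng.random() < drop_prob:
                row[j] = -math.inf
        attention.append(softmax(row))
    return attention


# Two frames with three tokens each and uniform attention scores.
frame_ids = [0, 0, 0, 1, 1, 1]
logits = [[0.0] * 6 for _ in range(6)]
attn = spatial_attention_dropout(logits, frame_ids, 1.0, random.Random(0))
temporal_mass = sum(attn[0][3:])  # attention from token 0 to the other frame
```

With drop_prob set to 1.0, every within-frame key except the token itself is masked, so token 0 spreads its attention over itself and the three tokens of the other frame, and temporal_mass rises from 0.5 (no dropout) to 0.75. Smaller values of drop_prob give a milder shift of the same kind.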

Results

Task	Dataset	Metric	Value	Model
Object Tracking	TNL2K	AUC	56.9	DropTrack
Object Tracking	TNL2K	Precision	57.9	DropTrack
Object Tracking	LaSOT	AUC	71.8	DropTrack
Object Tracking	LaSOT	Normalized Precision	81.8	DropTrack
Object Tracking	LaSOT	Precision	78.1	DropTrack
Object Tracking	GOT-10k	Average Overlap	75.9	DropMAE
Object Tracking	GOT-10k	Success Rate 0.5	86.8	DropMAE
Object Tracking	GOT-10k	Success Rate 0.75	72	DropMAE
Object Tracking	LaSOT-ext	AUC	52.7	DropTrack
Object Tracking	LaSOT-ext	Precision	60.2	DropTrack
Object Tracking	ITB	AUC	65	DropTrack
Object Tracking	TrackingNet	AUC	84.1	DropTrack
Object Tracking	TrackingNet	Normalized Precision	88.9	DropTrack
Visual Object Tracking	TNL2K	AUC	56.9	DropTrack
Visual Object Tracking	TNL2K	Precision	57.9	DropTrack
Visual Object Tracking	LaSOT	AUC	71.8	DropTrack
Visual Object Tracking	LaSOT	Normalized Precision	81.8	DropTrack
Visual Object Tracking	LaSOT	Precision	78.1	DropTrack
Visual Object Tracking	GOT-10k	Average Overlap	75.9	DropMAE
Visual Object Tracking	GOT-10k	Success Rate 0.5	86.8	DropMAE
Visual Object Tracking	GOT-10k	Success Rate 0.75	72	DropMAE
Visual Object Tracking	LaSOT-ext	AUC	52.7	DropTrack
Visual Object Tracking	LaSOT-ext	Precision	60.2	DropTrack
Visual Object Tracking	ITB	AUC	65	DropTrack
Visual Object Tracking	TrackingNet	AUC	84.1	DropTrack
Visual Object Tracking	TrackingNet	Normalized Precision	88.9	DropTrack
