Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MixFormer: End-to-End Tracking with Iterative Mixed Attention

Yutao Cui, Cheng Jiang, Limin Wang, Gangshan Wu

2022-03-21 · CVPR 2022 · Tasks: Visual Object Tracking, Semi-Supervised Video Object Segmentation, Video Object Tracking
Paper · PDF · Code (official)

Abstract

Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows the tracker to extract target-specific discriminative features and perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and we propose an effective score prediction module to select high-quality templates. Our MixFormer sets new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
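The mixed-attention idea in the abstract can be sketched in plain NumPy. This is a minimal illustration under simplifying assumptions (single head, identity Q/K/V projections, no patch embedding, no localization head); the names `mixed_attention`, `template`, and `search` are hypothetical, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(template, search, asymmetric=False):
    """Jointly attend over template and search tokens.

    template: (Nt, d) target-template tokens
    search:   (Ns, d) search-region tokens
    Full mode lets every token attend to every token, mixing feature
    extraction with target-search communication in one operation.
    With asymmetric=True, template queries attend only to template
    keys, so template features are independent of the search region
    and can be cached online (the cost-saving scheme the paper
    motivates for handling multiple templates).
    """
    d = template.shape[-1]
    tokens = np.concatenate([template, search], axis=0)  # (Nt+Ns, d)
    if asymmetric:
        out_t = softmax(template @ template.T / np.sqrt(d)) @ template
        out_s = softmax(search @ tokens.T / np.sqrt(d)) @ tokens
        return np.concatenate([out_t, out_s], axis=0)
    att = softmax(tokens @ tokens.T / np.sqrt(d))
    return att @ tokens

# Toy usage: 4 template tokens and 16 search tokens, 8-dim features.
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
s = rng.normal(size=(16, 8))
out = mixed_attention(t, s, asymmetric=True)  # shape (20, 8)
```

Stacking several such blocks, with downsampling patch embeddings between stages, gives the one-stream backbone the abstract describes.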

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Object Tracking | LaSOT | AUC | 70.1 | MixFormer-L |
| Visual Object Tracking | LaSOT | Normalized Precision | 79.9 | MixFormer-L |
| Visual Object Tracking | LaSOT | Precision | 76.3 | MixFormer-L |
| Visual Object Tracking | TrackingNet | Accuracy | 83.9 | MixFormer-L |
| Visual Object Tracking | TrackingNet | Normalized Precision | 88.9 | MixFormer-L |
| Visual Object Tracking | TrackingNet | Precision | 83.1 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Average Overlap | 75.6 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 85.73 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 72.8 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Average Overlap | 71.2 | MixFormer-1k |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 79.9 | MixFormer-1k |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 65.8 | MixFormer-1k |
| Visual Object Tracking | GOT-10k | Average Overlap | 70.7 | MixFormer |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 80 | MixFormer |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 67.8 | MixFormer |
| Visual Object Tracking | UAV123 | AUC | 0.704 | MixFormer |
| Visual Object Tracking | UAV123 | Precision | 0.918 | MixFormer |
| Visual Object Tracking | AVisT | Success Rate | 56 | MixFormerL-22k |
| Video Object Tracking | VOT2020 | EAO | 0.555 | MixFormer-L |
| Video Object Tracking | NT-VOT211 | AUC | 39.23 | Mixformer(ConvMAE) |
| Video Object Tracking | NT-VOT211 | Precision | 54.2 | Mixformer(ConvMAE) |
| Video Object Segmentation | VOT2020 | EAO | 0.555 | MixFormer-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.555 | MixFormer-L |

Related Papers

HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions (2025-07-01)
Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking (2025-06-30)
R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning (2025-06-27)
THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation (2025-06-07)
Fully Spiking Neural Networks for Unified Frame-Event Object Tracking (2025-05-27)
Progressive Scaling Visual Object Tracking (2025-05-26)
Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking (2025-05-23)