Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MixFormer: End-to-End Tracking with Iterative Mixed Attention

Yutao Cui, Cheng Jiang, Limin Wang, Gangshan Wu

2022-03-21 · CVPR 2022 · Tasks: Visual Object Tracking, Semi-Supervised Video Object Segmentation, Video Object Tracking
Paper · PDF · Code (official)

Abstract

Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows the tracker to extract target-specific discriminative features and perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and we propose an effective score prediction module to select high-quality templates. Our MixFormer sets new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
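The mixed-attention idea in the abstract can be sketched in plain NumPy. This is a minimal illustration under simplifying assumptions (single head, identity Q/K/V projections, no patch embedding, no localization head); the names `mixed_attention`, `template`, and `search` are hypothetical, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(template, search, asymmetric=False):
    """Jointly attend over template and search tokens.

    template: (Nt, d) target-template tokens
    search:   (Ns, d) search-region tokens
    Full mode lets every token attend to every token, mixing feature
    extraction with target-search communication in one operation.
    With asymmetric=True, template queries attend only to template
    keys, so template features are independent of the search region
    and can be cached online (the cost-saving scheme the paper
    motivates for handling multiple templates).
    """
    d = template.shape[-1]
    tokens = np.concatenate([template, search], axis=0)  # (Nt+Ns, d)
    if asymmetric:
        out_t = softmax(template @ template.T / np.sqrt(d)) @ template
        out_s = softmax(search @ tokens.T / np.sqrt(d)) @ tokens
        return np.concatenate([out_t, out_s], axis=0)
    att = softmax(tokens @ tokens.T / np.sqrt(d))
    return att @ tokens

# Toy usage: 4 template tokens and 16 search tokens, 8-dim features.
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
s = rng.normal(size=(16, 8))
out = mixed_attention(t, s, asymmetric=True)  # shape (20, 8)
```

Stacking several such blocks, with downsampling patch embeddings between stages, gives the one-stream backbone the abstract describes.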

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Object Tracking | LaSOT | AUC | 70.1 | MixFormer-L |
| Visual Object Tracking | LaSOT | Normalized Precision | 79.9 | MixFormer-L |
| Visual Object Tracking | LaSOT | Precision | 76.3 | MixFormer-L |
| Visual Object Tracking | TrackingNet | Accuracy | 83.9 | MixFormer-L |
| Visual Object Tracking | TrackingNet | Normalized Precision | 88.9 | MixFormer-L |
| Visual Object Tracking | TrackingNet | Precision | 83.1 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Average Overlap | 75.6 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 85.73 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 72.8 | MixFormer-L |
| Visual Object Tracking | GOT-10k | Average Overlap | 71.2 | MixFormer-1k |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 79.9 | MixFormer-1k |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 65.8 | MixFormer-1k |
| Visual Object Tracking | GOT-10k | Average Overlap | 70.7 | MixFormer |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 80 | MixFormer |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 67.8 | MixFormer |
| Visual Object Tracking | UAV123 | AUC | 0.704 | MixFormer |
| Visual Object Tracking | UAV123 | Precision | 0.918 | MixFormer |
| Visual Object Tracking | AVisT | Success Rate | 56 | MixFormerL-22k |
| Video Object Tracking | VOT2020 | EAO | 0.555 | MixFormer-L |
| Video Object Tracking | NT-VOT211 | AUC | 39.23 | Mixformer(ConvMAE) |
| Video Object Tracking | NT-VOT211 | Precision | 54.2 | Mixformer(ConvMAE) |
| Video Object Segmentation | VOT2020 | EAO | 0.555 | MixFormer-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.555 | MixFormer-L |

Related Papers

HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions (2025-07-01)
Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking (2025-06-30)
R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning (2025-06-27)
THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation (2025-06-07)
Fully Spiking Neural Networks for Unified Frame-Event Object Tracking (2025-05-27)
Progressive Scaling Visual Object Tracking (2025-05-26)
Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking (2025-05-23)