Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MixFormer: End-to-End Tracking with Iterative Mixed Attention

Yutao Cui, Cheng Jiang, Gangshan Wu, Limin Wang

2023-02-06 · Visual Object Tracking · Object Tracking
Paper · PDF · Code (official)

Abstract

Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, in this paper we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme extracts target-specific discriminative features and performs extensive communication between the target and the search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers: a hierarchical tracker, MixCvT, and a non-hierarchical tracker, MixViT. For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors of supervised and self-supervised pre-training in our MixFormer trackers. We also extend masked pre-training to our MixFormer trackers and design the competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set a new state of the art on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100, and UAV123. In particular, our MixViT-L achieves an AUC score of 73.3% on LaSOT and 86.1% on TrackingNet, an EAO of 0.584 on VOT2020, and an AO of 75.7% on GOT-10k. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
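The core idea of mixed attention is that template (target) tokens and search-region tokens are concatenated and processed by a single attention operation, so feature extraction and target-search communication happen in one step; the asymmetric variant lets template queries attend only to template keys to cut cost. Below is a minimal single-head NumPy sketch of that idea, not the official implementation (which lives in the linked repository); the weight shapes and function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(target, search, Wq, Wk, Wv, asymmetric=True):
    """One mixed-attention step over concatenated target and search tokens.

    target: (T, d) template tokens; search: (S, d) search-region tokens.
    With asymmetric=True, target queries attend only to target keys
    (the cost-saving scheme sketched in the abstract), while search
    queries attend to the full concatenation (the "mixed" part).
    """
    tokens = np.concatenate([target, search], axis=0)  # (T+S, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = q.shape[-1]
    T = target.shape[0]
    if asymmetric:
        # Target-to-target attention only: O(T^2) instead of O(T*(T+S)).
        attn_t = softmax(q[:T] @ k[:T].T / np.sqrt(d))
        out_t = attn_t @ v[:T]
        # Search-to-all attention: search tokens read from both streams.
        attn_s = softmax(q[T:] @ k.T / np.sqrt(d))
        out_s = attn_s @ v
        return out_t, out_s
    # Full mixed attention: every token attends to every token.
    attn = softmax(q @ k.T / np.sqrt(d))
    out = attn @ v
    return out[:T], out[T:]
```

Note that the asymmetric scheme leaves the search-token outputs unchanged relative to full mixed attention (their queries still see all keys); only the template-token outputs are restricted, which is what allows template features to be cached across frames.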

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Tracking | LaSOT | AUC | 73.3 | MixViT-L (ConvMAE) |
| Object Tracking | LaSOT | Normalized Precision | 82.8 | MixViT-L (ConvMAE) |
| Object Tracking | LaSOT | Precision | 80.3 | MixViT-L (ConvMAE) |
| Object Tracking | GOT-10k | Average Overlap | 75.7 | MixViT-L (ConvMAE) |
| Object Tracking | GOT-10k | Success Rate (overlap > 0.5) | 85.3 | MixViT-L (ConvMAE) |
| Object Tracking | GOT-10k | Success Rate (overlap > 0.75) | 75.1 | MixViT-L (ConvMAE) |
| Object Tracking | TrackingNet | Accuracy | 86.1 | MixViT-L (ConvMAE) |
| Object Tracking | TrackingNet | Normalized Precision | 90.3 | MixViT-L (ConvMAE) |
| Object Tracking | TrackingNet | Precision | 86.0 | MixViT-L (ConvMAE) |
| Object Tracking | VOT2022 | EAO | 0.589 | MixFormerM |
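For reading the table: the AUC ("success") metric is the area under the success curve, i.e. the fraction of frames whose predicted box overlaps the ground truth above an IoU threshold, averaged over thresholds from 0 to 1, and the GOT-10k "Success Rate" columns are single points on that curve. A minimal sketch, assuming boxes in (x, y, w, h) format; these helper names are illustrative, not taken from any benchmark toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0.0, 1.0, 21)):
    """Mean success rate over IoU thresholds (area under the success curve)."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = np.array([(overlaps > t).mean() for t in thresholds])
    return success.mean()
```

Under these assumptions, a "Success Rate (overlap > 0.5)" entry corresponds to `(overlaps > 0.5).mean()` over all frames; exact benchmark numbers depend on each toolkit's threshold grid and per-sequence averaging.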

Related Papers

- MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results (2025-07-17)
- YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association (2025-07-16)
- HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
- Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking (2025-07-07)
- UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions (2025-07-01)
- Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking (2025-06-30)
- Visual and Memory Dual Adapter for Multi-Modal Object Tracking (2025-06-30)
- R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning (2025-06-27)