Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, Haibin Ling

2021-12-02 · Visual Object Tracking · Visual Tracking · Representation Learning
Paper · PDF · Code (official)

Abstract

Recently, the Transformer has been widely explored in tracking and has shown state-of-the-art (SOTA) performance. However, existing efforts mainly focus on fusing and enhancing features generated by convolutional neural networks (CNNs); the potential of the Transformer in representation learning remains under-explored. In this paper, we aim to further unleash the power of the Transformer by proposing a simple yet efficient fully-attentional tracker, dubbed SwinTrack, within the classic Siamese framework. In particular, both representation learning and feature fusion in SwinTrack leverage the Transformer architecture, enabling better feature interactions for tracking than pure CNN or hybrid CNN-Transformer frameworks. In addition, to further enhance robustness, we present a novel motion token that embeds the historical target trajectory to improve tracking by providing temporal context. Our motion token is lightweight with negligible computation but brings clear gains. In our thorough experiments, SwinTrack exceeds existing approaches on multiple benchmarks. Particularly, on the challenging LaSOT, SwinTrack sets a new record with a 0.713 SUC score. It also achieves SOTA results on other benchmarks. We expect SwinTrack to serve as a solid baseline for Transformer tracking and facilitate future research. Our code and results are released at https://github.com/LitingLin/SwinTrack.
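The fully-attentional Siamese design described above can be illustrated with a minimal NumPy sketch: template and search tokens are fused by scaled dot-product attention, and a single extra "motion token" pooled from the recent trajectory is appended to the attended context. All names, dimensions, and the pooling/embedding choices here are illustrative assumptions, not the actual SwinTrack implementation (which uses a Swin Transformer backbone and multi-layer fusion).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no projections, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

d = 32
template_tokens = rng.normal(size=(16, d))  # tokens from the template (target) crop
search_tokens = rng.normal(size=(64, d))    # tokens from the search region

# Motion token: embed the historical target trajectory into one extra token.
# Here, 8 hypothetical past boxes (cx, cy, w, h) are mean-pooled and linearly
# embedded; the real design quantizes and embeds the trajectory differently.
trajectory = rng.normal(size=(8, 4))
W_motion = rng.normal(size=(4, d)) / np.sqrt(4)   # hypothetical embedding matrix
motion_token = trajectory.mean(axis=0) @ W_motion

# Fusion: search tokens attend over template tokens plus the motion token,
# so temporal context enters the same attention pathway as appearance.
context = np.vstack([template_tokens, motion_token])
fused = attention(search_tokens, context, context)

# A prediction head would then map each fused search token to a classification
# score and box offsets; here we only check the fused feature shape.
print(fused.shape)  # -> (64, 32)
```

The point of the sketch is structural: because the motion token is a single extra row in the attention context, its cost is negligible relative to the 80 appearance tokens, matching the paper's claim of "negligible computation but clear gains."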

Results

Identical results are listed under both the Object Tracking and Visual Object Tracking leaderboards:

Dataset     | Metric               | Value | Model
----------- | -------------------- | ----- | ---------------
LaSOT       | AUC                  | 70.2  | SwinTrack-B-384
LaSOT       | Normalized Precision | 78.4  | SwinTrack-B-384
LaSOT       | Precision            | 75.3  | SwinTrack-B-384
GOT-10k     | Average Overlap      | 69.4  | SwinTrack-B
GOT-10k     | Success Rate @ 0.5   | 78    | SwinTrack-B
GOT-10k     | Success Rate @ 0.75  | 64.3  | SwinTrack-B
TrackingNet | Accuracy             | 84    | SwinTrack-B-384
TrackingNet | Normalized Precision | 88.2  | SwinTrack-B-384
TrackingNet | Precision            | 83.2  | SwinTrack-B-384
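The metrics in the table are all derived from the per-frame overlap between predicted and ground-truth boxes. A short sketch of how success rate at a fixed IoU threshold (as in GOT-10k's SR@0.5/SR@0.75) and the AUC-style success score are typically computed; the toy boxes and the 21-point threshold grid are illustrative assumptions, not the official benchmark toolkit:

```python
import numpy as np

def iou(a, b):
    # a, b: arrays of boxes in (x1, y1, x2, y2) format, one row per frame.
    x1 = np.maximum(a[:, 0], b[:, 0]); y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 2], b[:, 2]); y2 = np.minimum(a[:, 3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def success_rate(ious, thr):
    # Fraction of frames whose overlap exceeds the threshold (e.g. 0.5, 0.75).
    return float((ious > thr).mean())

def auc_success(ious, n=21):
    # Area under the success curve: mean success rate over a grid of
    # IoU thresholds in [0, 1] (grid size is an assumption here).
    return float(np.mean([(ious > t).mean() for t in np.linspace(0, 1, n)]))

# Toy 3-frame sequence: perfect match, partial overlap, complete miss.
pred = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [0, 0, 4, 4]], float)
gt   = np.array([[0, 0, 10, 10], [0, 0, 10, 10], [20, 20, 30, 30]], float)

ov = iou(pred, gt)
print(round(success_rate(ov, 0.5), 3))   # -> 0.333 (only the perfect frame passes)
print(round(success_rate(ov, 0.75), 3))  # -> 0.333
```

GOT-10k's Average Overlap is simply `ov.mean()` over the sequence, while LaSOT's AUC/SUC averages `auc_success` over all test sequences; precision metrics instead threshold the center-point distance between boxes.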

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)
Dual Dimensions Geometric Representation Learning Based Document Dewarping (2025-07-11)