Visual Prompt Multi-Modal Tracking

Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu

2023-03-20 | CVPR 2023 | Tasks: RGB-T Tracking, Object Tracking

Abstract

Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural approach to multi-modal tracking is full fine-tuning of the RGB-based parameters. Although effective, this approach is suboptimal because downstream data are scarce and transferability is poor. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, while introducing only a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
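The parameter-efficiency claim in the abstract (a frozen backbone plus a small set of trainable prompts, under 1% of all parameters) can be illustrated with a rough parameter budget. The sketch below is not ViPT's architecture: the backbone dimensions (width 768, depth 12, similar to a ViT-Base encoder), the `prompt_block_params` bottleneck design, and every name in the code are assumptions chosen only to show how the trainable fraction would be computed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A named group of parameters with a trainable/frozen flag."""
    name: str
    num_params: int
    trainable: bool


def transformer_layer_params(dim: int, mlp_ratio: int = 4) -> int:
    """Parameter count of one standard transformer encoder layer."""
    attn = 4 * dim * dim + 4 * dim                     # q, k, v, output projections + biases
    hidden = mlp_ratio * dim
    mlp = dim * hidden + hidden + hidden * dim + dim   # two linear layers + biases
    norms = 2 * 2 * dim                                # two LayerNorms (scale + shift)
    return attn + mlp + norms


def prompt_block_params(dim: int, bottleneck: int) -> int:
    """Hypothetical prompt block: project down to a bottleneck, then back up."""
    down = dim * bottleneck + bottleneck
    up = bottleneck * dim + dim
    return down + up


def build_model(dim: int = 768, depth: int = 12, bottleneck: int = 8) -> list:
    """Pair each frozen backbone layer with one small trainable prompt block."""
    blocks = []
    for i in range(depth):
        blocks.append(Block(f"backbone.layer{i}", transformer_layer_params(dim), False))
        blocks.append(Block(f"prompt.layer{i}", prompt_block_params(dim, bottleneck), True))
    return blocks


def trainable_fraction(blocks: list) -> float:
    """Share of all parameters that would receive gradient updates."""
    total = sum(b.num_params for b in blocks)
    trainable = sum(b.num_params for b in blocks if b.trainable)
    return trainable / total


if __name__ == "__main__":
    model = build_model()
    print(f"trainable fraction: {trainable_fraction(model):.4%}")
```

Under these assumed sizes the trainable share stays well below 1%, which matches the order of magnitude stated in the abstract. The real figure depends on the actual prompt design in the released code.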

Results

Task	Dataset	Metric	Value	Model
Visual Tracking	LasHeR	Precision	65.1	ViPT
Visual Tracking	LasHeR	Success	52.5	ViPT
Visual Tracking	RGBT234	Precision	83.5	ViPT
Visual Tracking	RGBT234	Success	61.7	ViPT
