Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

Yunfeng Li, Bo Wang, Ye Li, Zhiwen Yu, Liang Wang

2024-05-06 | RGB-T Tracking

Abstract

How to better fuse cross-modal features is the core issue of RGB-T tracking. Previous methods either fuse RGB and TIR features insufficiently or depend on intermediaries containing information from both modalities to achieve cross-modal interaction. The former do not fully exploit the potential of using only the RGB and TIR information of the template or search region for channel and spatial feature fusion; the latter lack direct interaction between the template and search area, which limits the model's ability to fully exploit the original semantic information of both modalities. To alleviate these limitations, we explore how to improve the performance of a visual Transformer through direct fusion of cross-modal channel and spatial features, and propose CSTNet. CSTNet uses ViT as a backbone and inserts cross-modal channel feature fusion modules (CFM) and cross-modal spatial feature fusion modules (SFM) for direct interaction between RGB and TIR features. The CFM performs joint channel enhancement and joint multilevel spatial feature modeling of RGB and TIR features in parallel, sums the resulting features, and then globally integrates the summed feature with the original features. The SFM uses cross-attention to model the spatial relationship of cross-modal features and then introduces a convolutional feedforward network for joint spatial and channel integration of multimodal features. We remove the CFM and SFM and retrain the model using CSTNet as the pre-training weights, and propose CSTNet-small, which achieves a 36% reduction in parameters, a 24% reduction in FLOPs, and a 50% speedup with a 1-2% performance decrease. Comprehensive experiments show that CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks. Code is available at https://github.com/LiYunfengLYF/CSTNet.
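The cross-attention step that the abstract attributes to the SFM can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (see the linked repository for that). It assumes a single attention head, no learned query/key/value projections, attention applied in both directions (RGB attends to TIR and TIR attends to RGB), and a plain residual sum as the fusion step. It omits the convolutional feedforward network entirely. The names `cross_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, context_tokens):
    """Single-head scaled dot-product cross-attention.

    Assumption: tokens are used directly as queries, keys, and values
    (no learned projections), purely to keep the sketch short.
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in query_tokens:
        # Similarity of this query token to every token of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in context_tokens]
        weights = softmax(scores)
        # Weighted average of the other modality's tokens.
        attended.append([
            sum(w * v[d] for w, v in zip(weights, context_tokens))
            for d in range(dim)
        ])
    return attended


def fuse(rgb_tokens, tir_tokens):
    """Hypothetical bidirectional fusion: each modality attends to the
    other, and the result is added back to the original tokens."""
    rgb_from_tir = cross_attention(rgb_tokens, tir_tokens)
    tir_from_rgb = cross_attention(tir_tokens, rgb_tokens)
    fused_rgb = [[a + b for a, b in zip(orig, att)]
                 for orig, att in zip(rgb_tokens, rgb_from_tir)]
    fused_tir = [[a + b for a, b in zip(orig, att)]
                 for orig, att in zip(tir_tokens, tir_from_rgb)]
    return fused_rgb, fused_tir


# Toy input: three 2-dimensional tokens per modality (made-up values).
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tir = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
fused_rgb, fused_tir = fuse(rgb, tir)
```

Each fused token keeps its original value and adds a convex combination of the other modality's tokens, so the output has the same shape as the input. A real module would add learned projections, multiple heads, and normalization.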

Results

Task             Dataset   Metric     Value  Model
Visual Tracking  LasHeR    Precision  71.5   CSTNet
Visual Tracking  LasHeR    Success    57.2   CSTNet
Visual Tracking  RGBT234   Precision  88.4   CSTNet
Visual Tracking  RGBT234   Success    65.2   CSTNet
Visual Tracking  RGBT210   Precision  86     CSTNet
Visual Tracking  RGBT210   Success    63.5   CSTNet

Related Papers

Lightweight RGB-T Tracking with Mobile Vision Transformers (2025-06-23)
Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking (2025-05-06)
Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking (2025-03-14)
Adaptive Perception for Unified Visual Multi-modal Object Tracking (2025-02-10)
BTMTrack: Robust RGB-T Tracking via Dual-template Bridging and Temporal-Modal Candidate Elimination (2025-01-07)
PURA: Parameter Update-Recovery Test-Time Adaption for RGB-T Tracking (2025-01-01)
SUTrack: Towards Simple and Unified Single Object Tracking (2024-12-26)
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking (2024-12-20)