Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, Kailun Yang
The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, shows strong potential in perception tasks, its training paradigm makes it unsuitable for RGB-T tasks. To address this challenge, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB-Thermal perception. Our framework consists of two key components: (1) a Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2's inherent RGB bias; and (2) a Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combines it with category embeddings to amplify cross-modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% mIoU on PST900 and 67.8% mIoU on FMB. The framework facilitates the adaptation of pre-trained large models to RGB-T segmentation tasks, mitigating the high cost of data collection while giving robotic systems comprehensive perception capabilities. The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet.
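To make the idea of text-guided modality balancing concrete, the following is a minimal sketch and not the authors' SACF implementation. It assumes each modality and the text prompt have already been encoded as plain feature vectors of equal length; the function names `cosine` and `text_guided_fusion` and the `temperature` parameter are hypothetical, introduced here only for illustration. The affinity between each modality and the text embedding is turned into a pair of weights with a softmax, and the weights blend the two feature vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def text_guided_fusion(rgb_feat, thermal_feat, text_emb, temperature=1.0):
    """Blend RGB and thermal features using their affinity to a text embedding.

    Illustrative assumption: the real SACF module is learned and operates on
    feature maps; this toy version works on single vectors with fixed math.
    """
    s_rgb = cosine(rgb_feat, text_emb) / temperature
    s_th = cosine(thermal_feat, text_emb) / temperature

    # Numerically stable two-way softmax over the affinity scores.
    m = max(s_rgb, s_th)
    e_rgb, e_th = math.exp(s_rgb - m), math.exp(s_th - m)
    w_rgb = e_rgb / (e_rgb + e_th)
    w_th = 1.0 - w_rgb

    fused = [w_rgb * r + w_th * t for r, t in zip(rgb_feat, thermal_feat)]
    return fused, (w_rgb, w_th)


# Toy example: the text embedding aligns with the thermal feature,
# so the thermal branch receives the larger weight.
fused, (w_rgb, w_th) = text_guided_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

The point of the sketch is only that the weighting depends on the text prompt rather than being fixed, which is one way a fusion step could avoid always favoring the RGB branch.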
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | FMB Dataset | mIoU | 67.8 | SHIFNet (RGB-Infrared) |
| Semantic Segmentation | PST900 | mIoU | 89.8 | SHIFNet |
| Semantic Segmentation | MFN Dataset | mIoU | 59.2 | SHIFNet |
| Scene Segmentation | PST900 | mIoU | 89.8 | SHIFNet |
| Scene Segmentation | MFN Dataset | mIoU | 59.2 | SHIFNet |
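
The mIoU values in the table are mean intersection-over-union scores in percent. As a reference for how such a score is typically computed, here is a minimal standard-library sketch over flattened label lists. It is not the evaluation code used for SHIFNet: the function name `mean_iou` and the `ignore_index` parameter are assumptions for illustration, and benchmarks differ in details such as whether classes absent from both prediction and ground truth are skipped (as done here).

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels
    (for example, flattened segmentation maps).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if ignore_index is not None and t == ignore_index:
            continue  # skip pixels marked as unlabeled
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch counts toward the union of both classes involved.
            union[p] += 1
            union[t] += 1

    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Six pixels, three classes: per-class IoUs are 1/3, 2/3, and 1/2.
score = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(f"mIoU = {100 * score:.1f}%")  # prints "mIoU = 50.0%"
```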