Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion) tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features while keeping the single-modal transformer architecture largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images. Our code is available at https://github.com/yikaiw/TokenFusion.
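The core substitution step described above can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: the scoring head is reduced to a single linear-plus-sigmoid weight vector (`score_w_a`), the inter-modal projection to one matrix (`proj_ba`), and the threshold `thresh` is a hypothetical hyperparameter. The key idea it shows is that uninformative tokens of one modality are replaced, at the same positions, by projected tokens from the other modality, so positional alignment is preserved.

```python
import numpy as np

def token_fusion_step(tokens_a, tokens_b, score_w_a, proj_ba, thresh=0.02):
    """One TokenFusion-style substitution step (simplified sketch).

    tokens_a, tokens_b: (N, D) token sequences from two spatially aligned
        modalities (e.g. RGB and depth patches at the same locations).
    score_w_a: (D,) weights of a hypothetical scoring head for modality A.
    proj_ba: (D, D) hypothetical projection mapping B-tokens into A's space.
    thresh: score below which an A-token is treated as uninformative.
    """
    # Score each A-token's informativeness with a sigmoid over a linear map
    # (the paper uses a small learned scoring function; this is a stand-in).
    scores = 1.0 / (1.0 + np.exp(-(tokens_a @ score_w_a)))  # (N,)
    mask = scores < thresh  # uninformative tokens to be substituted

    fused = tokens_a.copy()
    # Substitute uninformative A-tokens with projected B-tokens at the
    # same token indices, keeping the inter-modal alignment explicit.
    fused[mask] = tokens_b[mask] @ proj_ba
    return fused, mask
```

In the actual method this happens inside the transformer layers and the scoring is trained end-to-end with a sparsity constraint; the sketch only captures the detect-and-substitute mechanism.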
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | KITTI-360 | mIoU | 57.44 | TokenFusion (RGB-Depth) |
| Semantic Segmentation | KITTI-360 | mIoU | 54.55 | TokenFusion (RGB-LiDAR) |
| Semantic Segmentation | LLRGBD-synthetic | mIoU | 64.75 | TokenFusion (SegFormer-B2) |
| Semantic Segmentation | DeLiVER | mIoU | 60.25 | TokenFusion (RGB-Depth) |
| Semantic Segmentation | DeLiVER | mIoU | 53.01 | TokenFusion (RGB-LiDAR) |
| Semantic Segmentation | DeLiVER | mIoU | 45.63 | TokenFusion (RGB-Event) |
| 3D Object Detection | SUN-RGBD val | mAP@0.25 | 64.9 | TokenFusion |
| 3D Object Detection | SUN-RGBD val | mAP@0.5 | 48.3 | TokenFusion |
| 3D Object Detection | ScanNetV2 | mAP@0.25 | 70.8 | TokenFusion |
| 3D Object Detection | ScanNetV2 | mAP@0.5 | 54.2 | TokenFusion |