Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion) tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features while keeping the single-modal transformer architecture largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images. Our code is available at https://github.com/yikaiw/TokenFusion.
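The core substitution step described above can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: the scoring head is reduced to a single linear-plus-sigmoid weight vector (`score_w_a`), the inter-modal projection to one matrix (`proj_ba`), and the threshold `thresh` is a hypothetical hyperparameter. The key idea it shows is that uninformative tokens of one modality are replaced, at the same positions, by projected tokens from the other modality, so positional alignment is preserved.

```python
import numpy as np

def token_fusion_step(tokens_a, tokens_b, score_w_a, proj_ba, thresh=0.02):
    """One TokenFusion-style substitution step (simplified sketch).

    tokens_a, tokens_b: (N, D) token sequences from two spatially aligned
        modalities (e.g. RGB and depth patches at the same locations).
    score_w_a: (D,) weights of a hypothetical scoring head for modality A.
    proj_ba: (D, D) hypothetical projection mapping B-tokens into A's space.
    thresh: score below which an A-token is treated as uninformative.
    """
    # Score each A-token's informativeness with a sigmoid over a linear map
    # (the paper uses a small learned scoring function; this is a stand-in).
    scores = 1.0 / (1.0 + np.exp(-(tokens_a @ score_w_a)))  # (N,)
    mask = scores < thresh  # uninformative tokens to be substituted

    fused = tokens_a.copy()
    # Substitute uninformative A-tokens with projected B-tokens at the
    # same token indices, keeping the inter-modal alignment explicit.
    fused[mask] = tokens_b[mask] @ proj_ba
    return fused, mask
```

In the actual method this happens inside the transformer layers and the scoring is trained end-to-end with a sparsity constraint; the sketch only captures the detect-and-substitute mechanism.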
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | KITTI-360 | mIoU | 57.44 | TokenFusion (RGB-Depth) |
| Semantic Segmentation | KITTI-360 | mIoU | 54.55 | TokenFusion (RGB-LiDAR) |
| Semantic Segmentation | LLRGBD-synthetic | mIoU | 64.75 | TokenFusion (SegFormer-B2) |
| Semantic Segmentation | DeLiVER | mIoU | 60.25 | TokenFusion (RGB-Depth) |
| Semantic Segmentation | DeLiVER | mIoU | 53.01 | TokenFusion (RGB-LiDAR) |
| Semantic Segmentation | DeLiVER | mIoU | 45.63 | TokenFusion (RGB-Event) |
| 3D Object Detection | SUN-RGBD val | mAP@0.25 | 64.9 | TokenFusion |
| 3D Object Detection | SUN-RGBD val | mAP@0.5 | 48.3 | TokenFusion |
| 3D Object Detection | ScanNetV2 | mAP@0.25 | 70.8 | TokenFusion |
| 3D Object Detection | ScanNetV2 | mAP@0.5 | 54.2 | TokenFusion |