Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Multimodal Token Fusion for Vision Transformers

Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang

2022-04-19 | journal 2022 7
Tasks: Semantic Segmentation, Object Detection, 3D Object Detection, Image-to-Image Translation

Abstract

Many adaptations of transformers have emerged to address the single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve the performance, yet the inner-modal attentive weights may also be diluted, which could thus undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Our code is available at https://github.com/yikaiw/TokenFusion.
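The substitution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how tokens are scored or projected, so the per-token importance scores, the `threshold` value, and the `project` function below are all assumptions introduced for illustration. The sketch also assumes the two modalities have the same number of tokens, aligned by position.

```python
from typing import Callable, List, Sequence

Token = List[float]


def fuse_tokens(
    tokens_a: Sequence[Token],
    tokens_b: Sequence[Token],
    scores_a: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
    project: Callable[[Token], Token] = lambda t: list(t),  # hypothetical projection
) -> List[Token]:
    """Return a copy of tokens_a in which every token whose score falls below
    `threshold` is replaced by the projected token of modality B at the same
    position. Tokens at or above the threshold are kept unchanged."""
    if not (len(tokens_a) == len(tokens_b) == len(scores_a)):
        raise ValueError("token and score sequences must have equal length")
    fused: List[Token] = []
    for tok_a, tok_b, score in zip(tokens_a, tokens_b, scores_a):
        if score >= threshold:
            fused.append(list(tok_a))     # informative token: keep it
        else:
            fused.append(project(tok_b))  # uninformative: substitute inter-modal feature
    return fused


# Toy example: three aligned tokens per modality, two features each.
rgb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
depth = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
rgb_scores = [0.9, 0.1, 0.7]  # assumed to come from some learned scoring function

fused_rgb = fuse_tokens(
    rgb, depth, rgb_scores, threshold=0.5,
    project=lambda t: [0.5 * x for x in t],  # stand-in for a learned projection
)
print(fused_rgb)  # [[1.0, 1.0], [10.0, 10.0], [3.0, 3.0]]
```

Only the second RGB token scores below the threshold, so it alone is replaced by the projected depth token; the other tokens pass through untouched, which mirrors the abstract's point that the single-modal architecture stays largely intact.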

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | KITTI-360 | mIoU | 57.44 | TokenFusion (RGB-Depth) |
| Semantic Segmentation | KITTI-360 | mIoU | 54.55 | TokenFusion (RGB-LiDAR) |
| Semantic Segmentation | LLRGBD-synthetic | mIoU | 64.75 | TokenFusion (SegFormer-B2) |
| Semantic Segmentation | DeLiVER | mIoU | 60.25 | TokenFusion (RGB-Depth) |
| Semantic Segmentation | DeLiVER | mIoU | 53.01 | TokenFusion (RGB-LiDAR) |
| Semantic Segmentation | DeLiVER | mIoU | 45.63 | TokenFusion (RGB-Event) |
| 3D Object Detection | SUN-RGBD val | mAP@0.25 | 64.9 | TokenFusion |
| 3D Object Detection | SUN-RGBD val | mAP@0.5 | 48.3 | TokenFusion |
| 3D Object Detection | ScanNetV2 | mAP@0.25 | 70.8 | TokenFusion |
| 3D Object Detection | ScanNetV2 | mAP@0.5 | 54.2 | TokenFusion |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)