GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen

2024-06-03
Tasks: Semantic Segmentation, Object Detection, 3D Object Detection, Image-to-Image Translation

Abstract

Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods, which replace less informative tokens with inter-modal features, and demonstrates that exchange-based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount the computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection, and arbitrary-modal semantic segmentation tasks, covering RGB, depth, LiDAR, and event data, among others, demonstrate the superior performance of GeminiFusion against leading-edge techniques. The PyTorch code is available at https://github.com/JiaDingCN/GeminiFusion
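
To make the linear-complexity claim concrete, the following is a minimal sketch of pixel-wise fusion, not the authors' implementation (see the official repository for that). It assumes two spatially aligned token sequences and lets each token attend to only two keys: its own feature shifted by a per-layer noise vector, and the feature at the same position in the other modality. The function names `pixelwise_fusion` and `dot`, the `noise` argument, and the omission of learned query/key/value projections are all illustrative assumptions. Because every token sees a constant number of keys, the cost grows linearly with the number of tokens, whereas full cross-attention compares every token with every other token.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def pixelwise_fusion(x, y, noise):
    """Fuse two aligned token sequences one position at a time.

    x, y  : lists of N feature vectors (lists of floats), spatially aligned
    noise : vector added to the self key; a hypothetical stand-in for a
            learned, layer-specific noise term
    Returns a list of N fused feature vectors.
    """
    scale = 1.0 / math.sqrt(len(noise))
    fused = []
    for xi, yi in zip(x, y):  # one pass over the tokens: O(N)
        # Two keys per token: noisy self feature and the aligned cross feature.
        k_self = [a + n for a, n in zip(xi, noise)]
        s_self = scale * dot(xi, k_self)
        s_cross = scale * dot(xi, yi)
        # Numerically stable softmax over the two scores.
        m = max(s_self, s_cross)
        e_self = math.exp(s_self - m)
        e_cross = math.exp(s_cross - m)
        w_self = e_self / (e_self + e_cross)
        w_cross = 1.0 - w_self
        # Values are the raw features here (no learned projections).
        fused.append([w_self * a + w_cross * b for a, b in zip(xi, yi)])
    return fused


if __name__ == "__main__":
    rgb = [[1.0, 0.0], [0.0, 2.0]]
    depth = [[3.0, 0.0], [0.0, 4.0]]
    print(pixelwise_fusion(rgb, depth, noise=[0.0, 0.0]))
```

In this sketch the noise vector changes the self score and therefore the balance between intra-modal and inter-modal information at a given layer; how that term is actually learned and applied is left to the paper and its code.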

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | DELIVER | mIoU | 66.9 | GeminiFusion
Semantic Segmentation | SUN-RGBD | Mean IoU | 54.6 | GeminiFusion (Swin-Large)
Semantic Segmentation | SUN-RGBD | Mean IoU | 53.3 | GeminiFusion (MiT-B5)
Semantic Segmentation | SUN-RGBD | Mean IoU | 52.7 | GeminiFusion (MiT-B3)
Semantic Segmentation | NYU Depth v2 | Mean IoU | 60.9 | GeminiFusion (Swin-Large)
Semantic Segmentation | NYU Depth v2 | Mean IoU | 60.2 | GeminiFusion (Swin-Large)
Semantic Segmentation | NYU Depth v2 | Mean IoU | 57.7 | GeminiFusion (MiT-B5)
Semantic Segmentation | NYU Depth v2 | Mean IoU | 56.8 | GeminiFusion (MiT-B3)
