Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Jiaming Zhang, Huayao Liu, Kailun Yang, Xinxin Hu, Ruiping Liu, Rainer Stiefelhagen

2022-03-09

Tasks: Autonomous Vehicles, Thermal Image Segmentation, Camouflaged Object Segmentation, Scene Understanding, Segmentation, Semantic Segmentation, Multispectral Object Detection, Image Manipulation Localization, Pedestrian Detection, 3D Object Detection, Object Detection, Image Segmentation

Abstract

Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from a supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due to variations in sensor characteristics among different modalities. Unlike previous modality-specific methods, in this work, we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, which often include supplements as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify the features of the other modality. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performance on five RGB-Depth benchmarks, as well as RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. In addition, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets a new state of the art. The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.
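
The abstract describes CM-FRM only at a high level: features from one modality are used to rectify the features of the other. The following is a rough sketch of that idea, not the authors' module. It assumes channel-wise gating only, replaces any learned layers with a plain sigmoid over each channel's mean activation, and uses hypothetical names (`channel_gates`, `rectify_pair`, `lam`) that do not come from the paper or its repository.

```python
import math


def sigmoid(v):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def channel_gates(feat):
    """Return one gate in (0, 1) per channel.

    `feat` is a list of channels, each a flat list of spatial activations.
    Assumption: the gate is the sigmoid of the channel mean; a real module
    would learn this mapping instead.
    """
    return [sigmoid(sum(ch) / len(ch)) for ch in feat]


def rectify_pair(rgb, x, lam=0.5):
    """Bidirectional rectification: each modality receives a gated copy of
    the other modality's features as a residual correction.

    `lam` (hypothetical) scales how strongly the other modality contributes.
    """
    if len(rgb) != len(x):
        raise ValueError("both modalities need the same number of channels")
    g_rgb = channel_gates(rgb)
    g_x = channel_gates(x)
    rgb_out = [
        [r + lam * g_x[c] * v for r, v in zip(rgb[c], x[c])]
        for c in range(len(rgb))
    ]
    x_out = [
        [v + lam * g_rgb[c] * r for r, v in zip(rgb[c], x[c])]
        for c in range(len(rgb))
    ]
    return rgb_out, x_out
```

Both outputs keep the shape of their inputs, so a fusion stage such as the FFM mentioned above could consume the rectified pair in place of the raw features.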

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Autonomous Vehicles | DVTOD | mAP | 81.6 | CMX |
| Autonomous Vehicles | LLVIP | AP | 0.596 | CMX |
| Autonomous Vehicles | CVC14 | AP50 | 68.9 | CMX |
| Semantic Segmentation | US3D | mIoU | 84.63 | CMX |
| Semantic Segmentation | UPLight | mIoU | 92.13 | CMX (B2 RGB-AoLP) |
| Semantic Segmentation | UPLight | mIoU | 92.07 | CMX (B2 RGB-DoLP) |
| Semantic Segmentation | KITTI-360 | mIoU | 64.43 | CMX (RGB-Depth) |
| Semantic Segmentation | KITTI-360 | mIoU | 64.31 | CMX (RGB-LiDAR) |
| Semantic Segmentation | Porto | IoU | 72.85 | CMX |
| Semantic Segmentation | Replica | mIoU | 17 | CMX |
| Semantic Segmentation | DSEC | mIoU | 72.42 | CMX |
| Semantic Segmentation | SYN-UDTIRI | IoU | 93.31 | CMX |
| Semantic Segmentation | Synthetic Bathing Perception | mIoU | 94.2 | CMX-SRA |
| Semantic Segmentation | Synthetic Bathing Perception | mIoU | 88.23 | CMX |
| Semantic Segmentation | LLRGBD-synthetic | mIoU | 66.52 | CMX (SegFormer-B2) |
| Semantic Segmentation | Cityscapes val | mIoU | 82.6 | CMX (B4) |
| Semantic Segmentation | Cityscapes val | mIoU | 81.6 | CMX (B2) |
| Semantic Segmentation | SELMA | mIoU | 91.7 | CMX |
| Semantic Segmentation | ZJU-RGB-P | mIoU | 92.6 | CMX (B4 RGB-AoLP) |
| Semantic Segmentation | ZJU-RGB-P | mIoU | 92.2 | CMX (B2 RGB-DoLP) |
| Semantic Segmentation | DDD17 | mIoU | 71.88 | CMX |
| Semantic Segmentation | Event-based Segmentation Dataset | mIoU | 85.81 | CMX |
| Semantic Segmentation | SpectralWaste | mIoU | 58.2 | CMX (RGB-HYPER) |
| Semantic Segmentation | SpectralWaste | mIoU | 56.6 | CMX (RGB-HYPER3) |
| Semantic Segmentation | Potsdam | mIoU | 85.97 | CMX |
| Semantic Segmentation | TLCGIS | IoU | 84.14 | CMX |
| Semantic Segmentation | DeLiVER | mIoU | 62.67 | CMX (RGB-Depth) |
| Semantic Segmentation | DeLiVER | mIoU | 56.52 | CMX (RGB-Event) |
| Semantic Segmentation | DeLiVER | mIoU | 56.37 | CMX (RGB-LiDAR) |
| Semantic Segmentation | EventScape | mIoU | 64.28 | CMX (B4) |
| Semantic Segmentation | EventScape | mIoU | 61.9 | CMX (B2) |
| Semantic Segmentation | GAMUS | mIoU | 75.23 | CMX |
| Semantic Segmentation | Vaihingen | mIoU | 82.87 | CMX |
| Semantic Segmentation | BJRoad | IoU | 62.28 | CMX |
| Semantic Segmentation | Stanford2D3D - RGBD | Pixel Accuracy | 82.6 | CMX (SegFormer-B4) |
| Semantic Segmentation | Stanford2D3D - RGBD | mIoU | 62.1 | CMX (SegFormer-B4) |
| Semantic Segmentation | Stanford2D3D - RGBD | Pixel Accuracy | 82.3 | CMX (SegFormer-B2) |
| Semantic Segmentation | Stanford2D3D - RGBD | mIoU | 61.2 | CMX (SegFormer-B2) |
| Semantic Segmentation | Noisy RS RGB-T Dataset | mIoU | 56.1 | CMX (B4) |
| Semantic Segmentation | KP day-night | mIoU | 46.2 | CMX |
| Semantic Segmentation | RGB-T-Glass-Segmentation | MAE | 0.029 | CMX |
| Semantic Segmentation | MFN Dataset | mIoU | 59.7 | CMX (B4) |
| Semantic Segmentation | MFN Dataset | mIoU | 58.2 | CMX (B2) |
| Object Detection | DSEC | mAP | 29.1 | CMX |
| Object Detection | InOutDoor | AP | 62.3 | CMX |
| Object Detection | EventPed | AP | 58 | CMX |
| Object Detection | PKU-DDD17-Car | mAP50 | 80.4 | CMX |
| Object Detection | STCrowd | AP | 61 | CMX |
| Object Detection | PCOD_1200 | S-Measure | 0.922 | CMX |
| 3D | DSEC | mAP | 29.1 | CMX |
| 3D | InOutDoor | AP | 62.3 | CMX |
| 3D | EventPed | AP | 58 | CMX |
| 3D | PKU-DDD17-Car | mAP50 | 80.4 | CMX |
| 3D | STCrowd | AP | 61 | CMX |
| 3D | PCOD_1200 | S-Measure | 0.922 | CMX |
| Camouflaged Object Segmentation | PCOD_1200 | S-Measure | 0.922 | CMX |
| Object Segmentation | PCOD_1200 | S-Measure | 0.922 | CMX |
| 2D Classification | DSEC | mAP | 29.1 | CMX |
| 2D Classification | InOutDoor | AP | 62.3 | CMX |
| 2D Classification | EventPed | AP | 58 | CMX |
| 2D Classification | PKU-DDD17-Car | mAP50 | 80.4 | CMX |
| 2D Classification | STCrowd | AP | 61 | CMX |
| 2D Classification | PCOD_1200 | S-Measure | 0.922 | CMX |
| Pedestrian Detection | DVTOD | mAP | 81.6 | CMX |
| Pedestrian Detection | LLVIP | AP | 0.596 | CMX |
| Pedestrian Detection | CVC14 | AP50 | 68.9 | CMX |
| Scene Segmentation | Noisy RS RGB-T Dataset | mIoU | 56.1 | CMX (B4) |
| Scene Segmentation | KP day-night | mIoU | 46.2 | CMX |
| Scene Segmentation | RGB-T-Glass-Segmentation | MAE | 0.029 | CMX |
| Scene Segmentation | MFN Dataset | mIoU | 59.7 | CMX (B4) |
| Scene Segmentation | MFN Dataset | mIoU | 58.2 | CMX (B2) |
| 2D Object Detection | DSEC | mAP | 29.1 | CMX |
| 2D Object Detection | InOutDoor | AP | 62.3 | CMX |
| 2D Object Detection | EventPed | AP | 58 | CMX |
| 2D Object Detection | PKU-DDD17-Car | mAP50 | 80.4 | CMX |
| 2D Object Detection | STCrowd | AP | 61 | CMX |
| 2D Object Detection | PCOD_1200 | S-Measure | 0.922 | CMX |
| 2D Object Detection | Noisy RS RGB-T Dataset | mIoU | 56.1 | CMX (B4) |
| 2D Object Detection | KP day-night | mIoU | 46.2 | CMX |
| 2D Object Detection | RGB-T-Glass-Segmentation | MAE | 0.029 | CMX |
| 2D Object Detection | MFN Dataset | mIoU | 59.7 | CMX (B4) |
| 2D Object Detection | MFN Dataset | mIoU | 58.2 | CMX (B2) |
| Image Manipulation Localization | Columbia | Average Pixel F1 (Fixed threshold) | 0.884 | CMX (RGB+NP++) |
| Image Manipulation Localization | Columbia | Average Pixel F1 (Fixed threshold) | 0.872 | CMX (RGB+Bayar) |
| Image Manipulation Localization | Columbia | Average Pixel F1 (Fixed threshold) | 0.834 | CMX (RGB+SRM) |
| Image Manipulation Localization | COVERAGE | Average Pixel F1 (Fixed threshold) | 0.63 | CMX (RGB+SRM) |
| Image Manipulation Localization | COVERAGE | Average Pixel F1 (Fixed threshold) | 0.592 | CMX (RGB+Bayar) |
| Image Manipulation Localization | COVERAGE | Average Pixel F1 (Fixed threshold) | 0.577 | CMX (RGB+NP++) |
| Image Manipulation Localization | Casia V1+ | Average Pixel F1 (Fixed threshold) | 0.791 | CMX (RGB+SRM) |
| Image Manipulation Localization | Casia V1+ | Average Pixel F1 (Fixed threshold) | 0.774 | CMX (RGB+Bayar) |
| Image Manipulation Localization | Casia V1+ | Average Pixel F1 (Fixed threshold) | 0.761 | CMX (RGB+NP++) |
| Image Manipulation Localization | CocoGlide | Average Pixel F1 (Fixed threshold) | 0.585 | CMX (RGB+SRM) |
| Image Manipulation Localization | CocoGlide | Average Pixel F1 (Fixed threshold) | 0.566 | CMX (RGB+Bayar) |
| Image Manipulation Localization | CocoGlide | Average Pixel F1 (Fixed threshold) | 0.516 | CMX (RGB+NP++) |
| Image Manipulation Localization | DSO-1 | Average Pixel F1 (Fixed threshold) | 0.895 | CMX (RGB+NP++) |
| Image Manipulation Localization | DSO-1 | Average Pixel F1 (Fixed threshold) | 0.792 | CMX (RGB+SRM) |
| Image Manipulation Localization | DSO-1 | Average Pixel F1 (Fixed threshold) | 0.776 | CMX (RGB+Bayar) |
| 10-shot image generation | US3D | mIoU | 84.63 | CMX |
| 10-shot image generation | UPLight | mIoU | 92.13 | CMX (B2 RGB-AoLP) |
| 10-shot image generation | UPLight | mIoU | 92.07 | CMX (B2 RGB-DoLP) |
| 10-shot image generation | KITTI-360 | mIoU | 64.43 | CMX (RGB-Depth) |
| 10-shot image generation | KITTI-360 | mIoU | 64.31 | CMX (RGB-LiDAR) |
| 10-shot image generation | Porto | IoU | 72.85 | CMX |
| 10-shot image generation | Replica | mIoU | 17 | CMX |
| 10-shot image generation | DSEC | mIoU | 72.42 | CMX |
| 10-shot image generation | SYN-UDTIRI | IoU | 93.31 | CMX |
| 10-shot image generation | Synthetic Bathing Perception | mIoU | 94.2 | CMX-SRA |
| 10-shot image generation | Synthetic Bathing Perception | mIoU | 88.23 | CMX |
| 10-shot image generation | LLRGBD-synthetic | mIoU | 66.52 | CMX (SegFormer-B2) |
| 10-shot image generation | Cityscapes val | mIoU | 82.6 | CMX (B4) |
| 10-shot image generation | Cityscapes val | mIoU | 81.6 | CMX (B2) |
| 10-shot image generation | SELMA | mIoU | 91.7 | CMX |
| 10-shot image generation | ZJU-RGB-P | mIoU | 92.6 | CMX (B4 RGB-AoLP) |
| 10-shot image generation | ZJU-RGB-P | mIoU | 92.2 | CMX (B2 RGB-DoLP) |
| 10-shot image generation | DDD17 | mIoU | 71.88 | CMX |
| 10-shot image generation | Event-based Segmentation Dataset | mIoU | 85.81 | CMX |
| 10-shot image generation | SpectralWaste | mIoU | 58.2 | CMX (RGB-HYPER) |
| 10-shot image generation | SpectralWaste | mIoU | 56.6 | CMX (RGB-HYPER3) |
| 10-shot image generation | Potsdam | mIoU | 85.97 | CMX |
| 10-shot image generation | TLCGIS | IoU | 84.14 | CMX |
| 10-shot image generation | DeLiVER | mIoU | 62.67 | CMX (RGB-Depth) |
| 10-shot image generation | DeLiVER | mIoU | 56.52 | CMX (RGB-Event) |
| 10-shot image generation | DeLiVER | mIoU | 56.37 | CMX (RGB-LiDAR) |
| 10-shot image generation | EventScape | mIoU | 64.28 | CMX (B4) |
| 10-shot image generation | EventScape | mIoU | 61.9 | CMX (B2) |
| 10-shot image generation | GAMUS | mIoU | 75.23 | CMX |
| 10-shot image generation | Vaihingen | mIoU | 82.87 | CMX |
| 10-shot image generation | BJRoad | IoU | 62.28 | CMX |
| 10-shot image generation | Stanford2D3D - RGBD | Pixel Accuracy | 82.6 | CMX (SegFormer-B4) |
| 10-shot image generation | Stanford2D3D - RGBD | mIoU | 62.1 | CMX (SegFormer-B4) |
| 10-shot image generation | Stanford2D3D - RGBD | Pixel Accuracy | 82.3 | CMX (SegFormer-B2) |
| 10-shot image generation | Stanford2D3D - RGBD | mIoU | 61.2 | CMX (SegFormer-B2) |
| 10-shot image generation | Noisy RS RGB-T Dataset | mIoU | 56.1 | CMX (B4) |
| 10-shot image generation | KP day-night | mIoU | 46.2 | CMX |
| 10-shot image generation | RGB-T-Glass-Segmentation | MAE | 0.029 | CMX |
| 10-shot image generation | MFN Dataset | mIoU | 59.7 | CMX (B4) |
| 10-shot image generation | MFN Dataset | mIoU | 58.2 | CMX (B2) |
| 16k | DSEC | mAP | 29.1 | CMX |
| 16k | InOutDoor | AP | 62.3 | CMX |
| 16k | EventPed | AP | 58 | CMX |
| 16k | PKU-DDD17-Car | mAP50 | 80.4 | CMX |
| 16k | STCrowd | AP | 61 | CMX |
| 16k | PCOD_1200 | S-Measure | 0.922 | CMX |
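
Most segmentation rows in the results listing report mIoU. As a reminder of what that number measures, here is a minimal sketch of one common definition: per-class intersection over union, averaged over the classes that occur in the prediction or the ground truth, expressed as a percentage. The function name `mean_iou` is illustrative, and individual benchmarks may differ in details such as ignore labels or dataset-level accumulation, so this is not claimed to reproduce any listed value.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union (in percent) for flat label sequences.

    `pred` and `target` are equal-length sequences of integer class ids.
    Classes absent from both sequences are skipped rather than counted as 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return 100.0 * sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is about 58.33.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```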

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
- Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)