Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

Siqi Du, Weixi Wang, Renzhong Guo, Ruisheng Wang, Yibin Tian, Shengjun Tang

2023-09-25 · Representation Learning · Feature Selection · Quantization · Real-Time Semantic Segmentation · Segmentation · Semantic Segmentation
Paper · PDF · Code (official)

Abstract

Understanding indoor scenes is crucial for urban studies. Considering the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multimodal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multimodal feature extraction, reducing redundant parameters by optimizing the distribution of computational resources. To fuse the asymmetric multimodal features, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies. A Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is then introduced to further extract cross-modal representations. AsymFormer achieves competitive results with 54.1% mIoU on NYUv2 and 49.1% mIoU on SUN RGB-D. Notably, it reaches an inference speed of 65 FPS (79 FPS after mixed-precision quantization) on an RTX 3090, demonstrating that AsymFormer can strike a balance between high accuracy and efficiency.
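The abstract describes two fusion modules: a selection gate (LAFS) that weighs the two modalities against each other, and a cross-modal attention block (CMA) that further correlates them. As an illustration only, here is a minimal PyTorch sketch of what LAFS- and CMA-style fusion could look like; the module structure, names, and shapes are assumptions inferred from the abstract's wording, not the official AsymFormer implementation (see the linked official code for that).

# Illustrative sketch only: the module structure and names below are inferred
# from the abstract, not taken from the official AsymFormer code.
import torch
import torch.nn as nn

class LAFS(nn.Module):
    """Toy Local Attention-Guided Feature Selection: compute per-channel
    selection weights from both modalities, then softly fuse them."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global context per channel
            nn.Conv2d(2 * channels, channels, 1),    # mix RGB + depth statistics
            nn.Sigmoid(),                            # selection weights in (0, 1)
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([rgb, depth], dim=1))  # (B, C, 1, 1)
        return w * rgb + (1 - w) * depth               # attention-guided selection

class CMA(nn.Module):
    """Toy Cross-Modal Attention: fused features query the depth features."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, fused: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fused.shape
        q = fused.flatten(2).transpose(1, 2)    # (B, H*W, C) queries
        kv = depth.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w) + fused

# Usage: fuse 64-channel RGB and depth feature maps from one backbone stage.
rgb, depth = torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80)
out = CMA(64)(LAFS(64)(rgb, depth), depth)
print(out.shape)  # torch.Size([1, 64, 60, 80])

The residual connection in the CMA sketch is deliberate: the cross-attention output is added back onto the fused features, so the correlation embedding refines them rather than replacing them.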

Results

Task                  | Dataset      | Metric       | Value | Model
Semantic Segmentation | NYU Depth v2 | Speed (ms/f) | 15.3  | AsymFormer
Semantic Segmentation | NYU Depth v2 | mIoU         | 54.1  | AsymFormer
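A quick sanity check connects the table to the abstract: 1000 ms ÷ 15.3 ms per frame ≈ 65 FPS, matching the reported inference speed. For readers reproducing such numbers, below is a generic sketch of GPU latency measurement with PyTorch CUDA events under FP16 autocast; the function name and placeholder network are illustrative, and this is not the paper's exact benchmarking or quantization setup (autocast approximates, but is not identical to, the mixed-precision quantization the abstract mentions).

# Generic latency-measurement sketch, not the paper's exact protocol.
import torch

def benchmark_ms_per_frame(model: torch.nn.Module, x: torch.Tensor,
                           iters: int = 100) -> float:
    """Average GPU latency in ms/frame under FP16 autocast."""
    model.eval().cuda()
    x = x.cuda()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        for _ in range(10):                    # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()               # wait for all kernels to finish
    return start.elapsed_time(end) / iters     # milliseconds per frame

if torch.cuda.is_available():
    net = torch.nn.Conv2d(3, 40, 3, padding=1)  # placeholder for a real network
    ms = benchmark_ms_per_frame(net, torch.randn(1, 3, 480, 640))
    print(f"{ms:.1f} ms/frame  ->  {1000 / ms:.0f} FPS")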

Related Papers

Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation (2025-09-04)
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC (2025-07-18)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
mNARX+: A surrogate model for complex dynamical systems using manifold-NARX and automatic feature selection (2025-07-17)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)