


When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition

Ali Caglayan, Nevrez Imamoglu, Ahmet Burak Can, Ryosuke Nakamura

2020-04-26 · Scene Recognition · Object Recognition
Paper · PDF · Code (official)

Abstract

Object and scene recognition are two challenging but essential tasks in image understanding. In particular, the use of RGB-D sensors for these tasks has emerged as an important focus area for better visual understanding. Meanwhile, deep neural networks, specifically convolutional neural networks (CNNs), have become widespread and have been applied to many visual tasks by replacing hand-crafted features with effective deep features. However, how to effectively exploit deep features from a multi-layer CNN model remains an open problem. In this paper, we propose a novel two-stage framework that extracts discriminative feature representations from multi-modal RGB-D images for object and scene recognition tasks. In the first stage, a pretrained CNN model is employed as a backbone to extract visual features at multiple levels. The second stage efficiently maps these features into high-level representations with a fully randomized structure of recursive neural networks (RNNs). To cope with the high dimensionality of CNN activations, a random weighted pooling scheme is proposed by extending the idea of randomness in RNNs. Multi-modal fusion is performed through a soft voting approach, with weights computed from the individual recognition confidences (i.e., SVM scores) of the RGB and depth streams separately. This yields consistent class label estimates in the final RGB-D classification. Extensive experiments verify that the fully randomized structure of the RNN stage successfully encodes CNN activations into discriminative solid features. Comparative results on the popular Washington RGB-D Object and SUN RGB-D Scene datasets show that the proposed approach achieves superior or on-par performance compared to state-of-the-art methods in both object and scene recognition tasks. Code is available at https://github.com/acaglayan/CNN_randRNN.
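The pipeline the abstract outlines (multi-level CNN features → random weighted pooling → fully random recursive-network encoding → confidence-weighted RGB-D fusion) is concrete enough to sketch. The NumPy snippet below is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): `random_weighted_pool`, `random_rnn_encode`, and the margin-based confidence in `soft_vote` are simplified stand-ins for the paper's exact schemes.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_weighted_pool(feat, k, rng=rng):
    # Reduce the channel dimension C -> k with a fixed random projection.
    # Assumption: simple random linear pooling; the paper's random
    # weighted pooling may differ in detail.
    C, H, W = feat.shape
    P = rng.standard_normal((k, C)) / np.sqrt(C)
    return (P @ feat.reshape(C, H * W)).reshape(k, H, W)

def random_rnn_encode(feat, rng=rng):
    # Untrained (fully random) recursive neural network: repeatedly merge
    # groups of 4 child vectors into one parent via a fixed random weight
    # matrix and tanh, until a single C-dimensional vector remains.
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)                   # one C-dim vector per location
    W_rand = rng.standard_normal((C, 4 * C)) / np.sqrt(4 * C)
    while x.shape[1] > 1:
        pad = (-x.shape[1]) % 4                  # pad so children split into 4s
        if pad:
            x = np.concatenate([x, np.zeros((C, pad))], axis=1)
        children = x.reshape(C, -1, 4).transpose(1, 0, 2).reshape(-1, 4 * C)
        x = np.tanh(children @ W_rand.T).T       # (C, n_children / 4) parents
    return x[:, 0]

def soft_vote(rgb_scores, depth_scores):
    # Confidence-weighted soft voting over per-class SVM decision scores.
    # Assumption: confidence = margin between the top two class scores;
    # the paper derives weights from SVM scores but may use another rule.
    conf = lambda s: np.sort(s)[-1] - np.sort(s)[-2]
    w_rgb, w_d = conf(rgb_scores), conf(depth_scores)
    fused = (w_rgb * rgb_scores + w_d * depth_scores) / (w_rgb + w_d + 1e-12)
    return int(np.argmax(fused))

# Toy end-to-end run with stand-in CNN activations and SVM scores.
feat_rgb = rng.standard_normal((256, 7, 7))      # e.g. one CNN level, RGB stream
vec = random_rnn_encode(random_weighted_pool(feat_rgb, 64))
print(vec.shape)                                  # (64,)
print(soft_vote(rng.standard_normal(10), rng.standard_normal(10)))
```

The design point mirrored here is that none of the merge weights are trained: random recursive projections of this kind have been shown to preserve enough structure in CNN activations for a downstream linear classifier such as an SVM to separate classes.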

Results

Task               Dataset    Metric         Value   Model
Scene Recognition  SUN RGB-D  Accuracy (%)   60.7    ResNet101-RNN

Related Papers

GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing (2025-07-08)
Out-of-distribution detection in 3D applications: a review (2025-07-01)
SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds (2025-06-16)
Continual Hyperbolic Learning of Instances and Classes (2025-06-12)
DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects (2025-06-11)
Aligning Text, Images, and 3D Structure Token-by-Token (2025-06-09)
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving (2025-06-06)
Feature-Based Lie Group Transformer for Real-World Applications (2025-06-05)