Yingjie Zhai, Deng-Ping Fan, Jufeng Yang, Ali Borji, Ling Shao, Junwei Han, Liang Wang
Multi-level feature fusion is a fundamental topic in computer vision and has been exploited to detect, segment, and classify objects at various scales. When multi-level features meet multi-modal cues, choosing the optimal feature aggregation and multi-modal learning strategy becomes a pressing question. In this paper, we leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel cascaded refinement network. First, we propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS). Second, we introduce a depth-enhanced module (DEM) to excavate informative depth cues from the channel and spatial views. The RGB and depth modalities are then fused in a complementary way. Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is simple, efficient, and backbone-independent. Extensive experiments show that BBS-Net significantly outperforms eighteen SOTA models on eight challenging datasets under five evaluation measures, demonstrating the superiority of our approach ($\sim 4\%$ improvement in S-measure vs. the top-ranked model, DMRA-iccv2019). In addition, we provide a comprehensive analysis of the generalization ability of different RGB-D datasets and provide a powerful training set for future research.
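
To make the idea of a depth-enhanced module more concrete, the following is a minimal NumPy sketch of channel attention followed by spatial attention on a depth feature map, with the result added to the RGB feature. The abstract only states that depth cues are excavated "from the channel and spatial views" and fused "in a complementary way"; the specific choices here (max pooling, a two-layer perceptron, sigmoid gating, element-wise addition) and all function and parameter names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Reweight channels of a (C, H, W) feature map.

    Assumption: global max pooling followed by a two-layer perceptron
    (weights w1, w2 are hypothetical) and a sigmoid gate.
    """
    pooled = feat.max(axis=(1, 2))           # (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)    # ReLU, shape (C // r,)
    weights = sigmoid(w2 @ hidden)           # (C,), each in (0, 1)
    return feat * weights[:, None, None]


def spatial_attention(feat):
    """Reweight spatial positions of a (C, H, W) feature map.

    Assumption: max over channels followed directly by a sigmoid gate;
    a learned convolution could sit in between but is omitted here.
    """
    pooled = feat.max(axis=0)                # (H, W)
    return feat * sigmoid(pooled)[None, :, :]


def depth_enhanced_fusion(rgb_feat, depth_feat, w1, w2):
    """Enhance the depth feature, then add it to the RGB feature."""
    enhanced = spatial_attention(channel_attention(depth_feat, w1, w2))
    return rgb_feat + enhanced


# Toy example with random features and random (untrained) weights.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
rgb = rng.standard_normal((C, H, W))
depth = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 4, C)) * 0.1
w2 = rng.standard_normal((C, C // 4)) * 0.1
fused = depth_enhanced_fusion(rgb, depth, w1, w2)
```

Because both gates lie in (0, 1), the enhanced depth feature can only attenuate the raw depth response before it is added to the RGB feature, which is one simple way to suppress unreliable depth cues.
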
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | STERE | Average MAE | 0.041 | BBS-Net |
| Object Detection | STERE | S-Measure | 90.8 | BBS-Net |
| Object Detection | STERE | max E-Measure | 94.2 | BBS-Net |
| Object Detection | STERE | max F-Measure | 90.3 | BBS-Net |
| Object Detection | LFSD | Average MAE | 0.072 | BBS-Net |
| Object Detection | LFSD | S-Measure | 86.4 | BBS-Net |
| Object Detection | LFSD | max E-Measure | 90.1 | BBS-Net |
| Object Detection | LFSD | max F-Measure | 85.8 | BBS-Net |
| Object Detection | SIP | Average MAE | 0.055 | BBS-Net |
| Object Detection | SIP | S-Measure | 87.9 | BBS-Net |
| Object Detection | SIP | max E-Measure | 92.2 | BBS-Net |
| Object Detection | SIP | max F-Measure | 88.3 | BBS-Net |
| Object Detection | RGBD135 | Average MAE | 0.044 | BBS-Net |
| Object Detection | RGBD135 | S-Measure | 88.2 | BBS-Net |
| Object Detection | RGBD135 | max E-Measure | 91.9 | BBS-Net |
| Object Detection | RGBD135 | max F-Measure | 85.9 | BBS-Net |
| Object Detection | NLPR | Average MAE | 0.023 | BBS-Net |
| Object Detection | NLPR | S-Measure | 93 | BBS-Net |
| Object Detection | NLPR | max E-Measure | 96.1 | BBS-Net |
| Object Detection | NLPR | max F-Measure | 91.8 | BBS-Net |
| Object Detection | DES | Average MAE | 0.021 | BBS-Net |
| Object Detection | DES | S-Measure | 93.3 | BBS-Net |
| Object Detection | DES | max E-Measure | 96.6 | BBS-Net |
| Object Detection | DES | max F-Measure | 92.7 | BBS-Net |
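
As a reading aid for the MAE rows, the sketch below computes the mean absolute error between a predicted saliency map and its ground-truth mask, then averages it over a set of images. It assumes that both maps are scaled to [0, 1] and that "Average MAE" denotes the per-image MAE averaged over a dataset; the function names and the toy data are hypothetical. Lower MAE is better, whereas the other measures in the table are better when higher.

```python
def mae(pred, gt):
    """Mean absolute error between two equally sized 2-D maps in [0, 1]."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def average_mae(pairs):
    """Average the per-image MAE over (prediction, ground truth) pairs."""
    scores = [mae(pred, gt) for pred, gt in pairs]
    return sum(scores) / len(scores)


# Toy 2x2 example: two pixels are exact, two are off by 0.5.
pred = [[0.0, 1.0], [0.5, 0.5]]
gt = [[0, 1], [1, 0]]
score = mae(pred, gt)                              # 0.25
dataset_score = average_mae([(pred, gt), (gt, gt)])  # 0.125
```
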