Shuang Hao, Chunlin Zhong, He Tang
Depth and thermal information is beneficial for detecting salient objects alongside conventional RGB images. However, in dual-modal salient object detection (SOD) models, robustness against noisy inputs and missing modalities is crucial but rarely studied. To tackle this problem, we introduce the Conditional Dropout and LAnguage-driven (CoLA) framework, which comprises two core components. 1) Language-driven Quality Assessment (LQA): leveraging a pretrained vision-language model with a prompt learner, LQA recalibrates image contributions without requiring additional quality annotations, which effectively mitigates the impact of noisy inputs. 2) Conditional Dropout (CD): a learning method that strengthens the model's adaptability in scenarios with missing modalities while preserving its performance under complete modalities. CD serves as a plug-in training scheme that treats modality-missing as conditions, strengthening the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models under both modality-complete and modality-missing conditions. We will release source code upon acceptance.
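The abstract does not spell out how Conditional Dropout is implemented, so the following is only a rough sketch of the general idea of treating modality-missing as a condition during training. It is a generic modality-dropout routine, not the authors' method: the names `conditional_dropout`, `p_drop`, and the condition codes `COMPLETE`, `RGB_MISSING`, and `AUX_MISSING` are all hypothetical, and images are represented as nested lists purely to keep the example dependency-free.

```python
import random
from typing import List, Optional, Tuple

Image = List[List[float]]  # toy stand-in for an image tensor

# Hypothetical condition codes; not taken from the paper.
COMPLETE, RGB_MISSING, AUX_MISSING = 0, 1, 2


def zeros_like(img: Image) -> Image:
    """Return an all-zero image with the same shape as `img`."""
    return [[0.0 for _ in row] for row in img]


def conditional_dropout(
    rgb: Image,
    aux: Image,
    p_drop: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[Image, Image, int]:
    """Randomly blank one modality and report which condition resulted.

    With probability `p_drop`, one of the two inputs (chosen uniformly) is
    replaced by zeros. The returned condition code could then be fed to the
    model so that it knows which modality, if any, is missing.
    """
    rng = rng or random.Random()
    if rng.random() >= p_drop:
        return rgb, aux, COMPLETE
    if rng.random() < 0.5:
        return zeros_like(rgb), aux, RGB_MISSING
    return rgb, zeros_like(aux), AUX_MISSING
```

In a training loop under these assumptions, each sample would pass through `conditional_dropout` before the forward pass, so the model sees both complete and modality-missing inputs together with the matching condition code.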
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | NJU2K | Average MAE | 0.029 | CoLANet |
| Object Detection | NJU2K | S-Measure | 93.4 | CoLANet |
| Object Detection | NJU2K | max E-Measure | 94.7 | CoLANet |
| Object Detection | NJU2K | max F-Measure | 91.3 | CoLANet |
| Object Detection | STERE | Average MAE | 0.039 | CoLANet |
| Object Detection | STERE | S-Measure | 90.8 | CoLANet |
| Object Detection | STERE | max E-Measure | 94.1 | CoLANet |
| Object Detection | STERE | max F-Measure | 88.9 | CoLANet |
| Object Detection | SIP | Average MAE | 0.042 | CoLANet |
| Object Detection | SIP | S-Measure | 89.5 | CoLANet |
| Object Detection | SIP | max E-Measure | 93.5 | CoLANet |
| Object Detection | SIP | max F-Measure | 89.4 | CoLANet |
| Object Detection | NLPR | Average MAE | 0.021 | CoLANet |
| Object Detection | NLPR | S-Measure | 93.5 | CoLANet |
| Object Detection | NLPR | max E-Measure | 95.7 | CoLANet |
| Object Detection | NLPR | max F-Measure | 90.9 | CoLANet |
| Object Detection | DES | Average MAE | 0.018 | CoLANet |
| Object Detection | DES | S-Measure | 93.5 | CoLANet |
| Object Detection | DES | max E-Measure | 96.3 | CoLANet |
| Object Detection | DES | max F-Measure | 92.5 | CoLANet |
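
The table does not define its metrics. As a minimal sketch, assuming "Average MAE" follows the usual SOD convention (the per-pixel mean absolute difference between a predicted saliency map and the ground-truth mask, both scaled to [0, 1], then averaged over the dataset), it could be computed as below. The function names are hypothetical, and maps are plain nested lists for illustration.

```python
from typing import List, Sequence

SaliencyMap = List[List[float]]  # values assumed to lie in [0, 1]


def mean_absolute_error(pred: SaliencyMap, gt: SaliencyMap) -> float:
    """Per-pixel mean absolute difference between one prediction and its mask."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def average_mae(preds: Sequence[SaliencyMap], gts: Sequence[SaliencyMap]) -> float:
    """Average the per-image MAE over a dataset."""
    scores = [mean_absolute_error(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

Lower values are better for this metric, whereas the S-Measure, E-Measure, and F-Measure rows are higher-is-better scores.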