Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka
This work presents EVP (Enhanced Visual Perception), a network architecture that builds on VPD, the earlier work that paved the way for using the Stable Diffusion network in computer vision tasks. We propose two major enhancements. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances feature learning by aggregating spatial information from higher pyramid levels. Second, we propose a novel image-text alignment module that improves feature extraction from the Stable Diffusion backbone. The resulting architecture is suitable for a wide variety of tasks; we demonstrate its performance on single-image depth estimation, using a specialized decoder with classification-based bins, and on referring segmentation, using an off-the-shelf decoder. Comprehensive experiments on established datasets show that EVP achieves state-of-the-art results in single-image depth estimation for indoor (NYU Depth v2, 11.8% RMSE improvement over VPD) and outdoor (KITTI) environments, as well as in referring segmentation (RefCOCO, 2.53 IoU improvement over ReLA). The code and pre-trained models are publicly available at https://github.com/Lavreniuk/EVP.
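To make the idea of a depth decoder with classification-based bins more concrete, here is a minimal sketch of one common formulation: each pixel gets a score per depth bin, and the predicted depth is the probability-weighted average of the bin centers. This is an illustration under stated assumptions, not the EVP decoder itself; the function name `depth_from_bins`, the uniform bin spacing, and the depth range are all hypothetical choices made for the example.

```python
import numpy as np

def depth_from_bins(logits, min_depth=0.0, max_depth=10.0):
    """Turn per-pixel bin scores into a depth map.

    logits: array of shape (K, H, W), one score per depth bin and pixel.
    Assumption: K bins spaced uniformly over [min_depth, max_depth];
    the actual EVP decoder may choose its bins differently.
    """
    k = logits.shape[0]
    edges = np.linspace(min_depth, max_depth, k + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])        # shape (K,)

    # Numerically stable softmax over the bin axis.
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=0, keepdims=True)       # shape (K, H, W)

    # Expected depth: sum over bins of probability * bin center.
    return np.tensordot(centers, probs, axes=(0, 0))  # shape (H, W)
```

With uniform scores, every bin is equally likely and the result is the midpoint of the depth range; with one dominant score, the result approaches that bin's center.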
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.976 | EVP |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.997 | EVP |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | EVP |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.224 | EVP |
| Depth Estimation | NYU-Depth V2 | absolute relative error | 0.061 | EVP |
| Depth Estimation | NYU-Depth V2 | log 10 | 0.027 | EVP |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.98 | EVP |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | EVP |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 1 | EVP |
| Depth Estimation | KITTI Eigen split | RMSE | 2.015 | EVP |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.073 | EVP |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.136 | EVP |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.048 | EVP |
| Referring Expression Segmentation | RefCOCO | IoU (%) | 77.61 | EVP |
| Referring Expression Segmentation | RefCOCO testA | Overall IoU | 78.75 | EVP |
| Referring Expression Segmentation | RefCOCO testB | Overall IoU | 72.94 | EVP |
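
For readers unfamiliar with the metric names in the table, the following sketch shows how these quantities are conventionally computed in the depth estimation and referring segmentation literature. It is not taken from the EVP code base; the function names `depth_metrics` and `overall_iou` are hypothetical, and details such as the valid-pixel mask are assumptions that may differ from the official evaluation scripts.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Conventional depth metrics over pixels with valid ground truth.

    Assumption: pixels with gt <= 0 are invalid and are ignored,
    and all predictions are strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]

    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "sq_rel": np.mean((pred - gt) ** 2 / gt),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        # Fraction of pixels whose ratio error is under each threshold.
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }

def overall_iou(pred_masks, gt_masks):
    """Overall IoU: total intersection over total union across samples."""
    inter = 0
    union = 0
    for p, g in zip(pred_masks, gt_masks):
        p = np.asarray(p, dtype=bool)
        g = np.asarray(g, dtype=bool)
        inter += np.logical_and(p, g).sum()
        union += np.logical_or(p, g).sum()
    return inter / union
```

Lower is better for the error metrics (absolute relative error, Sq Rel, RMSE, RMSE log, log 10), while higher is better for the Delta thresholds and for IoU. Overall IoU accumulates intersections and unions over the whole split before dividing, so large objects weigh more than they would in a per-sample mean.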