EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment

Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka

2023-12-13
Tasks: Referring Expression Segmentation, Depth Estimation, Monocular Depth Estimation

Abstract

This work presents the network architecture EVP (Enhanced Visual Perception). EVP builds on the previous work VPD, which paved the way for using the Stable Diffusion network in computer vision tasks. We propose two major enhancements. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances feature learning capabilities by aggregating spatial information from higher pyramid levels. Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone. The resulting architecture is suitable for a wide variety of tasks, and we demonstrate its performance on two of them: single-image depth estimation, using a specialized decoder with classification-based bins, and referring segmentation, using an off-the-shelf decoder. Comprehensive experiments conducted on established datasets show that EVP achieves state-of-the-art results in single-image depth estimation for indoor (NYU Depth v2, 11.8% RMSE improvement over VPD) and outdoor (KITTI) environments, as well as in referring segmentation (RefCOCO, 2.53 IoU improvement over ReLA). The code and pre-trained models are publicly available at https://github.com/Lavreniuk/EVP.
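The two headline numbers in the abstract are stated in different ways: the RMSE gain reads as a relative reduction, while the IoU gain reads as an absolute difference in points. The short Python sketch below works through that arithmetic under this interpretation. The function names are hypothetical, and the baseline values it prints are only what the abstract and the results table imply, not figures quoted from the VPD or ReLA papers.

```python
def relative_improvement(baseline, new):
    """Fractional reduction of a lower-is-better metric such as RMSE."""
    return (baseline - new) / baseline


def implied_baseline(new, improvement):
    """Invert relative_improvement to recover the baseline value."""
    return new / (1.0 - improvement)


# Values taken from this page.
evp_rmse = 0.224        # NYU Depth v2 RMSE from the results table
claimed_gain = 0.118    # "11.8% RMSE improvement over VPD" (assumed relative)
evp_iou = 77.61         # RefCOCO IoU from the results table
claimed_iou_gain = 2.53  # "2.53 IoU improvement over ReLA" (assumed absolute points)

# Baselines implied by those values (derived, not quoted from other papers).
baseline_rmse = implied_baseline(evp_rmse, claimed_gain)
baseline_iou = evp_iou - claimed_iou_gain

print(f"implied baseline RMSE: {baseline_rmse:.3f}")
print(f"implied baseline IoU:  {baseline_iou:.2f}")
```

If the implied baselines match the numbers reported by the respective baseline papers, the interpretation of the two claims above is consistent.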

Results

Task	Dataset	Metric	Value	Model
Depth Estimation	NYU-Depth V2	Delta < 1.25	0.976	EVP
Depth Estimation	NYU-Depth V2	Delta < 1.25^2	0.997	EVP
Depth Estimation	NYU-Depth V2	Delta < 1.25^3	0.999	EVP
Depth Estimation	NYU-Depth V2	RMSE	0.224	EVP
Depth Estimation	NYU-Depth V2	absolute relative error	0.061	EVP
Depth Estimation	NYU-Depth V2	log 10	0.027	EVP
Depth Estimation	KITTI Eigen split	Delta < 1.25	0.98	EVP
Depth Estimation	KITTI Eigen split	Delta < 1.25^2	0.998	EVP
Depth Estimation	KITTI Eigen split	Delta < 1.25^3	1	EVP
Depth Estimation	KITTI Eigen split	RMSE	2.015	EVP
Depth Estimation	KITTI Eigen split	RMSE log	0.073	EVP
Depth Estimation	KITTI Eigen split	Sq Rel	0.136	EVP
Depth Estimation	KITTI Eigen split	absolute relative error	0.048	EVP
Referring Expression Segmentation	RefCOCO	IoU (%)	77.61	EVP
Referring Expression Segmentation	RefCOCO testA	Overall IoU	78.75	EVP
Referring Expression Segmentation	RefCOCO testB	Overall IoU	72.94	EVP
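The metric names in the table follow the usual conventions of the depth estimation and referring segmentation literature. As a reading aid, the sketch below computes them in plain Python using their commonly used definitions. It is not taken from the EVP code base: the function names are hypothetical, inputs are assumed to be flat sequences of per-pixel values, and details such as valid-pixel masking, depth caps, and crops differ between benchmarks and are not modeled here.

```python
import math


def depth_metrics(pred, gt):
    """Common monocular depth metrics over flat sequences of depths.

    Assumption: pixels with non-positive ground truth or prediction are
    treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    n = len(pairs)
    # Threshold accuracy uses the symmetric ratio max(p/g, g/p).
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "delta1": sum(r < 1.25 for r in ratios) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratios) / n,
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
        ),
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
    }


def overall_iou(pred_masks, gt_masks):
    """Overall IoU: total intersection over total union across all samples.

    Each mask is a flat sequence of 0/1 values. Returns 0.0 if the union
    is empty (an assumption made for this sketch).
    """
    inter = union = 0
    for pm, gm in zip(pred_masks, gt_masks):
        inter += sum(1 for p, g in zip(pm, gm) if p and g)
        union += sum(1 for p, g in zip(pm, gm) if p or g)
    return inter / union if union else 0.0


m = depth_metrics(pred=[1.0, 2.0, 5.0], gt=[1.0, 2.0, 4.0])
print({k: round(v, 4) for k, v in m.items()})
print(overall_iou([[1, 1, 0, 0], [1, 0, 0, 0]], [[1, 0, 0, 0], [1, 1, 1, 0]]))
```

Under these definitions, the Delta thresholds are higher-is-better fractions in [0, 1], while RMSE, RMSE log, Sq Rel, absolute relative error, and log 10 are lower-is-better errors. Note that overall IoU accumulates intersections and unions over the whole split before dividing, so large objects weigh more than in a per-sample mean IoU.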
