Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment

Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka

2023-12-13 · Referring Expression Segmentation · Depth Estimation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

This work presents the network architecture EVP (Enhanced Visual Perception). EVP builds on the previous work VPD which paved the way to use the Stable Diffusion network for computer vision tasks. We propose two major enhancements. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module which enhances feature learning capabilities by aggregating spatial information from higher pyramid levels. Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone. The resulting architecture is suitable for a wide variety of tasks and we demonstrate its performance in the context of single-image depth estimation with a specialized decoder using classification-based bins and referring segmentation with an off-the-shelf decoder. Comprehensive experiments conducted on established datasets show that EVP achieves state-of-the-art results in single-image depth estimation for indoor (NYU Depth v2, 11.8% RMSE improvement over VPD) and outdoor (KITTI) environments, as well as referring segmentation (RefCOCO, 2.53 IoU improvement over ReLA). The code and pre-trained models are publicly available at https://github.com/Lavreniuk/EVP.
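The abstract mentions a specialized depth decoder "using classification-based bins". As a hedged illustration only (not the authors' exact implementation; function names and shapes are assumptions), such a decoder typically scores each pixel against a set of candidate depth bins and decodes depth as the probability-weighted average of the bin centers, in the style of AdaBins:

```python
import numpy as np

def depth_from_bins(bin_logits, bin_centers):
    """Classification-based depth decoding (illustrative sketch).

    bin_logits:  (H, W, K) per-pixel scores over K depth bins (hypothetical shapes)
    bin_centers: (K,) depth value of each bin, e.g. in metres
    Returns an (H, W) depth map as the softmax-weighted sum of bin centers.
    """
    z = bin_logits - bin_logits.max(axis=-1, keepdims=True)  # numeric stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)                       # per-pixel softmax over bins
    return (p * bin_centers).sum(axis=-1)                    # expected depth

# Toy check: a pixel whose logits strongly favour the 2 m bin decodes close to 2 m.
logits = np.zeros((1, 1, 3))
logits[0, 0, 1] = 10.0
centers = np.array([1.0, 2.0, 4.0])
depth = depth_from_bins(logits, centers)  # close to 2.0
```

Decoding depth as an expectation over bins keeps the output continuous while letting the network reason over a discrete, classification-style target.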

Results

Task | Dataset | Metric | Value | Model
Depth Estimation | NYU-Depth V2 | RMSE | 0.224 | EVP
Depth Estimation | NYU-Depth V2 | absolute relative error | 0.061 | EVP
Depth Estimation | NYU-Depth V2 | log10 | 0.027 | EVP
Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.976 | EVP
Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.997 | EVP
Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | EVP
Depth Estimation | KITTI Eigen split | RMSE | 2.015 | EVP
Depth Estimation | KITTI Eigen split | RMSE log | 0.073 | EVP
Depth Estimation | KITTI Eigen split | Sq Rel | 0.136 | EVP
Depth Estimation | KITTI Eigen split | absolute relative error | 0.048 | EVP
Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.98 | EVP
Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | EVP
Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 1.000 | EVP
Referring Expression Segmentation | RefCOCO | IoU (%) | 77.61 | EVP
Referring Expression Segmentation | RefCOCO testA | Overall IoU | 78.75 | EVP
Referring Expression Segmentation | RefCOCO testB | Overall IoU | 72.94 | EVP
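The depth rows above use the standard monocular-depth metrics (RMSE, absolute relative error, log10, and the threshold accuracies Delta < 1.25^k), and the segmentation rows use intersection-over-union. A minimal sketch of how these metrics are conventionally computed (function and key names here are illustrative):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    # ratio used for the Delta < 1.25^k threshold accuracies
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "RMSE": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

def iou(pred_mask, gt_mask):
    """Intersection-over-union of two boolean segmentation masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union else 0.0
```

A perfect depth prediction yields RMSE = 0 and delta1 = 1.0; the IoU numbers in the table (e.g. 77.61 on RefCOCO) are this ratio expressed as a percentage, averaged per the benchmark's protocol.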

Related Papers

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network (2025-07-15)
Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation (2025-07-15)
Cameras as Relative Positional Encoding (2025-07-14)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)