Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

2022-05-23

Tasks: Referring Expression · Visual Relationship Detection · Referring Expression Comprehension · Phrase Grounding · Visual Question Answering (VQA) · Visual Commonsense Reasoning · Language Modelling

Paper · PDF · Code (official)

Abstract

Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, where VLP models without reliance on object detectors are becoming the mainstream due to their superior computation efficiency and competitive performance. However, the removal of object detectors also deprives the capability of VLP models in explicit object modeling, which is essential to various position-sensitive vision-language (VL) tasks, such as referring expression comprehension and visual commonsense reasoning. To address the challenge, we introduce PEVL that enhances the pre-training and prompt tuning of VLP models with explicit object position modeling. Specifically, PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training, and also enables flexible prompt tuning for various downstream tasks. We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs. We make the data and code for this paper publicly available at https://github.com/thunlp/PEVL.
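The abstract's central idea is to express discretized object positions as ordinary text tokens so that grounding becomes part of language modeling. A minimal sketch of that reformulation is below; the bin count, token names, and splice format are assumptions for illustration, not the authors' actual preprocessing code.

```python
# Illustrative sketch (not the official PEVL code): discretize a bounding
# box into position tokens and splice them into the caption after the
# phrase they ground, so positions live in the same token stream as text.

def box_to_tokens(box, image_size, num_bins=32):
    """Map (x1, y1, x2, y2) pixel coordinates to discrete position tokens."""
    w, h = image_size
    tokens = []
    for coord, scale in zip(box, (w, h, w, h)):
        # Clamp to the last bin so coord == scale stays in range.
        bin_id = min(int(coord / scale * num_bins), num_bins - 1)
        tokens.append(f"[pos_{bin_id}]")
    return tokens

def annotate(text, span, box, image_size):
    """Insert position tokens right after the grounded phrase `span`."""
    pos = " ".join(box_to_tokens(box, image_size))
    return text.replace(span, f"{span} {pos}", 1)

print(annotate("a dog chasing a ball", "a dog",
               box=(16, 32, 128, 224), image_size=(256, 256)))
# → a dog [pos_2] [pos_4] [pos_16] [pos_28] chasing a ball
```

Because the positions are plain tokens, the same sequence can be consumed by a masked-language-modeling objective during pre-training or filled in as blanks during prompt tuning, which is the flexibility the abstract describes.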

Results

Task                            | Dataset                 | Metric   | Value | Model
Visual Question Answering (VQA) | GQA                     | Accuracy | 77    | PEVL+
Scene Parsing                   | Visual Genome           | R@100    | 66.3  | PEVL
Scene Parsing                   | Visual Genome           | R@50     | 64.4  | PEVL
Scene Parsing                   | Visual Genome           | mR@100   | 23.5  | PEVL
Scene Parsing                   | Visual Genome           | mR@50    | 21.7  | PEVL
Visual Reasoning                | VCR (Q-A) dev           | Accuracy | 75.1  | PEVL
Visual Reasoning                | VCR (Q-A) test          | Accuracy | 76    | PEVL
Visual Reasoning                | VCR (QA-R) dev          | Accuracy | 76.4  | PEVL
Visual Reasoning                | VCR (QA-R) test         | Accuracy | 76.7  | PEVL
Visual Reasoning                | VCR (Q-AR) dev          | Accuracy | 57.8  | PEVL
Visual Reasoning                | VCR (Q-AR) test         | Accuracy | 58.6  | PEVL
Visual Relationship Detection   | Visual Genome           | R@100    | 66.3  | PEVL
Visual Relationship Detection   | Visual Genome           | R@50     | 64.4  | PEVL
Visual Relationship Detection   | Visual Genome           | mR@100   | 23.5  | PEVL
Visual Relationship Detection   | Visual Genome           | mR@50    | 21.7  | PEVL
Phrase Grounding                | Flickr30k Entities Dev  | R@1      | 84.1  | PEVL
Phrase Grounding                | Flickr30k Entities Test | R@1      | 84.4  | PEVL
Scene Understanding             | Visual Genome           | R@100    | 66.3  | PEVL
Scene Understanding             | Visual Genome           | R@50     | 64.4  | PEVL
Scene Understanding             | Visual Genome           | mR@100   | 23.5  | PEVL
Scene Understanding             | Visual Genome           | mR@50    | 21.7  | PEVL
2D Semantic Segmentation        | Visual Genome           | R@100    | 66.3  | PEVL
2D Semantic Segmentation        | Visual Genome           | R@50     | 64.4  | PEVL
2D Semantic Segmentation        | Visual Genome           | mR@100   | 23.5  | PEVL
2D Semantic Segmentation        | Visual Genome           | mR@50    | 21.7  | PEVL
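The R@K and mR@K figures on Visual Genome are recall metrics: R@K asks how many ground-truth relationship triplets appear among the model's top-K predictions, while mR@K first computes recall per predicate class and then averages over classes, so rare predicates weigh as much as common ones. A minimal sketch of the distinction, with toy data (not from the paper):

```python
# Illustrative sketch of R@K vs. mR@K for relationship detection.
# Triplets are (subject, predicate, object); `preds` is ranked by score.
from collections import defaultdict

def recall_at_k(preds, gold, k):
    """R@K: fraction of gold triplets recovered in the top-k predictions."""
    topk = set(preds[:k])
    return len(topk & set(gold)) / len(gold)

def mean_recall_at_k(preds, gold, k):
    """mR@K: compute recall separately per predicate class, then average
    over classes, so rare predicates count as much as frequent ones."""
    topk = set(preds[:k])
    gold_by_pred = defaultdict(set)
    for triplet in gold:
        gold_by_pred[triplet[1]].add(triplet)
    per_class = [len(topk & triples) / len(triples)
                 for triples in gold_by_pred.values()]
    return sum(per_class) / len(per_class)

gold = [("dog", "on", "grass"), ("cat", "on", "mat"), ("man", "riding", "horse")]
preds = [("dog", "on", "grass"), ("cat", "on", "mat"), ("car", "on", "road")]
print(recall_at_k(preds, gold, 2))       # 2/3: two of three gold triplets found
print(mean_recall_at_k(preds, gold, 2))  # 0.5: "on" = 2/2, "riding" = 0/1
```

The gap between R@K and mR@K in the table above (66.3 vs. 23.5 at K=100) reflects exactly this: the predicate distribution in Visual Genome is long-tailed, so per-class averaging pulls the score down sharply.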

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)