Position-guided Text Prompt for Vision-Language Pre-training

Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

2022-12-19 · CVPR 2023
Tasks: Cross-Modal Retrieval, Zero-Shot Cross-Modal Retrieval, Visual Grounding, Image Captioning, Visual Reasoning, Retrieval
Links: Paper · PDF · Code (official)

Abstract

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks and identifies the objects in each block with the object detectors widely used in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g., filling in ``P'' or ``O'' in a PTP such as ``The block P has a O''. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning architectures and several benchmarks, e.g., zero-shot Flickr30K retrieval (+4.8 in average recall@1) for the ViLT \cite{vilt} baseline and COCO captioning (+5.3 in CIDEr) for the SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
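The prompt-construction step described in the abstract is straightforward to illustrate. The sketch below shows, in plain Python, how position-guided prompts could be built from detector outputs: each detected object is assigned to one of the $N\times N$ blocks by its box center, and a fill-in-the-blank sentence is emitted. This is a minimal sketch under assumed interfaces (the function names, the `(label, box)` tuple format, and the choice of n=3 are our illustrative assumptions, not the authors' released implementation; see the linked repository for that).

```python
# Illustrative sketch of PTP prompt construction (not the authors' code).

def block_index(cx, cy, width, height, n=3):
    """Map an object's center point to one of the n x n image blocks.

    Blocks are numbered 0 .. n*n-1 in row-major order.
    """
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row * n + col

def make_ptp_prompts(objects, width, height, n=3):
    """Turn detector outputs into fill-in-the-blank text prompts.

    `objects` is assumed to be a list of (label, (x1, y1, x2, y2)) pairs,
    e.g. from a pre-trained object detector of the kind common in VLP.
    """
    prompts = []
    for label, (x1, y1, x2, y2) in objects:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        p = block_index(cx, cy, width, height, n)
        # During pre-training, the model is asked to fill in either the
        # block id P or the object word O in this template.
        prompts.append(f"The block {p} has a {label}.")
    return prompts

# Example: a 640x480 image with two detected objects.
detections = [("dog", (40, 200, 180, 460)), ("ball", (500, 50, 560, 110))]
print(make_ptp_prompts(detections, width=640, height=480, n=3))
# -> ['The block 6 has a dog.', 'The block 2 has a ball.']
```

Because the prompts are ordinary text, they can be appended to the caption during pre-training and dropped entirely at inference, which is where the speed advantage over detector-dependent methods comes from.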

Results

Task | Dataset | Metric | Value | Model
Image Captioning | COCO Captions | BLEU-4 | 40.1 | PTP-BLIP (14M)
Image Captioning | COCO Captions | CIDEr | 135 | PTP-BLIP (14M)
Image Captioning | COCO Captions | METEOR | 30.4 | PTP-BLIP (14M)
Image Captioning | COCO Captions | SPICE | 23.7 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 81.5 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 95.9 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 97.9 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 64.9 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 87.4 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92.2 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 87.1 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 98.4 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.3 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 73.1 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 91 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 94.8 | PTP-BLIP (14M)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 69.7 | PTP-BLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 90 | PTP-BLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 94.7 | PTP-BLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 49.5 | PTP-BLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 75.9 | PTP-BLIP
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 84.2 | PTP-BLIP
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 81.5 | PTP-BLIP (14M)
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 95.9 | PTP-BLIP (14M)
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 97.9 | PTP-BLIP (14M)
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 64.9 | PTP-BLIP (14M)
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 87.4 | PTP-BLIP (14M)
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92.2 | PTP-BLIP (14M)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 81.5 | PTP-BLIP (14M)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 95.9 | PTP-BLIP (14M)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 97.9 | PTP-BLIP (14M)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 64.9 | PTP-BLIP (14M)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.4 | PTP-BLIP (14M)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.2 | PTP-BLIP (14M)
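For reference, the R@K values above are standard retrieval recall: the fraction of queries whose ground-truth match appears among the top-K ranked candidates. A minimal sketch of the computation, assuming a precomputed query-candidate similarity matrix (the function name and toy data are illustrative, not tied to this paper's evaluation code):

```python
import numpy as np

def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth item ranks in the top-k.

    similarity: (num_queries, num_candidates) score matrix
    gt_index:   (num_queries,) index of each query's ground-truth candidate
    """
    # Indices of the k highest-scoring candidates per query.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 queries, 4 candidates, ground truth at indices 0, 1, 3.
sims = np.array([[0.9, 0.1, 0.3, 0.2],
                 [0.2, 0.4, 0.8, 0.1],   # ground truth only ranked 2nd here
                 [0.1, 0.2, 0.3, 0.9]])
print(recall_at_k(sims, gt_index=[0, 1, 3], k=1))  # 0.666... (2 of 3 hit)
print(recall_at_k(sims, gt_index=[0, 1, 3], k=5))  # 1.0
```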
