Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Yatai Ji, RongCheng Tu, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu

Published: 2022-11-24 · CVPR 2023

Tasks: Question Answering, Image-text Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Masked Language Modeling, cross-modal alignment, Retrieval, Visual Question Answering (VQA), Language Modelling, Visual Question Answering
Links: Paper · PDF · Code (official)

Abstract

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct correspondences across modalities. To this end, inspired by the success of masked language modeling (MLM) in NLP pre-training, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to reconstruct the masked tokens from the visible context, thereby learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits the cross-modal alignment ability of the global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task completes the missing semantics of the masked data by capturing the corresponding information from the other modality, promoting the learning of more representative global features, which in turn strongly affect downstream performance. Moreover, we present a flexible vision encoder that enables our model to handle both image-text and video-text multimodal tasks. Experimental results show that the proposed method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
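To make the SCL idea concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. It assumes SCL recovers the global ([CLS]) feature of a masked caption by cross-attending to the video's local tokens, then aligns the recovered feature with the unmasked caption's global feature through an in-batch contrastive loss; the module names (SemanticCompletionHead, scl_loss), dimensions, and the exact loss form are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticCompletionHead(nn.Module):
    """Recovers the global semantics of a masked input by cross-attending
    to local tokens from the other modality (hypothetical module)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, masked_global: torch.Tensor, other_local: torch.Tensor) -> torch.Tensor:
        # masked_global: (B, 1, D) global token of the masked modality
        # other_local:   (B, N, D) local tokens of the complementary modality
        completed, _ = self.cross_attn(masked_global, other_local, other_local)
        return self.proj(completed.squeeze(1))          # (B, D)

def scl_loss(completed: torch.Tensor, target_global: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # In-batch contrastive loss: pull each completed global feature toward
    # the global feature of its own unmasked input, push it from the others.
    completed = F.normalize(completed, dim=-1)
    target = F.normalize(target_global, dim=-1)
    logits = completed @ target.t() / temperature       # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
B, N, D = 8, 16, 256
head = SemanticCompletionHead(dim=D)
masked_text_global = torch.randn(B, 1, D)    # [CLS] of the masked caption
video_local = torch.randn(B, N, D)           # frame/patch tokens of the video
unmasked_text_global = torch.randn(B, D)     # [CLS] of the unmasked caption
loss = scl_loss(head(masked_text_global, video_local), unmasked_text_global)
print(loss.item())
```

The abstract states that SCL captures corresponding information "from the other modality", so the same head would presumably also run in the opposite direction, with a masked video's global token attending to the caption's text tokens.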

Results

Task                      | Dataset | Metric             | Value | Model
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1  | 30.9  | Yatai Ji et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5  | 54.4  | Yatai Ji et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 65.0  | Yatai Ji et al.
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@1  | 17.2  | Yatai Ji et al.
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@5  | 32.4  | Yatai Ji et al.
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@10 | 39.1  | Yatai Ji et al.
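For readers unfamiliar with the metric, R@K in the table is standard retrieval recall: the percentage of text queries whose ground-truth video appears among the top K candidates ranked by similarity. A small NumPy sketch of how it is typically computed (the function name and the convention that text i matches video i are assumptions for illustration):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-video Recall@K: percentage of text queries whose matching
    video (assumed to be video i for text i) ranks in the top k.
    sim[i, j] = similarity score between text i and video j."""
    order = (-sim).argsort(axis=1)                       # best video first
    hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean()) * 100.0                    # reported in percent

# Toy example: 5 texts vs. 5 videos with random similarity scores.
rng = np.random.default_rng(0)
sim = rng.normal(size=(5, 5))
print(recall_at_k(sim, 1), recall_at_k(sim, 5))
```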

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)