Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi

2021-12-17 · CVPR 2022

Tasks: Video Retrieval · Zero-Shot Video Retrieval · Cross-Modal Alignment · Entity Alignment · Retrieval · Visual Question Answering (VQA)

Paper · PDF · Code (official)

Abstract

Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually-grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly-selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models are available at https://github.com/salesforce/ALPRO.
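To make the two pre-training objectives concrete, here is a minimal pure-Python sketch of the ideas the abstract describes: an instance-level video-text contrastive (InfoNCE-style) loss over a batch of paired features, and the softmax-normalized prompt similarities that PEM uses as entity pseudo-labels. The function names (`vtc_loss`, `pem_pseudo_labels`) and the temperature value are illustrative assumptions, not the actual ALPRO implementation (which is in PyTorch; see the linked repository).

```python
import math

def _normalize(v):
    # project a feature vector onto the unit sphere
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vtc_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch: the matched (video_i, text_i)
    pair is the positive; all other pairings in the batch are negatives."""
    v = [_normalize(f) for f in video_feats]
    t = [_normalize(f) for f in text_feats]
    n = len(v)
    # temperature-scaled similarity matrix, sim[i][j] = <v_i, t_j> / tau
    sim = [[_dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def xent_diag(logits):
        # cross-entropy with the diagonal (matched pair) as the target
        loss = 0.0
        for i, row in enumerate(logits):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(logits)

    sim_t = [[sim[j][i] for j in range(n)] for i in range(n)]
    # average the video-to-text and text-to-video directions
    return 0.5 * (xent_diag(sim) + xent_diag(sim_t))

def pem_pseudo_labels(crop_feat, prompt_feats, temperature=0.07):
    """Soft pseudo-labels for PEM: softmax over the similarities between
    a video-crop feature and entity-prompt features (hypothetical sketch)."""
    c = _normalize(crop_feat)
    sims = [_dot(c, _normalize(p)) / temperature for p in prompt_feats]
    m = max(sims)
    z = sum(math.exp(s - m) for s in sims)
    return [math.exp(s - m) / z for s in sims]
```

With aligned features the contrastive loss is near zero, and it grows as matched pairs drift apart; the PEM targets are a probability distribution over entity prompts, so the model is trained to match soft scores rather than hard detector labels.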

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Video Retrieval | DiDeMo | text-to-video R@1 | 35.9 | ALPRO
Video Retrieval | DiDeMo | text-to-video R@5 | 67.5 | ALPRO
Video Retrieval | DiDeMo | text-to-video R@10 | 78.8 | ALPRO
Video Retrieval | DiDeMo | text-to-video Median Rank | 3 | ALPRO
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.421 | ALPRO
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.459 | ALPRO
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 24.1 | ALPRO
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 44.7 | ALPRO
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 55.4 | ALPRO
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 8 | ALPRO
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 23.8 | ALPRO
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 47.3 | ALPRO
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 57.9 | ALPRO
Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 6 | ALPRO
