
Bridging Video-text Retrieval with Multiple Choice Questions

Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

Published: 2022-01-13 · CVPR 2022
Tasks: Video Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Text-to-video search, Text to Video Retrieval, Zero-Shot Action Recognition, Action Recognition, Retrieval, Multiple-choice, Linear evaluation, Video to Text Retrieval
Links: Paper · PDF · Code (official) · Code

Abstract

Pre-training a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to model video-text interactions, but this results in low efficiency since every text-video pair must be fed through the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, rendering an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets with different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
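
To make the pretext task concrete, here is a minimal sketch of the MCQ idea. Everything below is an illustrative assumption rather than the authors' implementation: module names, dimensions, and the InfoNCE-style objective are stand-ins. A caption with a noun or verb erased plays the role of the "question"; a cross-attention module queries the video tokens, and its pooled output (the "answer") is trained to match an embedding of the erased phrase.

```python
# Illustrative sketch of the MCQ pretext task (shapes, names, and the
# loss are assumptions for exposition, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeSketch(nn.Module):
    """Text 'question' tokens cross-attend to video tokens to form an 'answer'."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, question_tokens, video_tokens):
        # The question (caption with a noun/verb erased) queries the video.
        answer, _ = self.attn(question_tokens, video_tokens, video_tokens)
        return self.proj(answer.mean(dim=1))      # pooled "answer" embedding

def mcq_loss(answers, erased_phrase_emb, temperature=0.07):
    """InfoNCE-style objective: each answer should match its own erased phrase."""
    a = F.normalize(answers, dim=-1)
    t = F.normalize(erased_phrase_emb, dim=-1)
    logits = a @ t.t() / temperature
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

# Toy batch: 8 clips, 12 question tokens, 32 video tokens, width 256.
questions = torch.randn(8, 12, 256)
videos = torch.randn(8, 32, 256)
phrases = torch.randn(8, 256)
print(mcq_loss(BridgeSketch()(questions, videos), phrases).item())
```

Because the bridge module only connects the two encoders during pre-training, it can be discarded afterwards, which is why retrieval keeps the dual-encoder cost of one dot product per text-video pair.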

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 3 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 37.6 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 75.1 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 64.8 | BridgeFormer
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 7 | BridgeFormer (Zero-shot)
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 26 | BridgeFormer (Zero-shot)
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 56.4 | BridgeFormer (Zero-shot)
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 46.4 | BridgeFormer (Zero-shot)
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 7 | Y. Ge et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 26 | Y. Ge et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 56.4 | Y. Ge et al.
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 46.4 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 43.6 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 84.9 | Y. Ge et al.
Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 74.9 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 5 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 25.6 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 61.1 | Y. Ge et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 50.6 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 42 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 12.2 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 32.2 | Y. Ge et al.
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 25.9 | Y. Ge et al.
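
The recall and median-rank figures above follow the standard text-to-video retrieval protocol: rank every candidate video for each caption by similarity and record where the ground-truth video lands. A minimal NumPy sketch of that evaluation (the random similarity matrix is purely illustrative):

```python
# Standard text-to-video retrieval metrics from a caption-by-video
# similarity matrix. Row i is caption i; the ground-truth video for
# caption i is assumed to sit in column i (the diagonal).
import numpy as np

def retrieval_metrics(sim):
    order = np.argsort(-sim, axis=1)              # videos by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1    # rank of the correct video (1 = best)
    return {
        "R@1": 100 * np.mean(ranks <= 1),
        "R@5": 100 * np.mean(ranks <= 5),
        "R@10": 100 * np.mean(ranks <= 10),
        "Median Rank": float(np.median(ranks)),
    }

# Illustrative only: random scores over a 1,000-pair split (the size of MSR-VTT 1k-A).
sim = np.random.randn(1000, 1000)
print(retrieval_metrics(sim))
```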
