Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo

2022-04-26
Tasks: Video Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Text to Video Retrieval, Zero-Shot Action Recognition, Action Recognition, Retrieval, Video to Text Retrieval
Paper · PDF · Code (official)

Abstract

Dominant pre-training work for video-text retrieval mainly adopts the "dual-encoder" architecture to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to address this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover the text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances both the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also surpasses the baseline models significantly on zero-shot action recognition, which can be cast as video-to-text retrieval.
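
As a rough illustration of the mechanism the abstract describes, the sketch below shows masked visual modeling with a snapshot (EMA) encoder acting as an evolving tokenizer, in PyTorch. Everything here is an assumption for illustration: the module names and sizes, the zero-masking scheme, and the cosine reconstruction loss are not taken from the official code, and the language-semantics injection (aligning the snapshot encoder's outputs with text) is omitted.

```python
# Minimal sketch of snapshot-encoder masked visual modeling (not the authors' code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoEncoder(nn.Module):
    """Stand-in transformer over (batch, patches, dim) tokens; sizes are made up."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):
        return self.blocks(tokens)

@torch.no_grad()
def ema_update(snapshot, online, momentum=0.99):
    # The snapshot "tokenizer" slowly trails the online encoder.
    for ps, po in zip(snapshot.parameters(), online.parameters()):
        ps.mul_(momentum).add_(po, alpha=1.0 - momentum)

def masked_modeling_loss(online, snapshot, patches, mask_ratio=0.5):
    B, N, D = patches.shape
    n_mask = int(N * mask_ratio)
    idx = torch.rand(B, N).argsort(dim=1)              # random patch order
    masked = idx[:, :n_mask]                           # indices to corrupt
    corrupted = patches.clone()
    corrupted.scatter_(1, masked.unsqueeze(-1).expand(-1, -1, D),
                       torch.zeros(B, n_mask, D))      # simple zero-masking

    with torch.no_grad():                              # targets from the full video
        targets = snapshot(patches)
    preds = online(corrupted)                          # predict from visible context

    gather = masked.unsqueeze(-1).expand(-1, -1, D)
    # Cosine-style loss on masked positions only (one plausible choice).
    p = F.normalize(preds.gather(1, gather), dim=-1)
    t = F.normalize(targets.gather(1, gather), dim=-1)
    return (2 - 2 * (p * t).sum(-1)).mean()

online = TinyVideoEncoder()
snapshot = copy.deepcopy(online)
for p in snapshot.parameters():
    p.requires_grad_(False)

patches = torch.randn(2, 32, 256)                      # (batch, patches, dim)
loss = masked_modeling_loss(online, snapshot, patches)
loss.backward()
ema_update(snapshot, online)
```

The key design point, per the abstract, is that the reconstruction targets evolve with training: because the snapshot trails the online encoder, the targets stay aligned with the improving (text-aligned) representation rather than being fixed up front.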

Results

Task                      | Dataset | Metric                    | Value | Model
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1         | 26.1  | MILES
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5         | 47.2  | MILES
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10        | 56.9  | MILES
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 7     | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video R@1         | 44.4  | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video R@5         | 76.2  | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video R@10        | 87    | MILES
Zero-Shot Video Retrieval | MSVD    | text-to-video Median Rank | 2     | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video R@1         | 27.2  | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video R@5         | 50.3  | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video R@10        | 63.6  | MILES
Zero-Shot Video Retrieval | DiDeMo  | text-to-video Median Rank | 5     | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@1         | 11.1  | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@5         | 24.7  | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video R@10        | 30.6  | MILES
Zero-Shot Video Retrieval | LSMDC   | text-to-video Median Rank | 50.7  | MILES
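
For reference, R@K and Median Rank above are the standard retrieval metrics computed from a text-video similarity matrix: R@K is the percentage of text queries whose ground-truth video appears in the top K ranked results, and Median Rank is the median position of the ground-truth video. A minimal sketch, assuming query i's correct video is index i:

```python
import torch

def retrieval_metrics(sim):
    """sim: (Q, V) text-to-video similarities; query i's ground truth is video i."""
    order = sim.argsort(dim=1, descending=True)        # videos ranked per query
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    ranks = (order == gt).int().argmax(dim=1) + 1      # 1-based rank of the match
    metrics = {f"R@{k}": 100.0 * (ranks <= k).float().mean().item() for k in (1, 5, 10)}
    metrics["Median Rank"] = ranks.float().median().item()
    return metrics

# Toy example: random scores over a 1000x1000 gallery (e.g. the size of MSR-VTT's 1k test split).
print(retrieval_metrics(torch.randn(1000, 1000)))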

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)