
Use What You Have: Video Retrieval Using Representations From Collaborative Experts

Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman

2019-07-31 · Tasks: Video Retrieval, Specificity, Retrieval, Natural Language Queries

Abstract

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets 'in the wild' vary a lot in terms of degree of specificity, with some queries describing specific details such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condense the multi-modal, extremely high dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this we exploit existing knowledge in the form of pre-trained semantic embeddings which include 'general' features such as motion, appearance, and scene features from visual content. We also explore the use of more 'specific' cues from ASR and OCR which are intermittently available for videos and find that these signals remain challenging to use effectively for retrieval. We propose a collaborative experts model to aggregate information from these different pre-trained experts and assess our approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/. This paper contains a correction to results reported in the previous version.
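
To make the aggregation step concrete, below is a minimal PyTorch sketch of gated multi-expert fusion in the spirit of the collaborative experts model. It is not the authors' released implementation: the module name, the shared embedding size, and the `availability` mask for intermittently present experts (such as ASR and OCR) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedExpertFusion(nn.Module):
    """Illustrative gated fusion of pre-trained 'expert' embeddings
    (e.g. appearance, motion, scene, ASR, OCR) into a single video
    embedding. A sketch of the idea, not the paper's exact architecture."""

    def __init__(self, expert_dims, shared_dim=512):
        super().__init__()
        # Project each expert into a shared space.
        self.projections = nn.ModuleList(
            [nn.Linear(d, shared_dim) for d in expert_dims]
        )
        # Gate each expert with a signal computed from all experts jointly,
        # so the experts "collaborate" in deciding what to keep.
        n = len(expert_dims)
        self.gates = nn.ModuleList(
            [nn.Linear(n * shared_dim, shared_dim) for _ in range(n)]
        )

    def forward(self, expert_feats, availability):
        # expert_feats: list of (batch, dim_i) tensors, one per expert.
        # availability: (batch, n_experts) float mask; 0 where an expert is
        # missing for a video (speech / on-screen text are only sometimes there).
        projected = [
            proj(x) * availability[:, i:i + 1]
            for i, (proj, x) in enumerate(zip(self.projections, expert_feats))
        ]
        context = torch.cat(projected, dim=1)  # all experts, concatenated
        gated = [
            torch.sigmoid(g(context)) * feat
            for g, feat in zip(self.gates, projected)
        ]
        # Average over the experts actually present for each video.
        denom = availability.sum(dim=1, keepdim=True).clamp(min=1.0)
        video_emb = torch.stack(gated, dim=1).sum(dim=1) / denom
        return F.normalize(video_emb, dim=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Three hypothetical experts: appearance, motion, ASR text embedding.
    dims = [2048, 1024, 300]
    feats = [torch.randn(4, d) for d in dims]
    avail = torch.tensor([[1., 1., 1.],
                          [1., 1., 0.],   # second video has no speech
                          [1., 1., 1.],
                          [1., 1., 0.]])
    video_emb = GatedExpertFusion(dims)(feats, avail)   # (4, 512)
    text_emb = F.normalize(torch.randn(4, 512), dim=1)  # stand-in queries
    sims = text_emb @ video_emb.t()  # rank videos per query by dot product
```

With both embeddings L2-normalised, retrieval reduces to ranking videos by the dot product against the query embedding; the mask is one simple way to handle the abstract's observation that 'specific' cues are only intermittently available.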

Results

All results below are for the Collaborative Experts model on the Video Retrieval task; "–" marks metrics not reported for that benchmark.

| Dataset | Direction | R@1 | R@5 | R@10 | R@50 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|---|
| MSR-VTT-1kA | text-to-video | 20.9 | 48.8 | 62.4 | – | 6 | 28.2 |
| ActivityNet | text-to-video | 20.5 | 47.7 | 63.9 | 91.4 | 6 | 23.1 |
| DiDeMo | text-to-video | 16.1 | 41.1 | 54.4 | 82.7 | 8.3 | 43.7 |
| MSR-VTT | text-to-video | 10 | 29 | 41.2 | – | 16 | 86.8 |
| MSR-VTT | video-to-text | 15.6 | 40.9 | 55.2 | – | 8.3 | 38.1 |
| LSMDC | text-to-video | 11.2 | 26.9 | 34.8 | – | 25 | – |
| MSVD | text-to-video | 19.8 | 49 | 63.8 | 89 | 6 | 23.1 |
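
For reference, R@K in the table is the percentage of queries whose ground-truth video appears in the top K retrieved results, and median/mean rank summarise where the ground-truth match lands. A minimal sketch of computing these from a query-by-video similarity matrix, assuming the usual convention that the i-th query is paired with the i-th video (an illustration, not the benchmarks' official evaluation code):

```python
import numpy as np


def retrieval_metrics(sim):
    """sim: (n_queries, n_videos) similarity matrix with sim[i, i]
    scoring query i against its ground-truth video."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # videos sorted by descending similarity
    # Position of the ground-truth video in each query's ranking (1 = best).
    ranks = np.where(order == np.arange(n)[:, None])[1] + 1
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "R@50": 100.0 * np.mean(ranks <= 50),
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((1000, 1000))
    sim[np.arange(1000), np.arange(1000)] += 3.0  # boost the true pairs
    print(retrieval_metrics(sim))
```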

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)