Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Language-Visual Embedding for Movie Understanding with Natural-Language

Atousa Torabi, Niket Tandon, Leonid Sigal

2016-09-26 · Video Retrieval · Retrieval · Multiple-choice
Paper · PDF

Abstract

Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural-language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval; 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural-language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk and b) automatically generated human-activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge-mining model. Our best model achieves Recall@10 of 19.2% on the annotation task and 18.9% on the video retrieval task for a subset of 1000 samples. On the multiple-choice test, our best model achieves 58.11% accuracy over the whole LSMDC16 public test set.
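The paper does not include code here, but the core idea of a joint language-visual embedding can be sketched: project video features (e.g. FC7 activations) and sentence features (e.g. an LSTM state) into a shared space, score pairs by cosine similarity, and train with a margin-based ranking loss. The projection sizes, initialization, and loss details below are illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """Project features into the shared space and L2-normalize.
    A plain linear projection is assumed here for simplicity."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Toy stand-ins for visual (FC7-like) and sentence (LSTM-like) features.
video_feats = rng.standard_normal((4, 512))
text_feats = rng.standard_normal((4, 300))
Wv = rng.standard_normal((512, 128)) * 0.01  # hypothetical projection weights
Wt = rng.standard_normal((300, 128)) * 0.01

V = embed(video_feats, Wv)
T = embed(text_feats, Wt)
S = V @ T.T  # cosine similarities: S[i, j] = sim(video i, caption j)

# Margin-based ranking loss: each mismatched caption should score at least
# `margin` below the matching one (a common choice for such embeddings).
margin = 0.2
pos = np.diag(S)[:, None]
loss = np.maximum(0.0, margin + S - pos)
np.fill_diagonal(loss, 0.0)  # no penalty for the ground-truth pair itself
print(loss.mean())
```

In a real system the projections would be trained by gradient descent on this loss over a large caption-video corpus; the sketch only shows the forward scoring and loss computation.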

Results

Task            | Dataset | Metric                     | Value | Model
Video Retrieval | MSR-VTT | text-to-video Median Rank  | 55    | C+LSTM+SA+FC7
Video Retrieval | MSR-VTT | text-to-video R@1          | 4.2   | C+LSTM+SA+FC7
Video Retrieval | MSR-VTT | text-to-video R@10         | 19.9  | C+LSTM+SA+FC7
Video Retrieval | MSR-VTT | video-to-text R@5          | 12.9  | C+LSTM+SA+FC7
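The Recall@K and Median Rank numbers above are standard retrieval metrics computed from a query-by-candidate similarity matrix. A minimal sketch, assuming the ground-truth pairing lies on the diagonal (the matrix values here are synthetic, not the paper's scores):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K (percent) and median rank from a similarity matrix where
    sim[i, j] scores query i against candidate j and the correct match
    for query i is candidate i."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted best-first per query
    # Rank of the correct candidate for each query (1 = retrieved first).
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])
    recalls = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(np.median(ranks))

# Synthetic example: boost the diagonal so matches tend to rank well.
rng = np.random.default_rng(1)
sim = rng.random((100, 100))
sim[np.arange(100), np.arange(100)] += 0.5
recalls, median_rank = retrieval_metrics(sim)
print(recalls, median_rank)
```

Text-to-video and video-to-text directions simply swap which axis holds the queries (i.e. apply the same function to the transposed matrix).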

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)