Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

Date: 2023-10-07
Tasks: Automatic Speech Recognition, Video Retrieval, Zero-Shot Video Retrieval, Video Captioning, Zero-Shot Video-Audio Retrieval

Abstract

Instructional videos are a common source for learning text-video or even multimodal representations by leveraging subtitles extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale. Specifically, we prompt an LLM to create plausible video captions based on ASR subtitles of instructional videos. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence. We further prompt the LLM to generate timestamps for each produced caption based on the timestamps of the subtitles and finally align the generated captions to the video temporally. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve the performance over many different benchmark datasets for zero-shot text-video retrieval and video captioning, but also lead to a disentangling of textual narration from the audio, boosting the performance in text-video-audio tasks.
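The abstract describes prompting an LLM with a longer run of timestamped subtitles rather than a single sentence. The following is a minimal sketch of that idea only, not the authors' implementation: the `Subtitle` type, the `group_subtitles` and `build_prompt` helpers, the 200-word block size, and the prompt wording are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # ASR transcript for this segment


def group_subtitles(subs: List[Subtitle], max_words: int = 200) -> List[List[Subtitle]]:
    """Pack consecutive subtitles into blocks of at most max_words words.

    The block size is an assumed parameter, not a value taken from the paper.
    """
    blocks: List[List[Subtitle]] = []
    current: List[Subtitle] = []
    count = 0
    for sub in subs:
        n_words = len(sub.text.split())
        # Start a new block when adding this subtitle would exceed the budget.
        if current and count + n_words > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sub)
        count += n_words
    if current:
        blocks.append(current)
    return blocks


def build_prompt(block: List[Subtitle]) -> str:
    """Format one block as a prompt; the wording here is hypothetical."""
    lines = [f"{int(sub.start)}s: {sub.text}" for sub in block]
    return (
        "Below are timestamped speech transcripts from an instructional video. "
        "Write short captions describing what is likely shown, "
        "each prefixed with a timestamp.\n\n" + "\n".join(lines)
    )
```

Each prompt would then be sent to an LLM of choice, and the timestamps in the response used to place the generated captions on the video timeline; that step is left out because the source does not specify the model interface.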

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Captioning | MSR-VTT | BLEU-4 | 49.8 | HowToCaption |
| Video Captioning | MSR-VTT | CIDEr | 65.3 | HowToCaption |
| Video Captioning | MSR-VTT | METEOR | 32.2 | HowToCaption |
| Video Captioning | MSR-VTT | ROUGE-L | 66.3 | HowToCaption |
| Video Captioning | YouCook2 | BLEU-4 | 8.8 | HowToCaption |
| Video Captioning | YouCook2 | CIDEr | 116.4 | HowToCaption |
| Video Captioning | YouCook2 | METEOR | 15.9 | HowToCaption |
| Video Captioning | YouCook2 | ROUGE-L | 37.3 | HowToCaption |
| Video Captioning | MSVD | BLEU-4 | 70.4 | HowToCaption |
| Video Captioning | MSVD | CIDEr | 154.2 | HowToCaption |
| Video Captioning | MSVD | METEOR | 46.4 | HowToCaption |
| Video Captioning | MSVD | ROUGE-L | 83.2 | HowToCaption |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 1 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 50 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 81.4 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 73.2 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 3 | HowToCaption |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 37.6 | HowToCaption |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 73.3 | HowToCaption |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 62 | HowToCaption |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 1 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 54.8 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 87.2 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 80.9 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 2 | HowToCaption |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 44.5 | HowToCaption |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 82.1 | HowToCaption |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 73.3 | HowToCaption |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 7 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 27.7 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 54.6 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 46.5 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | LSMDC | text-to-video Median Rank | 29 | HowToCaption |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 17.3 | HowToCaption |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 38.6 | HowToCaption |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 31.7 | HowToCaption |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video Median Rank | 8 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 19.7 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 53.9 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 43.6 | VAST, HowToCaption-finetuned |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video Median Rank | 15 | HowToCaption |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 13.4 | HowToCaption |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 44.1 | HowToCaption |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 33.1 | HowToCaption |
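
The retrieval rows above report R@1, R@5, R@10 (higher is better) and median rank (lower is better). As a reading aid, here is a minimal sketch of how such metrics are commonly computed from a text-to-video similarity matrix. It assumes query i matches video i and breaks ties optimistically; the function names are illustrative, and this is not the evaluation code used for the numbers above.

```python
from statistics import median
from typing import List, Sequence


def retrieval_ranks(sim: Sequence[Sequence[float]]) -> List[int]:
    """Return the 1-based rank of the correct video for each text query.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median rank of the correct video over all queries."""
    return median(ranks)
```

For example, with three queries whose correct videos are ranked 1, 3, and 1, R@1 is about 66.7 and the median rank is 1.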

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR (2025-06-25)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)