HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

2023-10-07

Tasks: Automatic Speech Recognition, Video Retrieval, Zero-Shot Video Retrieval, Video Captioning, Zero-Shot Video-Audio Retrieval

Abstract

Instructional videos are a common source for learning text-video or even multimodal representations by leveraging subtitles extracted with automatic speech recognition (ASR) systems from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale. Specifically, we prompt an LLM to create plausible video captions based on ASR subtitles of instructional videos. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence. We further prompt the LLM to generate timestamps for each produced caption based on the timestamps of the subtitles and finally align the generated captions to the video temporally. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve performance on many benchmark datasets for zero-shot text-video retrieval and video captioning, but also lead to a disentangling of textual narration from the audio, boosting the performance in text-video-audio tasks.
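
The prompting idea in the abstract (feed the LLM a longer run of timestamped subtitles, ask for captions that carry their own timestamps) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the `Ns: text` line format, the 200-word block size, and the function names `group_subtitles`, `build_prompt`, and `parse_captions` are all assumptions made for this example. The LLM call itself is left out, and the final temporal alignment step is not covered.

```python
import re
from typing import List, Tuple

# One ASR subtitle: (start_seconds, end_seconds, text). Hypothetical layout.
Subtitle = Tuple[float, float, str]


def group_subtitles(subs: List[Subtitle], max_words: int = 200) -> List[List[Subtitle]]:
    """Pack consecutive subtitles into blocks of at most max_words words,
    so the LLM sees context beyond a single sentence."""
    blocks: List[List[Subtitle]] = []
    current: List[Subtitle] = []
    count = 0
    for sub in subs:
        n = len(sub[2].split())
        # Start a new block when adding this subtitle would exceed the budget.
        if current and count + n > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sub)
        count += n
    if current:
        blocks.append(current)
    return blocks


def build_prompt(block: List[Subtitle]) -> str:
    """Render one block as a prompt. The instruction text is a placeholder,
    not the prompt used in the paper."""
    lines = [f"{int(start)}s: {text}" for start, _end, text in block]
    instruction = (
        "Below are timestamped speech subtitles from an instructional video. "
        "Write short captions describing what is likely shown, one per line, "
        "each prefixed with an estimated timestamp in the form 'Ns: caption'."
    )
    return instruction + "\n\n" + "\n".join(lines)


# Matches lines such as "12s: A person chops an onion".
CAPTION_RE = re.compile(r"^\s*(\d+)s\s*:\s*(.+?)\s*$")


def parse_captions(response: str) -> List[Tuple[int, str]]:
    """Extract (timestamp_seconds, caption) pairs from the LLM response,
    skipping any line that does not follow the expected format."""
    captions = []
    for line in response.splitlines():
        match = CAPTION_RE.match(line)
        if match:
            captions.append((int(match.group(1)), match.group(2)))
    return captions
```

In use, each block from `group_subtitles` would be passed through `build_prompt`, sent to an LLM of choice, and the reply handed to `parse_captions`. Skipping malformed lines rather than raising keeps a large batch job running when the model occasionally ignores the format.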

Results

Task	Dataset	Metric	Value	Model
Video Captioning	MSR-VTT	BLEU-4	49.8	HowToCaption
Video Captioning	MSR-VTT	CIDEr	65.3	HowToCaption
Video Captioning	MSR-VTT	METEOR	32.2	HowToCaption
Video Captioning	MSR-VTT	ROUGE-L	66.3	HowToCaption
Video Captioning	YouCook2	BLEU-4	8.8	HowToCaption
Video Captioning	YouCook2	CIDEr	116.4	HowToCaption
Video Captioning	YouCook2	METEOR	15.9	HowToCaption
Video Captioning	YouCook2	ROUGE-L	37.3	HowToCaption
Video Captioning	MSVD	BLEU-4	70.4	HowToCaption
Video Captioning	MSVD	CIDEr	154.2	HowToCaption
Video Captioning	MSVD	METEOR	46.4	HowToCaption
Video Captioning	MSVD	ROUGE-L	83.2	HowToCaption
Zero-Shot Video Retrieval	MSR-VTT	text-to-video Median Rank	1	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	50	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	81.4	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	73.2	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSR-VTT	text-to-video Median Rank	3	HowToCaption
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	37.6	HowToCaption
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	73.3	HowToCaption
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	62	HowToCaption
Zero-Shot Video Retrieval	MSVD	text-to-video Median Rank	1	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSVD	text-to-video R@1	54.8	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSVD	text-to-video R@10	87.2	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSVD	text-to-video R@5	80.9	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	MSVD	text-to-video Median Rank	2	HowToCaption
Zero-Shot Video Retrieval	MSVD	text-to-video R@1	44.5	HowToCaption
Zero-Shot Video Retrieval	MSVD	text-to-video R@10	82.1	HowToCaption
Zero-Shot Video Retrieval	MSVD	text-to-video R@5	73.3	HowToCaption
Zero-Shot Video Retrieval	LSMDC	text-to-video Median Rank	7	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	27.7	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	54.6	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	46.5	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	LSMDC	text-to-video Median Rank	29	HowToCaption
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	17.3	HowToCaption
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	38.6	HowToCaption
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	31.7	HowToCaption
Zero-Shot Video Retrieval	YouCook2	text-to-video Median Rank	8	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	YouCook2	text-to-video R@1	19.7	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	YouCook2	text-to-video R@10	53.9	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	YouCook2	text-to-video R@5	43.6	VAST, HowToCaption-finetuned
Zero-Shot Video Retrieval	YouCook2	text-to-video Median Rank	15	HowToCaption
Zero-Shot Video Retrieval	YouCook2	text-to-video R@1	13.4	HowToCaption
Zero-Shot Video Retrieval	YouCook2	text-to-video R@10	44.1	HowToCaption
Zero-Shot Video Retrieval	YouCook2	text-to-video R@5	33.1	HowToCaption
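
For readers unfamiliar with the retrieval metrics in the table, the sketch below shows how text-to-video R@K and Median Rank are commonly computed from a query-by-video similarity matrix. It assumes that query i has exactly one correct video, video i, and that ties are resolved in favor of the correct video. The function names are illustrative, and the evaluation scripts used for the numbers above may differ in such details.

```python
from statistics import median
from typing import List, Sequence


def ranks_of_ground_truth(sim: Sequence[Sequence[float]]) -> List[int]:
    """sim[i][j] is the similarity of text query i to video j.
    Returns the 1-based rank of the correct video (index i) for each query."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median rank of the correct video over all queries (lower is better)."""
    return median(ranks)


# Toy example: 3 text queries against 3 videos.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video scores highest
    [0.8, 0.5, 0.3],  # query 1: one video outranks the correct one
    [0.2, 0.7, 0.6],  # query 2: one video outranks the correct one
]
ranks = ranks_of_ground_truth(sim)
```

Higher R@K is better, while a Median Rank of 1 means the correct video is the top result for at least half of the queries.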
