
End-to-end Generative Pretraining for Multimodal Video Captioning

Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid

2022-01-20 · CVPR 2022 · Tasks: Video Retrieval, Action Classification, Video Captioning, Video Understanding, Retrieval

Abstract

Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective -- we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
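The bidirectional generation objective can be made concrete with a short sketch. The following is a minimal, hedged PyTorch illustration, not the authors' code: the module names, the tiny Transformer stacks, and the 2048-d per-frame features are assumptions made for the example. Each direction encodes the video together with one utterance and decodes the other utterance with a teacher-forced cross-entropy loss; the two losses are summed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalCaptionPretrainer(nn.Module):
    """Toy encoder-decoder illustrating the bidirectional generation loss."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.frame_proj = nn.Linear(2048, d_model)  # assumed per-frame feature size
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def _generation_loss(self, frames, context_utt, target_utt):
        # Encode frame features jointly with one utterance (the multimodal context).
        ctx = torch.cat([self.frame_proj(frames), self.token_emb(context_utt)], dim=1)
        memory = self.encoder(ctx)
        # Decode the other utterance with teacher forcing and a causal mask.
        tgt_in, tgt_out = target_utt[:, :-1], target_utt[:, 1:]
        n = tgt_in.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        hidden = self.decoder(self.token_emb(tgt_in), memory, tgt_mask=causal)
        return F.cross_entropy(self.lm_head(hidden).transpose(1, 2), tgt_out)

    def forward(self, frames, present_utt, future_utt):
        # Forward: future utterance given (video, present utterance).
        # Backward: present utterance given (video, future utterance).
        return (self._generation_loss(frames, present_utt, future_utt)
                + self._generation_loss(frames, future_utt, present_utt))

# Dummy batch: 2 clips of 8 frames (2048-d features) and 12-token utterances.
model = BidirectionalCaptionPretrainer()
loss = model(torch.randn(2, 8, 2048),
             torch.randint(0, 1000, (2, 12)),
             torch.randint(0, 1000, (2, 12)))
loss.backward()
```

At captioning time the same encoder-decoder is reused directly, with transcribed speech playing the role of the context utterance and the decoder producing the caption, which is what makes the pretraining end-to-end for the downstream task.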

Results

| Task             | Dataset | Metric  | Value | Model  |
|------------------|---------|---------|-------|--------|
| Video Captioning | MSR-VTT | BLEU-4  | 48.9  | MV-GPT |
| Video Captioning | MSR-VTT | CIDEr   | 60    | MV-GPT |
| Video Captioning | MSR-VTT | METEOR  | 38.7  | MV-GPT |
| Video Captioning | MSR-VTT | ROUGE-L | 64    | MV-GPT |
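The four metrics above are standard captioning scores and are commonly computed with the pycocoevalcap package. A minimal sketch follows, assuming captions are already lowercased and tokenised (in practice the PTBTokenizer is run first, scores are averaged over the full MSR-VTT test split, and METEOR additionally needs a Java runtime); the video id and caption strings are placeholders.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Maps video id -> list of reference captions / single generated caption.
gts = {"video0": ["a man is slicing vegetables in a kitchen"]}
res = {"video0": ["a man is cutting vegetables"]}

bleu, _ = Bleu(4).compute_score(gts, res)     # list: [BLEU-1, ..., BLEU-4]
cider, _ = Cider().compute_score(gts, res)    # corpus-level CIDEr
rouge, _ = Rouge().compute_score(gts, res)    # ROUGE-L
meteor, _ = Meteor().compute_score(gts, res)  # requires Java on the PATH

print(f"BLEU-4 {bleu[3]:.3f}  CIDEr {cider:.3f}  "
      f"ROUGE-L {rouge:.3f}  METEOR {meteor:.3f}")
```

Note that CIDEr is only meaningful over a full test corpus (its TF-IDF weights are estimated from all references), so a single-clip score like the one above is illustrative only.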

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)