Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Learning Video Representations from Large Language Models

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

2022-12-08 · CVPR 2023

Tasks: Self-Supervised Action Recognition (Linear), Action Classification, Multi-Instance Retrieval, Egocentric Activity Recognition, Action Recognition

Links: Paper · PDF · Code (official) · Code · Code

Abstract

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, in both zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification benchmark and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
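The abstract describes learning the video-text embedding contrastively over paired clips and narrations. A common formulation of such an objective is a symmetric InfoNCE-style loss over a batch of matched pairs; the sketch below illustrates that general idea only (the temperature value, normalization, and function names are assumptions, not details taken from the LaViLa paper):

```python
# Illustrative sketch of a symmetric InfoNCE-style video-text contrastive
# loss. All specifics (temperature, normalization) are assumptions and not
# drawn from the LaViLa implementation.
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalize both sets of embeddings so similarity is cosine-based.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) pairwise similarities
    labels = np.arange(len(v))              # matched pairs lie on the diagonal

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned, well-separated embeddings the loss approaches zero; mismatched pairs in the batch serve as in-batch negatives, which is what lets the auto-generated narrations act as extra supervision.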

Results

| Task                 | Dataset           | Metric              | Value | Model                             |
|----------------------|-------------------|---------------------|-------|-----------------------------------|
| Activity Recognition | Charades-Ego      | mAP                 | 36.1  | LaViLa (Finetuned, TimeSformer-L) |
| Activity Recognition | Charades-Ego      | mAP                 | 28.9  | LaViLa (Zero-shot, TimeSformer-L) |
| Activity Recognition | EPIC-KITCHENS-100 | Action@1            | 51    | LaViLa (TimeSformer-L)            |
| Activity Recognition | EPIC-KITCHENS-100 | Noun@1              | 62.9  | LaViLa (TimeSformer-L)            |
| Activity Recognition | EPIC-KITCHENS-100 | Verb@1              | 72    | LaViLa (TimeSformer-L)            |
| Activity Recognition | EGTEA             | Average Accuracy    | 81.75 | LaViLa (Finetuned, TimeSformer-L) |
| Activity Recognition | EGTEA             | Mean class accuracy | 76    | LaViLa (Finetuned, TimeSformer-L) |
| Action Recognition   | Charades-Ego      | mAP                 | 36.1  | LaViLa (Finetuned, TimeSformer-L) |
| Action Recognition   | Charades-Ego      | mAP                 | 28.9  | LaViLa (Zero-shot, TimeSformer-L) |
| Action Recognition   | EPIC-KITCHENS-100 | Action@1            | 51    | LaViLa (TimeSformer-L)            |
| Action Recognition   | EPIC-KITCHENS-100 | Noun@1              | 62.9  | LaViLa (TimeSformer-L)            |
| Action Recognition   | EPIC-KITCHENS-100 | Verb@1              | 72    | LaViLa (TimeSformer-L)            |

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
- Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
- CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
- Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
- Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
- EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization (2025-06-17)