Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Learning Video Representations from Large Language Models

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

2022-12-08 · CVPR 2023

Tasks: Self-Supervised Action Recognition (Linear), Action Classification, Multi-Instance Retrieval, Egocentric Activity Recognition, Action Recognition

Links: Paper · PDF · Code (official) · Code · Code

Abstract

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, in both zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification benchmark and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
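The abstract describes learning the video-text embedding contrastively over paired clips and narrations. A common formulation of such an objective is a symmetric InfoNCE-style loss over a batch of matched pairs; the sketch below illustrates that general idea only (the temperature value, normalization, and function names are assumptions, not details taken from the LaViLa paper):

```python
# Illustrative sketch of a symmetric InfoNCE-style video-text contrastive
# loss. All specifics (temperature, normalization) are assumptions and not
# drawn from the LaViLa implementation.
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalize both sets of embeddings so similarity is cosine-based.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) pairwise similarities
    labels = np.arange(len(v))              # matched pairs lie on the diagonal

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned, well-separated embeddings the loss approaches zero; mismatched pairs in the batch serve as in-batch negatives, which is what lets the auto-generated narrations act as extra supervision.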

Results

| Task                 | Dataset           | Metric              | Value | Model                             |
|----------------------|-------------------|---------------------|-------|-----------------------------------|
| Activity Recognition | Charades-Ego      | mAP                 | 36.1  | LaViLa (Finetuned, TimeSformer-L) |
| Activity Recognition | Charades-Ego      | mAP                 | 28.9  | LaViLa (Zero-shot, TimeSformer-L) |
| Activity Recognition | EPIC-KITCHENS-100 | Action@1            | 51    | LaViLa (TimeSformer-L)            |
| Activity Recognition | EPIC-KITCHENS-100 | Noun@1              | 62.9  | LaViLa (TimeSformer-L)            |
| Activity Recognition | EPIC-KITCHENS-100 | Verb@1              | 72    | LaViLa (TimeSformer-L)            |
| Activity Recognition | EGTEA             | Average Accuracy    | 81.75 | LaViLa (Finetuned, TimeSformer-L) |
| Activity Recognition | EGTEA             | Mean class accuracy | 76    | LaViLa (Finetuned, TimeSformer-L) |
| Action Recognition   | Charades-Ego      | mAP                 | 36.1  | LaViLa (Finetuned, TimeSformer-L) |
| Action Recognition   | Charades-Ego      | mAP                 | 28.9  | LaViLa (Zero-shot, TimeSformer-L) |
| Action Recognition   | EPIC-KITCHENS-100 | Action@1            | 51    | LaViLa (TimeSformer-L)            |
| Action Recognition   | EPIC-KITCHENS-100 | Noun@1              | 62.9  | LaViLa (TimeSformer-L)            |
| Action Recognition   | EPIC-KITCHENS-100 | Verb@1              | 72    | LaViLa (TimeSformer-L)            |

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
- Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
- CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
- Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
- Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
- EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization (2025-06-17)