Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


YouTube-8M: A Large-Scale Video Classification Benchmark

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan

2016-09-27 · Video Classification · General Classification · Action Recognition · Action Recognition In Videos · 3D Face Reconstruction

Paper · PDF · Code

Abstract

Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale: it is now possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no video classification datasets of comparable size. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high precision and are derived from a variety of human-based signals, including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters whether the labels are visually recognizable. Then, we decoded each video at one frame per second and used a deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.
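The frame-feature pipeline the abstract describes (decode at 1 fps, extract a CNN feature per frame, classify at the video level) can be sketched as follows. This is a minimal illustration, not the paper's released code: it assumes mean pooling of frame features into a video-level vector followed by independent per-label logistic regression (one of the paper's simpler baselines), and the 1024-D feature size and random stand-in weights are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: a 120-second video decoded at one frame per
# second gives 120 frames, each reduced to a 1024-D CNN feature
# (standing in for the compressed penultimate-layer activations).
frame_features = rng.standard_normal((120, 1024)).astype(np.float32)

# Video-level representation: mean-pool the frame features.
video_feature = frame_features.mean(axis=0)

# Independent per-label logistic regression over the 4800 entities;
# W and b are random stand-ins for trained parameters.
num_classes = 4800
W = rng.standard_normal((1024, num_classes)).astype(np.float32) * 0.01
b = np.zeros(num_classes, dtype=np.float32)

logits = video_feature @ W + b
probs = 1.0 / (1.0 + np.exp(-logits))  # one sigmoid per label (multi-label)

top5 = np.argsort(-probs)[:5]          # indices of the 5 top-scored entities
```

Because the labels are multi-label rather than mutually exclusive, each entity gets its own sigmoid instead of a single softmax over the vocabulary.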

Results

| Task | Dataset | Metric | Value | Model |
| Video | YouTube-8M | Hit@1 | 70.1 | Mixture-of-2-Experts |
| Video | YouTube-8M | Hit@5 | 84.8 | Mixture-of-2-Experts |
| Video | YouTube-8M | PERR | 29.1 | Mixture-of-2-Experts |
| Activity Recognition | ActivityNet | mAP | 75.6 | LSTM + Pretrained on YT-8M |
| Activity Recognition | Sports-1M | Video hit@1 | 65.7 | LSTM + Pretrained on YT-8M |
| Activity Recognition | Sports-1M | Video hit@5 | 86.2 | LSTM + Pretrained on YT-8M |
| Action Recognition | ActivityNet | mAP | 75.6 | LSTM + Pretrained on YT-8M |
| Action Recognition | Sports-1M | Video hit@1 | 65.7 | LSTM + Pretrained on YT-8M |
| Action Recognition | Sports-1M | Video hit@5 | 86.2 | LSTM + Pretrained on YT-8M |
| Action Recognition In Videos | ActivityNet | mAP | 75.6 | LSTM + Pretrained on YT-8M |
| Action Recognition In Videos | Sports-1M | Video hit@1 | 65.7 | LSTM + Pretrained on YT-8M |
| Action Recognition In Videos | Sports-1M | Video hit@5 | 86.2 | LSTM + Pretrained on YT-8M |
| Video Classification | YouTube-8M | Hit@1 | 70.1 | Mixture-of-2-Experts |
| Video Classification | YouTube-8M | Hit@5 | 84.8 | Mixture-of-2-Experts |
| Video Classification | YouTube-8M | PERR | 29.1 | Mixture-of-2-Experts |
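The Hit@k and PERR values in the table can be computed with a short sketch of the two metrics as the paper defines them: Hit@k is the fraction of videos whose top-k scored labels contain at least one ground-truth label, and PERR (precision at equal recall rate) is the precision within each video's top-n scored labels, where n is that video's number of ground-truth labels, averaged over videos. The score arrays below are made-up illustrations.

```python
import numpy as np

def hit_at_k(scores, labels, k):
    """Fraction of videos whose top-k predictions contain at least one
    ground-truth label. `scores`: (num_videos, num_classes) array;
    `labels`: one set of ground-truth class indices per video."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [bool(set(row) & truth) for row, truth in zip(topk, labels)]
    return float(np.mean(hits))

def perr(scores, labels):
    """Precision at equal recall rate: for a video with n ground-truth
    labels, the precision within its n highest-scored predictions,
    averaged over all videos."""
    precisions = []
    for row, truth in zip(scores, labels):
        n = len(truth)
        top_n = np.argsort(-row)[:n]
        precisions.append(len(set(top_n) & truth) / n)
    return float(np.mean(precisions))

# Toy example: 2 videos, 3 classes.
scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.7]])
labels = [{0}, {1, 2}]
print(hit_at_k(scores, labels, 1), perr(scores, labels))
```

Unlike mAP, both metrics are computed per video and then averaged, which is why a model can score well on Hit@k while PERR stays low when videos carry many labels.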

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)