Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


YouTube-8M: A Large-Scale Video Classification Benchmark

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan

2016-09-27 · Video Classification · General Classification · Action Recognition · Action Recognition In Videos · 3D Face Reconstruction

Paper · PDF · Code

Abstract

Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale: it is now possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no video classification datasets of comparable size. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high precision and are derived from a variety of human-based signals, including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters whether the labels are visually recognizable. Then, we decoded each video at one frame per second and used a deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.
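The frame-feature pipeline the abstract describes (decode at 1 fps, extract a CNN feature per frame, classify at the video level) can be sketched as follows. This is a minimal illustration, not the paper's released code: it assumes mean pooling of frame features into a video-level vector followed by independent per-label logistic regression (one of the paper's simpler baselines), and the 1024-D feature size and random stand-in weights are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: a 120-second video decoded at one frame per
# second gives 120 frames, each reduced to a 1024-D CNN feature
# (standing in for the compressed penultimate-layer activations).
frame_features = rng.standard_normal((120, 1024)).astype(np.float32)

# Video-level representation: mean-pool the frame features.
video_feature = frame_features.mean(axis=0)

# Independent per-label logistic regression over the 4800 entities;
# W and b are random stand-ins for trained parameters.
num_classes = 4800
W = rng.standard_normal((1024, num_classes)).astype(np.float32) * 0.01
b = np.zeros(num_classes, dtype=np.float32)

logits = video_feature @ W + b
probs = 1.0 / (1.0 + np.exp(-logits))  # one sigmoid per label (multi-label)

top5 = np.argsort(-probs)[:5]          # indices of the 5 top-scored entities
```

Because the labels are multi-label rather than mutually exclusive, each entity gets its own sigmoid instead of a single softmax over the vocabulary.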

Results

| Task | Dataset | Metric | Value | Model |
| Video | YouTube-8M | Hit@1 | 70.1 | Mixture-of-2-Experts |
| Video | YouTube-8M | Hit@5 | 84.8 | Mixture-of-2-Experts |
| Video | YouTube-8M | PERR | 29.1 | Mixture-of-2-Experts |
| Activity Recognition | ActivityNet | mAP | 75.6 | LSTM + Pretrained on YT-8M |
| Activity Recognition | Sports-1M | Video hit@1 | 65.7 | LSTM + Pretrained on YT-8M |
| Activity Recognition | Sports-1M | Video hit@5 | 86.2 | LSTM + Pretrained on YT-8M |
| Action Recognition | ActivityNet | mAP | 75.6 | LSTM + Pretrained on YT-8M |
| Action Recognition | Sports-1M | Video hit@1 | 65.7 | LSTM + Pretrained on YT-8M |
| Action Recognition | Sports-1M | Video hit@5 | 86.2 | LSTM + Pretrained on YT-8M |
| Action Recognition In Videos | ActivityNet | mAP | 75.6 | LSTM + Pretrained on YT-8M |
| Action Recognition In Videos | Sports-1M | Video hit@1 | 65.7 | LSTM + Pretrained on YT-8M |
| Action Recognition In Videos | Sports-1M | Video hit@5 | 86.2 | LSTM + Pretrained on YT-8M |
| Video Classification | YouTube-8M | Hit@1 | 70.1 | Mixture-of-2-Experts |
| Video Classification | YouTube-8M | Hit@5 | 84.8 | Mixture-of-2-Experts |
| Video Classification | YouTube-8M | PERR | 29.1 | Mixture-of-2-Experts |
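The Hit@k and PERR values in the table can be computed with a short sketch of the two metrics as the paper defines them: Hit@k is the fraction of videos whose top-k scored labels contain at least one ground-truth label, and PERR (precision at equal recall rate) is the precision within each video's top-n scored labels, where n is that video's number of ground-truth labels, averaged over videos. The score arrays below are made-up illustrations.

```python
import numpy as np

def hit_at_k(scores, labels, k):
    """Fraction of videos whose top-k predictions contain at least one
    ground-truth label. `scores`: (num_videos, num_classes) array;
    `labels`: one set of ground-truth class indices per video."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [bool(set(row) & truth) for row, truth in zip(topk, labels)]
    return float(np.mean(hits))

def perr(scores, labels):
    """Precision at equal recall rate: for a video with n ground-truth
    labels, the precision within its n highest-scored predictions,
    averaged over all videos."""
    precisions = []
    for row, truth in zip(scores, labels):
        n = len(truth)
        top_n = np.argsort(-row)[:n]
        precisions.append(len(set(top_n) & truth) / n)
    return float(np.mean(precisions))

# Toy example: 2 videos, 3 classes.
scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.7]])
labels = [{0}, {1, 2}]
print(hit_at_k(scores, labels, 1), perr(scores, labels))
```

Unlike mAP, both metrics are computed per video and then averaged, which is why a model can score well on Hit@k while PERR stays low when videos carry many labels.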

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)