Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications

Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, Krzysztof Chalupka

Published: 2020-03-03 (CVPR 2020)
Tasks: Benchmarking, Zero-Shot Action Recognition, Video Classification, General Classification, Zero-Shot Learning


Abstract

Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at github.com/bbrattoli/ZeroShotVideoClassification.
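The abstract does not say how a trained model assigns a video to a class it never saw during training. One common ZSL scheme, shown here only as a minimal sketch and not as the authors' confirmed method, maps each video into a semantic embedding space and picks the class whose embedding is closest. All names, class labels, and vectors below are hypothetical toy values; in practice the video embedding would come from the visual network and the class embeddings from some text representation of the class names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(video_embedding, class_embeddings):
    """Return the class name whose embedding is most similar to the video's."""
    return max(
        class_embeddings,
        key=lambda name: cosine_similarity(video_embedding, class_embeddings[name]),
    )


def top1_accuracy(predictions, labels):
    """Fraction of videos whose predicted class equals the true class."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical embeddings for test classes that were absent from training.
class_embeddings = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "surfing": [0.0, 0.0, 1.0],
}

# Hypothetical outputs of a visual model for four test videos.
video_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.8],
    [0.1, 0.7, 0.3],
    [0.6, 0.5, 0.1],
]
labels = ["archery", "surfing", "bowling", "bowling"]

predictions = [zero_shot_predict(v, class_embeddings) for v in video_embeddings]
accuracy = top1_accuracy(predictions, labels)
print(predictions, accuracy)
```

Because the test classes enter only through `class_embeddings`, the same model can be evaluated on a new label set without retraining, which is the property the Top-1 Accuracy metric in the results measures.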

Results

Task	Dataset	Metric	Value	Model
Zero-Shot Action Recognition	UCF101	Top-1 Accuracy	48	E2E
Zero-Shot Action Recognition	HMDB51	Top-1 Accuracy	32.7	E2E
Zero-Shot Action Recognition	ActivityNet	Top-1 Accuracy	26.6	E2E
