Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Efficient Video Classification Using Fewer Frames

Shweta Bhardwaj, Mukundhan Srinivasan, Mitesh M. Khapra

2019-02-27 · CVPR 2019
Tasks: Clustering · Video Classification · General Classification · Classification
Paper · PDF · Code

Abstract

Recently, there has been a lot of interest in building compact models for video classification which have a small memory footprint (<1 GB). While these models are compact, they typically operate by repeated application of a small weight matrix to all the frames in a video. For example, recurrent neural network based methods compute a hidden state for every frame of the video using a recurrent weight matrix. Similarly, cluster-and-aggregate based methods such as NetVLAD have a learnable clustering matrix which is used to assign soft clusters to every frame in the video. Since these models look at every frame in the video, the number of floating point operations (FLOPs) is still large even though the memory footprint is small. We focus on building compute-efficient video classification models which process fewer frames and hence require fewer FLOPs. Similar to memory-efficient models, we use the idea of distillation, albeit in a different setting. Specifically, in our case, a compute-heavy teacher which looks at all the frames in the video is used to train a compute-efficient student which looks at only a small fraction of frames in the video. This is in contrast to a typical memory-efficient teacher-student setting, wherein both the teacher and the student look at all the frames in the video but the student has fewer parameters. Our work thus complements the research on memory-efficient video classification. We do an extensive evaluation with three types of models for video classification, viz. (i) recurrent models, (ii) cluster-and-aggregate models, and (iii) memory-efficient cluster-and-aggregate models, and show that in each of these cases, a see-it-all teacher can be used to train a compute-efficient see-very-little student. We show that the proposed student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with a negligible drop in performance.
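The teacher-student setup described in the abstract can be sketched in a few lines: a teacher scores every frame, a student scores only a small uniformly sampled fraction, and a distillation loss pulls the student's predictions toward the teacher's soft targets. This is a minimal illustrative sketch, not the paper's actual models — mean-pooled per-frame logits stand in for the real networks, and the uniform sampling policy, function names, and shapes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_logits(frames):
    # Stand-in for a compute-heavy model that processes every frame.
    return frames.mean(axis=0)

def student_logits(frames, keep_fraction=0.1):
    # The student sees only a small fraction of frames,
    # sampled uniformly along the video (hypothetical policy).
    n = frames.shape[0]
    k = max(1, int(n * keep_fraction))
    idx = np.linspace(0, n - 1, k).astype(int)
    return frames[idx].mean(axis=0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(t_logits, s_logits):
    # Cross-entropy of student predictions against the teacher's
    # soft targets -- the standard distillation objective.
    p_t, p_s = softmax(t_logits), softmax(s_logits)
    return -np.sum(p_t * np.log(p_s + 1e-12))

# 300 frames, each mapped to 10 per-frame class logits.
video = rng.normal(size=(300, 10))
loss = distillation_loss(teacher_logits(video), student_logits(video))
```

With `keep_fraction=0.1`, the student here processes 30 of 300 frames, which is the source of the roughly 90% FLOP reduction the paper reports for its (much larger) models.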

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Video Classification | YouTube-8M | Global Average Precision | 81.1 | Hierarchical LSTM with MoE |
| Video Classification | YouTube-8M | Hit@1 | 86.8 | Hierarchical LSTM with MoE |
| Video Classification | YouTube-8M | mAP | 41.4 | Hierarchical LSTM with MoE |

Related Papers

Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Ranking Vectors Clustering: Theory and Applications (2025-07-16)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework (2025-07-11)
GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)