
Co-training Transformer with Videos and Images Improves Action Recognition

BoWen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, Fei Sha

2021-12-14 · Action Classification · Object Recognition · Video Classification · Action Recognition · Action Recognition in Videos

Abstract

In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on target action recognition with videos. This approach has achieved good empirical performance, especially with recent transformer-based video architectures. While many recent works aim to design more advanced transformer architectures for action recognition, less effort has been made on how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 Accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pretrained on larger-scale image datasets following previous state-of-the-art, CoVeR achieves the best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
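The two findings above amount to a simple recipe: keep one shared spatio-temporal backbone, give each dataset its own classification head over its own label space, and feed image batches through the same model as single-frame videos. The PyTorch sketch below illustrates that recipe under generic assumptions; it is not the authors' released code, and the names (CoTrainingModel, co_training_step, the backbone argument) are purely illustrative.

```python
# Minimal sketch of the co-training idea (illustrative, not the authors' implementation):
# one shared spatio-temporal backbone, one classification head per dataset,
# and images treated as single-frame videos so they share the video pathway.
import torch
import torch.nn as nn

class CoTrainingModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, classes_per_dataset: dict):
        super().__init__()
        self.backbone = backbone  # assumed: maps (B, T, C, H, W) -> (B, feat_dim)
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n) for name, n in classes_per_dataset.items()
        })

    def forward(self, clips: torch.Tensor, dataset: str) -> torch.Tensor:
        if clips.dim() == 4:                  # (B, C, H, W): an image batch
            clips = clips.unsqueeze(1)        # treat images as single-frame videos (B, 1, C, H, W)
        features = self.backbone(clips)       # shared representation across all datasets
        return self.heads[dataset](features)  # dataset-specific label space

def co_training_step(model, batches, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One joint step. `batches` maps dataset name -> (clips, labels)."""
    optimizer.zero_grad()
    total_loss = sum(loss_fn(model(clips, name), labels)
                     for name, (clips, labels) in batches.items())
    (total_loss / len(batches)).backward()    # average the per-dataset losses
    optimizer.step()
    return total_loss.item()
```

In this sketch, each optimization step samples one batch per dataset (video and image alike) and averages the per-dataset losses, so the shared backbone sees all label spaces jointly while each head remains dataset-specific.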

Results

Task                 | Dataset                | Metric         | Value | Model
Video                | Kinetics-700           | Top-1 Accuracy | 79.8  | CoVeR (JFT-3B)
Video                | Kinetics-700           | Top-5 Accuracy | 94.9  | CoVeR (JFT-3B)
Video                | Kinetics-700           | Top-1 Accuracy | 78.5  | CoVeR (JFT-300M)
Video                | Kinetics-700           | Top-5 Accuracy | 94.2  | CoVeR (JFT-300M)
Video                | Moments in Time (MiT)  | Top-1 Accuracy | 46.1  | CoVeR (JFT-3B)
Video                | Moments in Time (MiT)  | Top-5 Accuracy | 75.4  | CoVeR (JFT-3B)
Video                | Moments in Time (MiT)  | Top-1 Accuracy | 45    | CoVeR (JFT-300M)
Video                | Moments in Time (MiT)  | Top-5 Accuracy | 73.9  | CoVeR (JFT-300M)
Video                | Kinetics-400           | Top-1 Accuracy | 87.2  | CoVeR (JFT-3B)
Video                | Kinetics-400           | Top-5 Accuracy | 97.5  | CoVeR (JFT-3B)
Video                | Kinetics-400           | Top-1 Accuracy | 86.3  | CoVeR (JFT-300M)
Video                | Kinetics-400           | Top-5 Accuracy | 97.2  | CoVeR (JFT-300M)
Video                | Kinetics-600           | Top-1 Accuracy | 87.9  | CoVeR (JFT-3B)
Video                | Kinetics-600           | Top-5 Accuracy | 97.8  | CoVeR (JFT-3B)
Video                | Kinetics-600           | Top-1 Accuracy | 86.8  | CoVeR (JFT-300M)
Video                | Kinetics-600           | Top-5 Accuracy | 97.3  | CoVeR (JFT-300M)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 70.9  | CoVeR (JFT-3B)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 92.5  | CoVeR (JFT-3B)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 69.8  | CoVeR (JFT-300M)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.9  | CoVeR (JFT-300M)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 70.9  | CoVeR (JFT-3B)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 92.5  | CoVeR (JFT-3B)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 69.8  | CoVeR (JFT-300M)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 91.9  | CoVeR (JFT-300M)
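All entries above report Top-1 and Top-5 accuracy. For readers unfamiliar with the metric, the small helper below shows the conventional computation (a hypothetical utility, not any benchmark's official evaluation script): a sample counts as correct at rank k if its ground-truth label is among the k highest-scoring classes.

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring predictions."""
    topk = logits.topk(k, dim=1).indices              # (B, k) highest-scoring class indices
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# Example: Top-1 and Top-5 accuracy over a random batch of logits.
logits = torch.randn(8, 400)                          # e.g. Kinetics-400 has 400 classes
labels = torch.randint(0, 400, (8,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```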

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing (2025-07-08)
Out-of-distribution detection in 3D applications: a review (2025-07-01)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)