Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Action Recognition on Something-Something V2

Metric: Top-1 Accuracy (higher is better)
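
As a rough illustration of the metric, the sketch below computes top-1 accuracy from per-class scores: a clip counts as correct when its highest-scoring class equals the ground-truth label. It assumes the leaderboard values are percentages. The function name `top1_accuracy` and the toy inputs are hypothetical and are not taken from any listed paper or codebase.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose highest-scoring class matches the label.

    Hypothetical helper for illustration only. Ties are broken in favour of
    the lowest class index.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row (argmax).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: four clips, two classes; three of the four predictions are correct.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_labels))
```

Published results may also differ in evaluation protocol (number of clips and crops per video, ensembling), which this sketch does not model.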

Results

# | Model | Top-1 Accuracy | Extra Data | Paper | Date | Code
--- | --- | --- | --- | --- | --- | ---
1 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 77.3 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
2 | DejaVid | 77.2 | Yes | - | - | Code
3 | InternVideo | 77.2 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code
4 | InternVideo2-1B | 77.1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code
5 | VideoMAE V2-g | 77 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code
6 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 76.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
7 | Hiera-L (no extra data) | 76.5 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code
8 | TubeViT-L | 76.1 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code
9 | VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code
10 | Side4Video (EVA ViT-E/14) | 75.2 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code
11 | MaskFeat (Kinetics600 pretrain, MViT-L) | 75 | Yes | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code
12 | MAR (50% mask, ViT-L, 16x4) | 74.7 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
13 | ATM | 74.6 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code
14 | MAWS (ViT-L) | 74.4 | Yes | The effectiveness of MAE pre-pretraining for bil... | 2023-03-23 | Code
15 | VideoMAE (no extra data, ViT-L, 16frame) | 74.3 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code
16 | MAR (75% mask, ViT-L, 16x4) | 73.8 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
17 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 73.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
18 | ViC-MAE (ViT-L) | 73.7 | No | ViC-MAE: Self-Supervised Representation Learning... | 2023-03-21 | Code
19 | TAdaFormer-L/14 | 73.6 | Yes | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code
20 | TDS-CLIP-ViT-L/14(8frames) | 73.4 | No | TDS-CLIP: Temporal Difference Side Network for I... | 2024-08-20 | Code
21 | MViTv2-L (IN-21K + Kinetics400 pretrain) | 73.3 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code
22 | AMD(ViT-B/16) | 73.3 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | -
23 | UniFormerV2-L | 73 | Yes | - | - | Code
24 | ST-Adapter (ViT-L, CLIP) | 72.3 | Yes | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code
25 | ZeroI2V ViT-L/14 | 72.2 | Yes | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code
26 | MViT-B (IN-21K + Kinetics400 pretrain) | 72.1 | Yes | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code
27 | CAST(ViT-B/16) | 71.6 | No | CAST: Cross-Attention in Space and Time for Vide... | 2023-11-30 | Code
28 | StructVit-B-4-1 | 71.5 | No | Learning Correlation Structures for Vision Trans... | 2024-04-05 | -
29 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | 71.4 | Yes | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code
30 | BEVT (IN-1K + Kinetics400 pretrain) | 71.4 | Yes | BEVT: BERT Pretraining of Video Transformers | 2021-12-02 | Code
31 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 71.2 | Yes | - | - | Code
32 | TAdaConvNeXtV2-B | 71.1 | Yes | Temporally-Adaptive Models for Efficient Video U... | 2023-08-10 | Code
33 | MAR (50% mask, ViT-B, 16x4) | 71 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
34 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 70.9 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
35 | CoVeR(JFT-3B) | 70.9 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | -
36 | VideoMAE (no extra data, ViT-B, 16frame) | 70.8 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code
37 | AMD(ViT-S/16) | 70.2 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | -
38 | ILA (ViT-L/14) | 70.2 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code
39 | MorphMLP-B (IN-1K) | 70.1 | Yes | MorphMLP: An Efficient MLP-Like Backbone for Spa... | 2021-11-24 | Code
40 | CoVeR(JFT-300M) | 69.8 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | -
41 | TPS | 69.8 | No | Spatiotemporal Self-attention Modeling with Temp... | 2022-07-27 | Code
42 | SIFA | 69.8 | No | Stand-Alone Inter-Frame Attention in Video Models | 2022-06-14 | Code
43 | Swin-B (IN-21K + Kinetics400 pretrain) | 69.6 | Yes | Video Swin Transformer | 2021-06-24 | Code
44 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 69.6 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code
45 | MAR (75% mask, ViT-B, 16x4) | 69.5 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
46 | ORViT Mformer-L (ORViT blocks) | 69.5 | Yes | Object-Region Video Transformers | 2021-10-13 | Code
47 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 69.4 | Yes | - | - | Code
48 | MML (ensemble) | 69.02 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code
49 | MViT-B-24, 32x3 | 68.7 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code
50 | MTV-B | 68.5 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code
51 | MLP-3D | 68.5 | No | MLP-3D: A MLP-like 3D Architecture with Grouped ... | 2022-06-13 | -
52 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 68.2 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code
53 | MSMA (8+16frames) | 68.2 | No | - | - | -
54 | Mformer-L | 68.1 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code
55 | VIMPAC | 68.1 | Yes | VIMPAC: Video Pre-Training via Masked Token Pred... | 2021-06-21 | Code
56 | ORViT Mformer (ORViT blocks) | 67.9 | Yes | Object-Region Video Transformers | 2021-10-13 | Code
57 | MViT-B, 32x3(Kinetics600 pretrain) | 67.8 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code
58 | GC-TDN Ensemble (R50,8+16) | 67.8 | Yes | Group Contextualization for Video Recognition | 2022-03-18 | Code
59 | CT-Net Ensemble (R50, 8+12+16+24) | 67.8 | Yes | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code
60 | TCM (Ensemble) | 67.8 | No | Motion-driven Visual Tempo Learning for Video-ba... | 2022-02-24 | Code
61 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 67.7 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code
62 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 67.7 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
63 | GTDNet | 67.6 | No | - | - | -
64 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 67.4 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code
65 | VoV3D-L (32frames, Kinetics pretrained, single) | 67.35 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
66 | PLAR | 67.3 | No | SCP: Soft Conditional Prompt Learning for Aerial... | 2023-05-21 | -
67 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 67.3 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
68 | X-Vit (x16) | 67.2 | Yes | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code
69 | TAda2D-En (ResNet-50, 8+16 frames) | 67.2 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
70 | Mformer-HR | 67.1 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code
71 | TAdaConvNeXt-T | 67.1 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
72 | MoDS (8+16frames) | 67.1 | No | - | - | -
73 | STPG (8+16frames) | 67 | No | - | - | -
74 | MML (single) | 66.83 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code
75 | ILA (ViT-B/16) | 66.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code
76 | TSM (RGB + Flow) | 66.6 | Yes | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code
77 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 66.6 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code
78 | PAN ResNet101 (RGB only, no Flow) | 66.5 | Yes | PAN: Towards Fast Action Recognition via Learnin... | 2020-08-08 | Code
79 | TSM+W3 (16 frames, RGB ResNet-50) | 66.5 | Yes | Knowing What, Where and When to Look: Efficient ... | 2020-04-02 | -
80 | Mformer | 66.5 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code
81 | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 66.3 | Yes | MVFNet: Multi-View Fusion Network for Efficient ... | 2020-12-13 | Code
82 | MViT-B, 16x4 | 66.2 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code
83 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 66 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
84 | VoV3D-L (32frames, from scratch, single) | 65.8 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
85 | E3D-L | 65.7 | No | Maximizing Spatio-Temporal Entropy of Deep 3D CN... | 2023-03-05 | Code
86 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 65.7 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code
87 | TAda2D (ResNet-50, 16 frames) | 65.6 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
88 | ViViT-L/16x2 Fact. encoder | 65.4 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code
89 | VoV3D-M (32frames, Kinetics pretrained, single) | 65.24 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
90 | bLVNet | 65.2 | Yes | More Is Less: Learning Efficient Video Represent... | 2019-12-02 | Code
91 | DirecFormer | 64.94 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code
92 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 64.8 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
93 | MSNet-R50 (16 frames, ImageNet pretrained) | 64.7 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code
94 | AK-Net | 64.3 | No | Action Keypoint Network for Efficient Video Reco... | 2022-01-17 | -
95 | VoV3D-M (32frames, from scratch, single) | 64.2 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
96 | STM (16 frames, ImageNet pretraining) | 64.2 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | -
97 | VoV3D-L (16frames, from scratch, single) | 64.1 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
98 | TAda2D (ResNet-50, 8 frames) | 64 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
99 | MoViNet-A2 | 63.5 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code
100 | VoV3D-M (16frames, from scratch, single) | 63.2 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
101 | MSNet-R50 (8 frames, ImageNet pretrained) | 63 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code
102 | MoViNet-A1 | 62.7 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code
103 | OmniVL | 62.5 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | -
104 | TimeSformer-HR | 62.5 | Yes | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code
105 | TimeSformer-L | 62.3 | Yes | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code
106 | TRG (ResNet-50) | 62.2 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | -
107 | TPN (TSM-50) | 62 | No | Temporal Pyramid Network for Action Recognition | 2020-04-07 | Code
108 | Multigrid | 61.7 | Yes | A Multigrid Method for Efficiently Training Vide... | 2019-12-02 | Code
109 | SlowFast | 61.7 | Yes | SlowFast Networks for Video Recognition | 2018-12-10 | Code
110 | TRG (Inception-V3) | 61.3 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | -
111 | MoViNet-A0 | 61.3 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code
112 | CCS + two-stream + TRN | 61.2 | No | Cooperative Cross-Stream Network for Discriminat... | 2019-08-27 | -
113 | VidTr-L | 60.2 | No | VidTr: Video Transformer Without Convolutions | 2021-04-23 | -
114 | TimeSformer | 59.5 | Yes | Is Space-Time Attention All You Need for Video U... | 2021-02-09 | Code
115 | SVT | 59.2 | No | Self-supervised Video Transformer | 2021-12-02 | Code
116 | CPNet Res34, 5 CP | 57.65 | No | Learning Video Representations from Corresponden... | 2019-05-20 | Code
117 | 2-Stream TRN | 55.52 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code
118 | TAM (5-shot) | 52.3 | No | Few-Shot Video Classification via Temporal Align... | 2019-06-27 | -
119 | model3D_1 with left-right augmentation and fps jitter | 51.33 | No | The "something something" video database for lea... | 2017-06-13 | Code
120 | Prob-Distill | 49.9 | No | Attention Distillation for Learning Video Repres... | 2019-04-05 | -
121 | STM + TRNMultiscale | 47.73 | No | Comparative Analysis of CNN-based Spatiotemporal... | 2019-09-11 | Code
122 | DIN | 34.11 | No | DenseImage Network: Video Spatial-Temporal Evolu... | 2018-05-19 | -
123 | InternVideo2-6B | 1 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code