Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Activity Recognition on Something-Something V2

Metric: Top-5 Accuracy (higher is better)
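As a rough illustration of the metric, a prediction counts as correct when the true class is among the five highest-scoring classes, and the result is reported as a percentage. The sketch below is a minimal example under stated assumptions, not the evaluation code behind any entry on this leaderboard: the function name `top_k_accuracy`, the list-of-lists score layout, and the toy inputs are all hypothetical.

```python
def top_k_accuracy(scores, labels, k=5):
    """Return the percentage of samples whose true label is among the
    k highest-scoring classes.

    scores: one list of per-class scores per sample (hypothetical layout).
    labels: the true class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        # Ties keep the lower class index first (an assumption of this sketch).
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 6 classes and 2 samples, both with true class 0.
toy_scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # class 0 has the lowest score
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],  # class 0 has the highest score
]
toy_labels = [0, 0]
toy_result = top_k_accuracy(toy_scores, toy_labels, k=5)
```

In the toy example only the second sample has its true class inside the top five, so half of the samples count as correct.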

Results

# | Model | Top-5 Accuracy | Extra Data | Paper | Date | Code
1 | DejaVid | 96.3 | Yes | - | - | Code
2 | VideoMAE V2-g | 95.9 | Yes | VideoMAE V2: Scaling Video Masked Autoencoders w... | 2023-03-29 | Code
3 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 95.7 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
4 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 95.5 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
5 | TubeViT-L | 95.2 | No | Rethinking Video ViTs: Sparse Video Tubes for Jo... | 2022-12-06 | Code
6 | VideoMAE (no extra data, ViT-L, 32x2) | 95.2 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code
7 | MaskFeat (Kinetics600 pretrain, MViT-L) | 95 | Yes | Masked Feature Prediction for Self-Supervised Vi... | 2021-12-16 | Code
8 | MAR (50% mask, ViT-L, 16x4) | 94.9 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
9 | VideoMAE (no extra data, ViT-L, 16frame) | 94.6 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code
10 | UniFormerV2-L | 94.5 | Yes | - | - | Code
11 | ATM | 94.4 | No | What Can Simple Arithmetic Operations Do for Tem... | 2023-07-18 | Code
12 | MAR (75% mask, ViT-L, 16x4) | 94.4 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
13 | MViTv2-L (IN-21K + Kinetics400 pretrain) | 94.1 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code
14 | Side4Video (EVA ViT-E/14) | 94 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code
15 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 94 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
16 | AMD(ViT-B/16) | 94 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | -
17 | ST-Adapter (ViT-L, CLIP) | 93.9 | Yes | ST-Adapter: Parameter-Efficient Image-to-Video T... | 2022-06-27 | Code
18 | TDS-CLIP-ViT-L/14(8frames) | 93.8 | No | TDS-CLIP: Temporal Difference Side Network for I... | 2024-08-20 | Code
19 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | 93.5 | Yes | Omnivore: A Single Model for Many Visual Modalit... | 2022-01-20 | Code
20 | MViTv2-B (IN-21K + Kinetics400 pretrain) | 93.4 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code
21 | ZeroI2V ViT-L/14 | 93 | Yes | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tra... | 2023-10-02 | Code
22 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 92.8 | Yes | - | - | Code
23 | MAR (50% mask, ViT-B, 16x4) | 92.8 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
24 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 92.8 | Yes | Masked Video Distillation: Rethinking Masked Fea... | 2022-12-08 | Code
25 | MorphMLP-B (IN-1K) | 92.8 | Yes | MorphMLP: An Efficient MLP-Like Backbone for Spa... | 2021-11-24 | Code
26 | Swin-B (IN-21K + Kinetics400 pretrain) | 92.7 | Yes | Video Swin Transformer | 2021-06-24 | Code
27 | MML (ensemble) | 92.7 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code
28 | CoVeR(JFT-3B) | 92.5 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | -
29 | AMD(ViT-S/16) | 92.5 | No | Asymmetric Masked Distillation for Pre-Training ... | 2023-11-06 | -
30 | VideoMAE (no extra data, ViT-B, 16frame) | 92.4 | No | VideoMAE: Masked Autoencoders are Data-Efficient... | 2022-03-23 | Code
31 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 92.2 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code
32 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 92.1 | Yes | - | - | Code
33 | CoVeR(JFT-300M) | 91.9 | Yes | Co-training Transformer with Videos and Images I... | 2021-12-14 | -
34 | MAR (75% mask, ViT-B, 16x4) | 91.9 | No | MAR: Masked Autoencoders for Efficient Action Re... | 2022-07-24 | Code
35 | ILA (ViT-L/14) | 91.8 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code
36 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 91.6 | Yes | TDN: Temporal Difference Networks for Efficient ... | 2020-12-18 | Code
37 | ORViT Mformer-L (ORViT blocks) | 91.5 | Yes | Object-Region Video Transformers | 2021-10-13 | Code
38 | MViT-B-24, 32x3 | 91.5 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code
39 | TRG (Inception-V3) | 91.4 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | -
40 | MViT-B, 32x3(Kinetics600 pretrain) | 91.3 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code
41 | MML (single) | 91.3 | Yes | Mutual Modality Learning for Video Action Classi... | 2020-11-04 | Code
42 | TSM (RGB + Flow) | 91.3 | Yes | TSM: Temporal Shift Module for Efficient Video U... | 2018-11-20 | Code
43 | Mformer-L | 91.2 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code
44 | GC-TDN Ensemble (R50,8+16) | 91.2 | Yes | Group Contextualization for Video Recognition | 2022-03-18 | Code
45 | CT-Net Ensemble (R50, 8+12+16+24) | 91.1 | Yes | CT-Net: Channel Tensorization Network for Video ... | 2021-06-03 | Code
46 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 91.1 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code
47 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips | 91.1 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
48 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 91.1 | No | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
49 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 91 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code
50 | PLAR | 91 | No | SCP: Soft Conditional Prompt Learning for Aerial... | 2023-05-21 | -
51 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 90.8 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
52 | X-Vit (x16) | 90.8 | Yes | Space-time Mixing Attention for Video Transformer | 2021-06-10 | Code
53 | Mformer-HR | 90.6 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code
54 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 90.6 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code
55 | PAN ResNet101 (RGB only, no Flow) | 90.6 | Yes | PAN: Towards Fast Action Recognition via Learnin... | 2020-08-08 | Code
56 | ORViT Mformer (ORViT blocks) | 90.5 | Yes | Object-Region Video Transformers | 2021-10-13 | Code
57 | VoV3D-L (32frames, Kinetics pretrained, single) | 90.5 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
58 | MTV-B | 90.4 | Yes | Multiview Transformers for Video Recognition | 2022-01-12 | Code
59 | TAdaConvNeXt-T | 90.4 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
60 | TSM+W3 (16 frames, RGB ResNet-50) | 90.4 | Yes | Knowing What, Where and When to Look: Efficient ... | 2020-04-02 | -
61 | ILA (ViT-B/16) | 90.3 | No | Implicit Temporal Modeling with Learnable Alignm... | 2023-04-20 | Code
62 | TRG (ResNet-50) | 90.3 | No | Temporal Reasoning Graph for Activity Recognition | 2019-08-27 | -
63 | MViT-B, 16x4 | 90.2 | Yes | Multiscale Vision Transformers | 2021-04-22 | Code
64 | Mformer | 90.1 | Yes | Keeping Your Eye on the Ball: Trajectory Attenti... | 2021-06-09 | Code
65 | TAda2D-En (ResNet-50, 8+16 frames) | 89.8 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
66 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 89.8 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
67 | E3D-L | 89.8 | No | Maximizing Spatio-Temporal Entropy of Deep 3D CN... | 2023-03-05 | Code
68 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 89.8 | Yes | Learning Self-Similarity in Space and Time as Ge... | 2021-02-14 | Code
69 | ViViT-L/16x2 Fact. encoder | 89.8 | Yes | ViViT: A Video Vision Transformer | 2021-03-29 | Code
70 | STM (16 frames, ImageNet pretraining) | 89.8 | No | STM: SpatioTemporal and Motion Encoding for Acti... | 2019-08-07 | -
71 | VoV3D-L (32frames, from scratch, single) | 89.5 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
72 | VoV3D-M (32frames, Kinetics pretrained, single) | 89.48 | Yes | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
73 | MSNet-R50 (16 frames, ImageNet pretrained) | 89.4 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code
74 | CCS + two-stream + TRN | 89.3 | No | Cooperative Cross-Stream Network for Discriminat... | 2019-08-27 | -
75 | TAda2D (ResNet-50, 16 frames) | 89.2 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
76 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 89.1 | Yes | Relational Self-Attention: What's Missing in Att... | 2021-11-02 | Code
77 | MoViNet-A2 | 89 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code
78 | MoViNet-A1 | 89 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code
79 | VoV3D-M (32frames, from scratch, single) | 88.8 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
80 | VoV3D-L (16frames, from scratch, single) | 88.6 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
81 | MSNet-R50 (8 frames, ImageNet pretrained) | 88.4 | Yes | MotionSqueeze: Neural Motion Feature Learning fo... | 2020-07-20 | Code
82 | VoV3D-M (16frames, from scratch, single) | 88.2 | No | Diverse Temporal Aggregation and Depthwise Spati... | 2020-12-01 | Code
83 | MoViNet-A0 | 88.2 | No | MoViNets: Mobile Video Networks for Efficient Vi... | 2021-03-21 | Code
84 | TAda2D (ResNet-50, 8 frames) | 88 | Yes | TAda! Temporally-Adaptive Convolutions for Video... | 2021-10-12 | Code
85 | DirecFormer | 87.9 | No | DirecFormer: A Directed Attention in Transformer... | 2022-03-19 | Code
86 | OmniVL | 86.2 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | -
87 | CPNet Res34, 5 CP | 83.95 | No | Learning Video Representations from Corresponden... | 2019-05-20 | Code
88 | 2-Stream TRN | 83.06 | No | Temporal Relational Reasoning in Videos | 2017-11-22 | Code
89 | model3D_1 with left-right augmentation and fps jitter | 80.46 | No | The "something something" video database for lea... | 2017-06-13 | Code
90 | Prob-Distill | 79.1 | No | Attention Distillation for Learning Video Repres... | 2019-04-05 | -
91 | InternVideo2-6B | 12 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code