Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Ours

Reported on 78 benchmarks across 29 tasks · 13 papers · 34 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
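
The exact-name matching described in the note can be illustrated with a minimal Python sketch. The row layout and the helper names `group_by_exact_name` and `ambiguous_names` are assumptions made for illustration, not the site's actual implementation. The arXiv identifiers for `Ours` are taken from the listing on this page, and the `Ours-L` row is purely hypothetical.

```python
from collections import defaultdict

# Illustrative rows: (model name as reported, arXiv id of the reporting paper).
# The "Ours" ids come from this page; the "Ours-L" row is hypothetical.
ROWS = [
    ("Ours", "2501.11653"),
    ("Ours", "2312.10300"),
    ("Ours", "2304.14124"),
    ("Ours-L", "hypothetical-paper"),
]


def group_by_exact_name(rows):
    """Group paper ids under the exact, case-sensitive model name."""
    groups = defaultdict(set)
    for name, paper_id in rows:
        groups[name].add(paper_id)
    return dict(groups)


def ambiguous_names(groups):
    """Return names reported by more than one paper, sorted alphabetically."""
    return sorted(name for name, papers in groups.items() if len(papers) > 1)


groups = group_by_exact_name(ROWS)
print(ambiguous_names(groups))
```

In this sketch three unrelated papers share the name `Ours`, so a page keyed on that name mixes their results, which is the situation the note warns about.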

Computer Vision · 55 results

Situation Recognition on imSitu
Top-1 Verb · 2025-01-20
58.88
SOTA
Dynamic Scene Understanding from Vision-Language Representations arXiv:2501.11653
video narration captioning on Shot2Story20K
BLEU-4 · 2023-12-16
18.8
SOTA
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos arXiv:2312.10300
video narration captioning on Shot2Story20K
CIDEr · 2023-12-16
168.7
SOTA
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos arXiv:2312.10300
video narration captioning on Shot2Story20K
METEOR · 2023-12-16
24.8
SOTA
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos arXiv:2312.10300
video narration captioning on Shot2Story20K
ROUGE · 2023-12-16
39
SOTA
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos arXiv:2312.10300
Shape Representation Of 3D Point Clouds on ModelNet40
Classification Accuracy · 2023-04-27
93.6
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
3D Object Classification on ModelNet40
Classification Accuracy · 2023-04-27
93.6
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
3D Point Cloud Classification on ModelNet40
Classification Accuracy · 2023-04-27
93.6
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
3D Point Cloud Reconstruction on ModelNet40
Classification Accuracy · 2023-04-27
93.6
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
Video on UCF101
Top-1 · 2022-03-29
62.03
SOTA
SPAct: Self-supervised Privacy Preservation for Action Recognition arXiv:2203.15205
Video on ActivityNet
video-to-text R@1 · 2021-10-21
26.1
best: 69.7 (InternVideo2-6B)
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on ActivityNet
video-to-text R@5 · 2021-10-21
60
best: 89.1 (UMT-L (ViT-L/16))
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on LSMDC
video-to-text R@1 · 2021-10-21
15.3
best: 46.7 (InternVideo2-6B)
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on LSMDC
video-to-text R@5 · 2021-10-21
34.1
best: 71.8 (HunYuan_tvr (huge))
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on ActivityNet
video-to-text R@1 · 2021-10-21
26.1
best: 69.7 (InternVideo2-6B)
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on ActivityNet
video-to-text R@5 · 2021-10-21
60
best: 89.1 (UMT-L (ViT-L/16))
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on LSMDC
video-to-text R@1 · 2021-10-21
15.3
best: 46.7 (InternVideo2-6B)
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on LSMDC
video-to-text R@5 · 2021-10-21
34.1
best: 71.8 (HunYuan_tvr (huge))
SOTA
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on YouTube
Average · 2021-08-11
74.9
SOTA
Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation arXiv:2108.05076
Video Object Segmentation on YouTube
Average · 2021-08-11
74.9
SOTA
Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation arXiv:2108.05076
Deblurring on Second dialogue state tracking challenge
MAE · 2021-04-16
0.0377
SOTA
Attention! Stay Focus! arXiv:2104.07925
Face Reconstruction on AFLW-LFPA
NME · 2018-08-14
3.02
SOTA
Hierarchical binary CNNs for landmark localization with limited resources arXiv:1808.04803
3D Face Reconstruction on AFLW-LFPA
NME · 2018-08-14
3.02
SOTA
Hierarchical binary CNNs for landmark localization with limited resources arXiv:1808.04803
Video on ActivityNet
text-to-video R@1 · 2021-10-21
25.4
best: 74.1 (InternVideo2-6B)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on ActivityNet
text-to-video R@5 · 2021-10-21
59.1
best: 90.9 (VAST)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on MSR-VTT
text-to-video Median Rank · 2021-10-21
3
best: 55 (C+LSTM+SA+FC7)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on MSR-VTT
text-to-video R@1 · 2021-10-21
26
best: 64 (GRAM)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on MSR-VTT
text-to-video R@5 · 2021-10-21
56.7
best: 84.3 (VAST)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on MSR-VTT
video-to-text Median Rank · 2021-10-21
3
best: 16 (JEMC)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on MSR-VTT
video-to-text R@1 · 2021-10-21
26.7
best: 64.8 (GRAM)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on MSR-VTT
video-to-text R@5 · 2021-10-21
56.5
best: 86.2 (CAMoE)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on LSMDC
text-to-video R@1 · 2021-10-21
14.9
best: 46.4 (InternVideo2-6B)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video on LSMDC
text-to-video R@5 · 2021-10-21
33.2
best: 80.1 (HunYuan_tvr (huge))
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on ActivityNet
text-to-video R@1 · 2021-10-21
25.4
best: 74.1 (InternVideo2-6B)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on ActivityNet
text-to-video R@5 · 2021-10-21
59.1
best: 90.9 (VAST)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on MSR-VTT
text-to-video Median Rank · 2021-10-21
3
best: 55 (C+LSTM+SA+FC7)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on MSR-VTT
text-to-video R@1 · 2021-10-21
26
best: 64 (GRAM)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on MSR-VTT
text-to-video R@5 · 2021-10-21
56.7
best: 84.3 (VAST)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on MSR-VTT
video-to-text Median Rank · 2021-10-21
3
best: 16 (JEMC)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on MSR-VTT
video-to-text R@1 · 2021-10-21
26.7
best: 64.8 (GRAM)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on MSR-VTT
video-to-text R@5 · 2021-10-21
56.5
best: 86.2 (CAMoE)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on LSMDC
text-to-video R@1 · 2021-10-21
14.9
best: 46.4 (InternVideo2-6B)
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Video Retrieval on LSMDC
text-to-video R@5 · 2021-10-21
33.2
best: 80.1 (HunYuan_tvr (huge))
Video and Text Matching with Conditioned Embeddings arXiv:2110.11298
Image Classification on CUB 200 5-way 1-shot
Accuracy · 2020-10-07
79.12
best: 95.8 (PT+MAP+SF+SOT (transductive))
Variational Feature Disentangling for Fine-Grained Few-Shot Classification arXiv:2010.03255
Few-Shot Image Classification on CUB 200 5-way 1-shot
Accuracy · 2020-10-07
79.12
best: 95.8 (PT+MAP+SF+SOT (transductive))
Variational Feature Disentangling for Fine-Grained Few-Shot Classification arXiv:2010.03255
Video on DAVIS 2016
Jaccard (Mean) · 2020-08-04
83.4
best: 92.5 (ISVOS (BL30K, MS))
Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation arXiv:2008.01270
Video Object Segmentation on DAVIS 2016
Jaccard (Mean) · 2020-08-04
83.4
best: 92.5 (ISVOS (BL30K, MS))
Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation arXiv:2008.01270
3D Human Pose Estimation on HumanEva-I
Mean Reconstruction Error (mm) · 2018-08-17
64
best: 9.2 (GLA-GCN (T=27, GT))
Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation arXiv:1808.05942
Pose Estimation on HumanEva-I
Mean Reconstruction Error (mm) · 2018-08-17
64
best: 9.2 (GLA-GCN (T=27, GT))
Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation arXiv:1808.05942
Shape Representation Of 3D Point Clouds on ScanObjectNN
Mean Accuracy
87.2
best: 93.8 (GPSFormer)
Shape Representation Of 3D Point Clouds on ScanObjectNN
Overall Accuracy
89
best: 97.2 (OmniVec2)
3D Point Cloud Classification on ScanObjectNN
Mean Accuracy
87.2
best: 93.8 (GPSFormer)
3D Point Cloud Classification on ScanObjectNN
Overall Accuracy
89
best: 97.2 (OmniVec2)
3D Point Cloud Reconstruction on ScanObjectNN
Mean Accuracy
87.2
best: 93.8 (GPSFormer)
3D Point Cloud Reconstruction on ScanObjectNN
Overall Accuracy
89
best: 97.2 (OmniVec2)

Medical · 7 results

3D Classification on ModelNet40
Classification Accuracy · 2023-04-27
93.6
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
3D Face Modelling on AFLW-LFPA
NME · 2018-08-14
3.02
SOTA
Hierarchical binary CNNs for landmark localization with limited resources arXiv:1808.04803
Semantic Segmentation on ShapeNet-Part
Instance Average IoU · 2023-04-27
86.2
best: 89.1 (GeomGCNN)
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
Semantic Segmentation on S3DIS Area5
mAcc
80.2
best: 81.6 (Sonata + PTv3)
Semantic Segmentation on S3DIS Area5
mIoU
73.6
best: 76 (Sonata + PTv3)
Semantic Segmentation on S3DIS Area5
oAcc
93
Semantic Segmentation on ShapeNet-Part
Instance Average IoU
88.1
best: 89.1 (GeomGCNN)

Audio · 7 results

10-shot image generation on Second dialogue state tracking challenge
MAE · 2021-04-16
0.0377
SOTA
Attention! Stay Focus! arXiv:2104.07925
10-shot image generation on ShapeNet-Part
Instance Average IoU · 2023-04-27
86.2
best: 89.1 (GeomGCNN)
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
1 Image, 2*2 Stitchi on HumanEva-I
Mean Reconstruction Error (mm) · 2018-08-17
64
best: 9.2 (GLA-GCN (T=27, GT))
Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation arXiv:1808.05942
10-shot image generation on S3DIS Area5
mAcc
80.2
best: 81.6 (Sonata + PTv3)
10-shot image generation on S3DIS Area5
mIoU
73.6
best: 76 (Sonata + PTv3)
10-shot image generation on S3DIS Area5
oAcc
93
10-shot image generation on ShapeNet-Part
Instance Average IoU
88.1
best: 89.1 (GeomGCNN)

Methodology · 5 results

3D on ModelNet40
Classification Accuracy · 2023-04-27
93.6
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124
2D Classification on Second dialogue state tracking challenge
MAE · 2021-04-16
0.0377
SOTA
Attention! Stay Focus! arXiv:2104.07925
3D on AFLW-LFPA
NME · 2018-08-14
3.02
SOTA
Hierarchical binary CNNs for landmark localization with limited resources arXiv:1808.04803
3D on AFLW2000-3D
NME · 2018-08-14
3.26
SOTA
Hierarchical binary CNNs for landmark localization with limited resources arXiv:1808.04803
3D on HumanEva-I
Mean Reconstruction Error (mm) · 2018-08-17
64
best: 9.2 (GLA-GCN (T=27, GT))
Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation arXiv:1808.05942

Graphs · 1 result

Point Cloud Classification on ISPRS
Average F1 · 2023-04-27
82.8
SOTA
Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation arXiv:2304.14124

Computer Code · 1 result

Blind Image Deblurring on Second dialogue state tracking challenge
MAE · 2021-04-16
0.0377
SOTA
Attention! Stay Focus! arXiv:2104.07925

Natural Language Processing · 1 result

Data-to-Text Generation on Wikipedia Person and Animal Dataset
BLEU · 2020-05-03
24.56
SOTA
Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints arXiv:2005.00969

Music · 1 result

Facial Recognition and Modelling on AFLW-LFPA
NME · 2018-08-14
3.02
SOTA
Hierarchical binary CNNs for landmark localization with limited resources arXiv:1808.04803

Other · 1 result

Local Distortion on DocUNet
LD · 2022-03-31
9.36
best: 14.08 (DocUNet)
Revisiting Document Image Dewarping by Grid Regularization arXiv:2203.16850

Adversarial · 1 result

Text Generation on Wikipedia Person and Animal Dataset
BLEU · 2020-05-03
24.56
best: 25.22 (VTM)
Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints arXiv:2005.00969