Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Ours

Reported on 78 benchmarks across 29 tasks · 13 papers · 34 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
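To illustrate the note above, here is a minimal sketch of exact-name matching. It is not the site's actual implementation: the function names, the row layout, and the "Ours (large)" row with its placeholder paper and benchmark are hypothetical. The three "Ours" rows use papers and datasets listed on this page.

```python
from collections import defaultdict


def group_by_exact_name(rows):
    """Group (model_name, paper, benchmark) rows by the exact model-name string."""
    groups = defaultdict(list)
    for name, paper, benchmark in rows:
        # Exact string match: no case folding, trimming, or variant detection.
        groups[name].append((paper, benchmark))
    return dict(groups)


def papers_per_name(groups):
    """List the distinct papers that report results under each model name."""
    return {
        name: sorted({paper for paper, _ in entries})
        for name, entries in groups.items()
    }


rows = [
    ("Ours", "arXiv:2501.11653", "imSitu"),
    ("Ours", "arXiv:2312.10300", "Shot2Story20K"),
    ("Ours", "arXiv:2304.14124", "ModelNet40"),
    # Hypothetical row: a differently spelled name lands in its own group.
    ("Ours (large)", "hypothetical-paper", "hypothetical-benchmark"),
]

groups = group_by_exact_name(rows)
papers = papers_per_name(groups)
```

Under this scheme, a name such as "Ours" that several unrelated papers use collects all of their results in one group, which is why the entries below span many different tasks and papers.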

Computer Vision · 55 results

  • Situation Recognition on imSitu
    Top-1 Verb · 2025-01-20
    58.88
    SOTA
    Dynamic Scene Understanding from Vision-Language Representations · arXiv:2501.11653
  • video narration captioning on Shot2Story20K
    BLEU-4 · 2023-12-16
    18.8
    SOTA
    Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos · arXiv:2312.10300
  • video narration captioning on Shot2Story20K
    CIDEr · 2023-12-16
    168.7
    SOTA
    Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos · arXiv:2312.10300
  • video narration captioning on Shot2Story20K
    METEOR · 2023-12-16
    24.8
    SOTA
    Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos · arXiv:2312.10300
  • video narration captioning on Shot2Story20K
    ROUGE · 2023-12-16
    39
    SOTA
    Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos · arXiv:2312.10300
  • Shape Representation Of 3D Point Clouds on ModelNet40
    Classification Accuracy · 2023-04-27
    93.6
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • 3D Object Classification on ModelNet40
    Classification Accuracy · 2023-04-27
    93.6
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • 3D Point Cloud Classification on ModelNet40
    Classification Accuracy · 2023-04-27
    93.6
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • 3D Point Cloud Reconstruction on ModelNet40
    Classification Accuracy · 2023-04-27
    93.6
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • Video on UCF101
    Top-1 · 2022-03-29
    62.03
    SOTA
    SPAct: Self-supervised Privacy Preservation for Action Recognition · arXiv:2203.15205
  • Video on ActivityNet
    video-to-text R@1 · 2021-10-21
    26.1
    best: 69.7 (InternVideo2-6B)
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on ActivityNet
    video-to-text R@5 · 2021-10-21
    60
    best: 89.1 (UMT-L (ViT-L/16))
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on LSMDC
    video-to-text R@1 · 2021-10-21
    15.3
    best: 46.7 (InternVideo2-6B)
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on LSMDC
    video-to-text R@5 · 2021-10-21
    34.1
    best: 71.8 (HunYuan_tvr (huge))
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on ActivityNet
    video-to-text R@1 · 2021-10-21
    26.1
    best: 69.7 (InternVideo2-6B)
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on ActivityNet
    video-to-text R@5 · 2021-10-21
    60
    best: 89.1 (UMT-L (ViT-L/16))
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on LSMDC
    video-to-text R@1 · 2021-10-21
    15.3
    best: 46.7 (InternVideo2-6B)
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on LSMDC
    video-to-text R@5 · 2021-10-21
    34.1
    best: 71.8 (HunYuan_tvr (huge))
    SOTA
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on YouTube
    Average · 2021-08-11
    74.9
    SOTA
    Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation · arXiv:2108.05076
  • Video Object Segmentation on YouTube
    Average · 2021-08-11
    74.9
    SOTA
    Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation · arXiv:2108.05076
  • Deblurring on Second dialogue state tracking challenge
    MAE · 2021-04-16
    0.0377
    SOTA
    Attention! Stay Focus! · arXiv:2104.07925
  • Face Reconstruction on AFLW-LFPA
    NME · 2018-08-14
    3.02
    SOTA
    Hierarchical binary CNNs for landmark localization with limited resources · arXiv:1808.04803
  • 3D Face Reconstruction on AFLW-LFPA
    NME · 2018-08-14
    3.02
    SOTA
    Hierarchical binary CNNs for landmark localization with limited resources · arXiv:1808.04803
  • Video on ActivityNet
    text-to-video R@1 · 2021-10-21
    25.4
    best: 74.1 (InternVideo2-6B)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on ActivityNet
    text-to-video R@5 · 2021-10-21
    59.1
    best: 90.9 (VAST)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on MSR-VTT
    text-to-video Median Rank · 2021-10-21
    3
    best: 55 (C+LSTM+SA+FC7)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on MSR-VTT
    text-to-video R@1 · 2021-10-21
    26
    best: 64 (GRAM)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on MSR-VTT
    text-to-video R@5 · 2021-10-21
    56.7
    best: 84.3 (VAST)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on MSR-VTT
    video-to-text Median Rank · 2021-10-21
    3
    best: 16 (JEMC)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on MSR-VTT
    video-to-text R@1 · 2021-10-21
    26.7
    best: 64.8 (GRAM)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on MSR-VTT
    video-to-text R@5 · 2021-10-21
    56.5
    best: 86.2 (CAMoE)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on LSMDC
    text-to-video R@1 · 2021-10-21
    14.9
    best: 46.4 (InternVideo2-6B)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video on LSMDC
    text-to-video R@5 · 2021-10-21
    33.2
    best: 80.1 (HunYuan_tvr (huge))
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on ActivityNet
    text-to-video R@1 · 2021-10-21
    25.4
    best: 74.1 (InternVideo2-6B)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on ActivityNet
    text-to-video R@5 · 2021-10-21
    59.1
    best: 90.9 (VAST)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on MSR-VTT
    text-to-video Median Rank · 2021-10-21
    3
    best: 55 (C+LSTM+SA+FC7)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on MSR-VTT
    text-to-video R@1 · 2021-10-21
    26
    best: 64 (GRAM)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on MSR-VTT
    text-to-video R@5 · 2021-10-21
    56.7
    best: 84.3 (VAST)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on MSR-VTT
    video-to-text Median Rank · 2021-10-21
    3
    best: 16 (JEMC)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on MSR-VTT
    video-to-text R@1 · 2021-10-21
    26.7
    best: 64.8 (GRAM)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on MSR-VTT
    video-to-text R@5 · 2021-10-21
    56.5
    best: 86.2 (CAMoE)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on LSMDC
    text-to-video R@1 · 2021-10-21
    14.9
    best: 46.4 (InternVideo2-6B)
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Video Retrieval on LSMDC
    text-to-video R@5 · 2021-10-21
    33.2
    best: 80.1 (HunYuan_tvr (huge))
    Video and Text Matching with Conditioned Embeddings · arXiv:2110.11298
  • Image Classification on CUB 200 5-way 1-shot
    Accuracy · 2020-10-07
    79.12
    best: 95.8 (PT+MAP+SF+SOT (transductive))
    Variational Feature Disentangling for Fine-Grained Few-Shot Classification · arXiv:2010.03255
  • Few-Shot Image Classification on CUB 200 5-way 1-shot
    Accuracy · 2020-10-07
    79.12
    best: 95.8 (PT+MAP+SF+SOT (transductive))
    Variational Feature Disentangling for Fine-Grained Few-Shot Classification · arXiv:2010.03255
  • Video on DAVIS 2016
    Jaccard (Mean) · 2020-08-04
    83.4
    best: 92.5 (ISVOS (BL30K, MS))
    Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation · arXiv:2008.01270
  • Video Object Segmentation on DAVIS 2016
    Jaccard (Mean) · 2020-08-04
    83.4
    best: 92.5 (ISVOS (BL30K, MS))
    Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation · arXiv:2008.01270
  • 3D Human Pose Estimation on HumanEva-I
    Mean Reconstruction Error (mm) · 2018-08-17
    64
    best: 9.2 (GLA-GCN (T=27, GT))
    Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation · arXiv:1808.05942
  • Pose Estimation on HumanEva-I
    Mean Reconstruction Error (mm) · 2018-08-17
    64
    best: 9.2 (GLA-GCN (T=27, GT))
    Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation · arXiv:1808.05942
  • Shape Representation Of 3D Point Clouds on ScanObjectNN
    Mean Accuracy
    87.2
    best: 93.8 (GPSFormer)
  • Shape Representation Of 3D Point Clouds on ScanObjectNN
    Overall Accuracy
    89
    best: 97.2 (OmniVec2)
  • 3D Point Cloud Classification on ScanObjectNN
    Mean Accuracy
    87.2
    best: 93.8 (GPSFormer)
  • 3D Point Cloud Classification on ScanObjectNN
    Overall Accuracy
    89
    best: 97.2 (OmniVec2)
  • 3D Point Cloud Reconstruction on ScanObjectNN
    Mean Accuracy
    87.2
    best: 93.8 (GPSFormer)
  • 3D Point Cloud Reconstruction on ScanObjectNN
    Overall Accuracy
    89
    best: 97.2 (OmniVec2)

Medical · 7 results

  • 3D Classification on ModelNet40
    Classification Accuracy · 2023-04-27
    93.6
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • 3D Face Modelling on AFLW-LFPA
    NME · 2018-08-14
    3.02
    SOTA
    Hierarchical binary CNNs for landmark localization with limited resources · arXiv:1808.04803
  • Semantic Segmentation on ShapeNet-Part
    Instance Average IoU · 2023-04-27
    86.2
    best: 89.1 (GeomGCNN)
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • Semantic Segmentation on S3DIS Area5
    mAcc
    80.2
    best: 81.6 (Sonata + PTv3)
  • Semantic Segmentation on S3DIS Area5
    mIoU
    73.6
    best: 76 (Sonata + PTv3)
  • Semantic Segmentation on S3DIS Area5
    oAcc
    93
  • Semantic Segmentation on ShapeNet-Part
    Instance Average IoU
    88.1
    best: 89.1 (GeomGCNN)

Audio · 7 results

  • 10-shot image generation on Second dialogue state tracking challenge
    MAE · 2021-04-16
    0.0377
    SOTA
    Attention! Stay Focus! · arXiv:2104.07925
  • 10-shot image generation on ShapeNet-Part
    Instance Average IoU · 2023-04-27
    86.2
    best: 89.1 (GeomGCNN)
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • 1 Image, 2*2 Stitchi on HumanEva-I
    Mean Reconstruction Error (mm) · 2018-08-17
    64
    best: 9.2 (GLA-GCN (T=27, GT))
    Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation · arXiv:1808.05942
  • 10-shot image generation on S3DIS Area5
    mAcc
    80.2
    best: 81.6 (Sonata + PTv3)
  • 10-shot image generation on S3DIS Area5
    mIoU
    73.6
    best: 76 (Sonata + PTv3)
  • 10-shot image generation on S3DIS Area5
    oAcc
    93
  • 10-shot image generation on ShapeNet-Part
    Instance Average IoU
    88.1
    best: 89.1 (GeomGCNN)

Methodology · 5 results

  • 3D on ModelNet40
    Classification Accuracy · 2023-04-27
    93.6
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124
  • 2D Classification on Second dialogue state tracking challenge
    MAE · 2021-04-16
    0.0377
    SOTA
    Attention! Stay Focus! · arXiv:2104.07925
  • 3D on AFLW-LFPA
    NME · 2018-08-14
    3.02
    SOTA
    Hierarchical binary CNNs for landmark localization with limited resources · arXiv:1808.04803
  • 3D on AFLW2000-3D
    NME · 2018-08-14
    3.26
    SOTA
    Hierarchical binary CNNs for landmark localization with limited resources · arXiv:1808.04803
  • 3D on HumanEva-I
    Mean Reconstruction Error (mm) · 2018-08-17
    64
    best: 9.2 (GLA-GCN (T=27, GT))
    Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation · arXiv:1808.05942

Graphs · 1 result

  • Point Cloud Classification on ISPRS
    Average F1 · 2023-04-27
    82.8
    SOTA
    Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation · arXiv:2304.14124

Computer Code · 1 result

  • Blind Image Deblurring on Second dialogue state tracking challenge
    MAE · 2021-04-16
    0.0377
    SOTA
    Attention! Stay Focus! · arXiv:2104.07925

Natural Language Processing · 1 result

  • Data-to-Text Generation on Wikipedia Person and Animal Dataset
    BLEU · 2020-05-03
    24.56
    SOTA
    Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints · arXiv:2005.00969

Music · 1 result

  • Facial Recognition and Modelling on AFLW-LFPA
    NME · 2018-08-14
    3.02
    SOTA
    Hierarchical binary CNNs for landmark localization with limited resources · arXiv:1808.04803

Other · 1 result

  • Local Distortion on DocUNet
    LD · 2022-03-31
    9.36
    best: 14.08 (DocUNet)
    Revisiting Document Image Dewarping by Grid Regularization · arXiv:2203.16850

Adversarial · 1 result

  • Text Generation on Wikipedia Person and Animal Dataset
    BLEU · 2020-05-03
    24.56
    best: 25.22 (VTM)
    Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints · arXiv:2005.00969