Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Human

Reported on 91 benchmarks across 13 tasks · 10 papers · 21 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
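As a rough illustration of what exact-name matching implies, the sketch below groups result rows by plain string equality on the model name. The row layout and the function name `group_by_exact_name` are assumptions made for this example, not the site's actual code; the Geometry3K and Rel3D values are the ones listed on this page.

```python
from collections import defaultdict

# Hypothetical rows: (model name, benchmark, metric, value).
# The values come from this page; the tuple layout is an assumption.
rows = [
    ("Human", "Geometry3K", "Accuracy (%)", 56.9),
    ("Human Expert", "Geometry3K", "Accuracy (%)", 90.9),
    ("Human", "Rel3D", "Acc", 94.25),
]


def group_by_exact_name(rows):
    """Group result rows by model name using plain string equality."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value in rows:
        # No normalization: "Human" and "Human Expert" stay separate keys.
        grouped[name].append((benchmark, metric, value))
    return dict(grouped)


grouped = group_by_exact_name(rows)
print(sorted(grouped))
print(len(grouped["Human"]))
```

Under this kind of matching, "Human" and "Human Expert" are separate entries, which is consistent with Geometry3K listing 56.9 for "Human" alongside a best of 90.9 for "Human Expert".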

Natural Language Processing · 80 results

  • Question Answering on CinePile: A Long Video Question Answering Dataset and Benchmark
    Accuracy · 2024-05-14
    86
    SOTA
    CinePile: A Long Video Question Answering Dataset and Benchmark (arXiv:2405.08813)
  • Reading Comprehension on MIntRec
    Accuracy (20 classes) · 2022-09-09
    85.51
    SOTA
    MIntRec: A New Dataset for Multimodal Intent Recognition (arXiv:2209.04355)
  • Reading Comprehension on MIntRec
    Accuracy (Binary) · 2022-09-09
    94.72
    SOTA
    MIntRec: A New Dataset for Multimodal Intent Recognition (arXiv:2209.04355)
  • Recognizing Emotion Cause in Conversations on EmoCause
    Top-1 Recall · 2021-09-18
    41.3
    SOTA
    Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes (arXiv:2109.08828)
  • Recognizing Emotion Cause in Conversations on EmoCause
    Top-3 Recall · 2021-09-18
    81.1
    SOTA
    Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes (arXiv:2109.08828)
  • Recognizing Emotion Cause in Conversations on EmoCause
    Top-5 Recall · 2021-09-18
    95
    SOTA
    Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes (arXiv:2109.08828)
  • Visual Question Answering (VQA) on DocVQA test
    ANLS · uses extra data · 2020-07-01
    0.9436
    SOTA
    DocVQA: A Dataset for VQA on Document Images (arXiv:2007.00398)
  • Meme Classification on Hateful Memes
    Accuracy · 2020-05-10
    0.847
    SOTA
    The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes (arXiv:2005.04790)
  • Meme Classification on Hateful Memes
    ROC-AUC · 2020-05-10
    0.8265
    best: 0.911 (RA-HMD (Qwen2-VL-7B))
    SOTA
    The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes (arXiv:2005.04790)
  • Word Sense Disambiguation on WiC-TSV
    Task 3 Accuracy: all · 2020-04-30
    85.3
    SOTA
    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context (arXiv:2004.15016)
  • Word Sense Disambiguation on WiC-TSV
    Task 3 Accuracy: domain specific · 2020-04-30
    89.2
    SOTA
    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context (arXiv:2004.15016)
  • Word Sense Disambiguation on WiC-TSV
    Task 3 Accuracy: general purpose · 2020-04-30
    82.1
    SOTA
    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context (arXiv:2004.15016)
  • Entity Linking on WiC-TSV
    Task 3 Accuracy: all · 2020-04-30
    85.3
    SOTA
    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context (arXiv:2004.15016)
  • Entity Linking on WiC-TSV
    Task 3 Accuracy: domain specific · 2020-04-30
    89.2
    SOTA
    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context (arXiv:2004.15016)
  • Entity Linking on WiC-TSV
    Task 3 Accuracy: general purpose · 2020-04-30
    82.1
    SOTA
    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context (arXiv:2004.15016)
  • Question Answering on Geometry3K
    Accuracy (%) · 2021-05-10
    56.9
    best: 90.9 (Human Expert)
    Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning (arXiv:2105.04165)
  • Image Captioning on nocaps near-domain
    B1
    77.05
    best: 88.9 (GIT2, Single Model)
  • Image Captioning on nocaps near-domain
    B2
    56.97
    best: 75.86 (GIT2, Single Model)
  • Image Captioning on nocaps near-domain
    B3
    36.84
    best: 58.99 (PaLI)
  • Image Captioning on nocaps near-domain
    B4
    19.85
    best: 39.98 (PaLI)
  • Image Captioning on nocaps near-domain
    CIDEr
    84.58
    best: 125.51 (GIT2, Single Model)
  • Image Captioning on nocaps near-domain
    METEOR
    28.42
    best: 33.47 (PaLI)
  • Image Captioning on nocaps near-domain
    ROUGE-L
    53.06
    best: 63.99 (PaLI)
  • Image Captioning on nocaps near-domain
    SPICE
    14.72
    best: 16.11 (GIT2, Single Model)
  • Image Captioning on nocaps entire
    B1
    76.64
    best: 88.1 (GIT, Single Model)
  • Image Captioning on nocaps entire
    B2
    56.46
    best: 74.81 (GIT, Single Model)
  • Image Captioning on nocaps entire
    B3
    36.37
    best: 57.68 (GIT, Single Model)
  • Image Captioning on nocaps entire
    B4
    19.48
    best: 37.71 (CoCa - Google Brain)
  • Image Captioning on nocaps entire
    CIDEr
    85.34
    best: 126.8 (Lyrics)
  • Image Captioning on nocaps entire
    METEOR
    28.15
    best: 32.5 (GIT, Single Model)
  • Image Captioning on nocaps entire
    ROUGE-L
    52.83
    best: 63.12 (GIT, Single Model)
  • Image Captioning on nocaps entire
    SPICE
    14.67
    best: 15.94 (GIT, Single Model)
  • Image Captioning on nocaps out-of-domain
    B1
    74.84
    best: 86.28 (PaLI)
  • Image Captioning on nocaps out-of-domain
    B2
    53.9
    best: 71.28 (GIT, Single Model)
  • Image Captioning on nocaps out-of-domain
    B3
    33.51
    best: 52.66 (GIT, Single Model)
  • Image Captioning on nocaps out-of-domain
    B4
    16.6
    best: 32 (PaLI)
  • Image Captioning on nocaps out-of-domain
    CIDEr
    91.62
    best: 126.67 (PaLI)
  • Image Captioning on nocaps out-of-domain
    METEOR
    26.83
    best: 30.99 (PaLI)
  • Image Captioning on nocaps out-of-domain
    ROUGE-L
    51.5
    best: 61.35 (PaLI)
  • Image Captioning on nocaps out-of-domain
    SPICE
    14.21
    best: 15.7 (GIT, Single Model)
  • Image Captioning on nocaps-XD in-domain
    B1
    76.89
    best: 88.86 (GIT2)
  • Image Captioning on nocaps-XD in-domain
    B2
    57.3
    best: 76.1 (GIT)
  • Image Captioning on nocaps-XD in-domain
    B3
    37.78
    best: 60.53 (GIT)
  • Image Captioning on nocaps-XD in-domain
    B4
    21.49
    best: 41.65 (GIT)
  • Image Captioning on nocaps-XD in-domain
    CIDEr
    80.61
    best: 124.18 (GIT2)
  • Image Captioning on nocaps-XD in-domain
    METEOR
    28.53
    best: 33.83 (GIT2)
  • Image Captioning on nocaps-XD in-domain
    ROUGE-L
    53.47
    best: 64.02 (GIT)
  • Image Captioning on nocaps-XD in-domain
    SPICE
    14.99
    best: 16.36 (GIT2)
  • Image Captioning on nocaps in-domain
    B1
    76.89
    best: 88.86 (GIT2, Single Model)
  • Image Captioning on nocaps in-domain
    B2
    57.3
    best: 76.1 (GIT, Single Model)
  • Image Captioning on nocaps in-domain
    B3
    37.78
    best: 60.53 (GIT, Single Model)
  • Image Captioning on nocaps in-domain
    B4
    21.49
    best: 41.65 (GIT, Single Model)
  • Image Captioning on nocaps in-domain
    CIDEr
    80.61
    best: 149.1 (PaLI)
  • Image Captioning on nocaps in-domain
    METEOR
    28.53
    best: 34.22 (PaLI)
  • Image Captioning on nocaps in-domain
    ROUGE-L
    53.47
    best: 64.39 (PaLI)
  • Image Captioning on nocaps in-domain
    SPICE
    14.99
    best: 16.36 (GIT2, Single Model)
  • Image Captioning on nocaps-XD near-domain
    B1
    77.05
    best: 88.9 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    B2
    56.97
    best: 75.86 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    B3
    36.84
    best: 58.9 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    B4
    19.85
    best: 38.95 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    CIDEr
    84.58
    best: 125.51 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    METEOR
    28.42
    best: 32.95 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    ROUGE-L
    53.06
    best: 63.66 (GIT2)
  • Image Captioning on nocaps-XD near-domain
    SPICE
    14.72
    best: 16.11 (GIT2)
  • Image Captioning on nocaps-XD entire
    B1
    76.64
    best: 88.43 (GIT2)
  • Image Captioning on nocaps-XD entire
    B2
    56.46
    best: 75.02 (GIT2)
  • Image Captioning on nocaps-XD entire
    B3
    36.37
    best: 57.87 (GIT2)
  • Image Captioning on nocaps-XD entire
    B4
    19.48
    best: 37.65 (GIT2)
  • Image Captioning on nocaps-XD entire
    CIDEr
    85.34
    best: 124.77 (GIT2)
  • Image Captioning on nocaps-XD entire
    METEOR
    28.15
    best: 32.56 (GIT2)
  • Image Captioning on nocaps-XD entire
    ROUGE-L
    52.83
    best: 63.19 (GIT2)
  • Image Captioning on nocaps-XD entire
    SPICE
    14.67
    best: 16.06 (GIT2)
  • Image Captioning on nocaps-XD out-of-domain
    B1
    74.84
    best: 86.28 (GIT2)
  • Image Captioning on nocaps-XD out-of-domain
    B2
    53.9
    best: 71.28 (GIT)
  • Image Captioning on nocaps-XD out-of-domain
    B3
    33.51
    best: 52.66 (GIT)
  • Image Captioning on nocaps-XD out-of-domain
    B4
    16.6
    best: 30.15 (GIT2)
  • Image Captioning on nocaps-XD out-of-domain
    CIDEr
    91.62
    best: 122.27 (GIT2)
  • Image Captioning on nocaps-XD out-of-domain
    METEOR
    26.83
    best: 30.45 (GIT)
  • Image Captioning on nocaps-XD out-of-domain
    ROUGE-L
    51.5
    best: 60.96 (GIT)
  • Image Captioning on nocaps-XD out-of-domain
    SPICE
    14.21
    best: 15.7 (GIT)

Reasoning · 7 results

  • Video Question Answering on CinePile: A Long Video Question Answering Dataset and Benchmark
    Accuracy · 2024-05-14
    86
    SOTA
    CinePile: A Long Video Question Answering Dataset and Benchmark (arXiv:2405.08813)
  • Visual Reasoning on Bongard-OpenWorld
    2-Class Accuracy · 2023-10-16
    91
    best: 93.6 (Gemini-2.0 + CA)
    SOTA
    Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World (arXiv:2310.10207)
  • Visual Reasoning on GD-VCR
    Accuracy · 2021-09-14
    88.84
    SOTA
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (arXiv:2109.06860)
  • Video Question Answering on IntentQA
    Accuracy
    78.5
    best: 83.4 (VideoChat2_HD_mistral)
  • Video Question Answering on IntentQA
    CH
    80.2
    best: 90 (VideoChat2_HD_mistral)
  • Video Question Answering on IntentQA
    CW
    77.8
    best: 84 (VideoChat2_HD_mistral)
  • Video Question Answering on IntentQA
    TP&TN
    79.1

Miscellaneous · 2 results

  • Intent Recognition on MIntRec
    Accuracy (20 classes) · 2022-09-09
    85.51
    SOTA
    MIntRec: A New Dataset for Multimodal Intent Recognition (arXiv:2209.04355)
  • Intent Recognition on MIntRec
    Accuracy (Binary) · 2022-09-09
    94.72
    SOTA
    MIntRec: A New Dataset for Multimodal Intent Recognition (arXiv:2209.04355)

Computer Vision · 1 result

  • Spatial Relation Recognition on Rel3D
    Acc · 2020-12-03
    94.25
    SOTA
    Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D (arXiv:2012.01634)

Knowledge Base · 1 result

  • Mathematical Question Answering on Geometry3K
    Accuracy (%) · 2021-05-10
    56.9
    best: 90.9 (Human Expert)
    Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning (arXiv:2105.04165)