Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BERT

Reported on 139 benchmarks across 45 tasks · 24 papers · 60 SOTA results

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
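
The exact-name matching described in the note above can be illustrated with a short sketch. It assumes a hypothetical `Result` record and a hypothetical `results_for_model` helper that filters by plain, case-sensitive string equality. It is not the site's actual implementation. The sample rows are copied from entries on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical schema, for illustration only)."""
    model: str    # model name exactly as reported by the paper
    task: str
    dataset: str
    metric: str
    value: float


def results_for_model(results: list[Result], name: str) -> list[Result]:
    """Keep only rows whose model name equals `name` exactly.

    Plain string equality is case-sensitive and ignores what the name
    refers to: every checkpoint a paper labels "BERT" lands in the same
    list, while differently named variants such as "BERT-FP" are excluded.
    """
    return [r for r in results if r.model == name]


# Sample rows copied from entries on this page.
sample = [
    Result("BERT", "Text Classification", "TREC-10", "Accuracy", 99.4),
    Result("BERT", "Relation Extraction", "TACRED", "F1", 66.0),
    Result("BERT-FP", "Conversational Response Selection", "RRS", "MAP", 0.702),
]

bert_rows = results_for_model(sample, "BERT")
for r in bert_rows:
    print(f"{r.task} on {r.dataset}: {r.metric} = {r.value}")
```

Because the key is only the name string, this kind of matching cannot tell apart two papers that use "BERT" for different model sizes or fine-tuning setups.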

Natural Language Processing (88 results)

  • Text Classification on TREC-10
    Accuracy · 2022-11-30
    99.4
    SOTA
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Text Classification on NICE-45
    Accuracy · 2022-11-30
    72.79
    SOTA
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Czech Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.22
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Vietnamese Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    98.53
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Romanian Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    98.64
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Slovak Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.32
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Latvian Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    98.63
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Irish Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    98.88
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Hungarian Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.41
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • French Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.71
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Turkish Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    98.95
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Spanish Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.62
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Croatian Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.73
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408
  • Relation Extraction on DDRel
    Pair-level 13-class Acc · 2020-12-04
    39.73
    SOTA
    DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues · arXiv:2012.02553
  • Relation Extraction on DDRel
    Pair-level 4-class Acc · 2020-12-04
    58.13
    SOTA
    DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues · arXiv:2012.02553
  • Relation Extraction on DDRel
    Pair-level 6-class Acc · 2020-12-04
    42.33
    SOTA
    DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues · arXiv:2012.02553
  • Relation Extraction on DDRel
    Session-level 13-class Acc · 2020-12-04
    39.4
    SOTA
    DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues · arXiv:2012.02553
  • Relation Extraction on DDRel
    Session-level 4-class Acc · 2020-12-04
    47.1
    SOTA
    DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues · arXiv:2012.02553
  • Relation Extraction on DDRel
    Session-level 6-class Acc · 2020-12-04
    41.87
    SOTA
    DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues · arXiv:2012.02553
  • Abuse Detection on Ethos Binary
    F1-score · 2020-06-11
    0.7883
    best: 0.7971 (BiLSTM + static BE)
    SOTA
    ETHOS: an Online Hate Speech Detection Dataset · arXiv:2006.08328
  • Abuse Detection on Ethos Binary
    Precision · 2020-06-11
    79.17
    SOTA
    ETHOS: an Online Hate Speech Detection Dataset · arXiv:2006.08328
  • Hate Speech Detection on Ethos Binary
    F1-score · 2020-06-11
    0.7883
    best: 0.7971 (BiLSTM + static BE)
    SOTA
    ETHOS: an Online Hate Speech Detection Dataset · arXiv:2006.08328
  • Hate Speech Detection on Ethos Binary
    Precision · 2020-06-11
    79.17
    SOTA
    ETHOS: an Online Hate Speech Detection Dataset · arXiv:2006.08328
  • Reading Comprehension on CrowdSource QA
    MSE · 2020-02-24
    0.046
    SOTA
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Question Answering on CrowdSource QA
    MSE · 2020-02-24
    0.046
    SOTA
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Common Sense Reasoning on CrowdSource QA
    MSE · 2020-02-24
    0.046
    SOTA
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Conversational Response Selection on Douban
    R10@5 · 2019-08-13
    0.828
    best: 0.877 (SEMSOL(W/o utterances))
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS
    MAP · 2019-08-13
    0.625
    best: 0.702 (BERT-FP)
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS
    MRR · 2019-08-13
    0.639
    best: 0.715 (SA-BERT+BERT-FP)
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS
    P@1 · 2019-08-13
    0.453
    best: 0.555 (SA-BERT+BERT-FP)
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS
    R10@1 · 2019-08-13
    0.404
    best: 0.497 (SA-BERT+BERT-FP)
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS
    R10@2 · 2019-08-13
    0.606
    best: 0.708 (BERT-FP)
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS
    R10@5 · 2019-08-13
    0.875
    best: 0.931 (SA-BERT+BERT-FP)
    SOTA
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Text Classification on WeeBit (Readability Assessment)
    Accuracy (5-fold) · 2019-07-26
    0.857
    best: 0.927 (BERT-FP-LBL)
    SOTA
    Supervised and Unsupervised Neural Approaches to Text Readability · arXiv:1907.11779
  • Relation Extraction on Discovery
    1:1 Accuracy · 2019-03-28
    20.6
    SOTA
    Mining Discourse Markers for Unsupervised Sentence Representation Learning · arXiv:1903.11850
  • Relation Classification on Discovery
    1:1 Accuracy · 2019-03-28
    20.6
    SOTA
    Mining Discourse Markers for Unsupervised Sentence Representation Learning · arXiv:1903.11850
  • Reading Comprehension on PhotoChat
    F1 · 2018-10-11
    53.2
    best: 63.8 (PaCE)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Reading Comprehension on PhotoChat
    Precision · 2018-10-11
    56.1
    best: 63.3 (PaCE)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Reading Comprehension on PhotoChat
    Recall · 2018-10-11
    50.6
    best: 68 (PaCE)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Question Answering on MultiTQ
    Hits@1 · 2018-10-11
    8.3
    best: 79.7 (Prog-TQA)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Question Answering on MultiTQ
    Hits@10 · 2018-10-11
    48.2
    best: 91 (Prog-TQA)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Text Classification on UK Key Stage Readability
    F1 · 2024-11-26
    75
    best: 99.6 (ELECTRA + ANN)
    What Differentiates Educational Literature? A Multimodal Fusion Approach of Transformers and Computational Linguistics · arXiv:2411.17593
  • Text Classification on R8
    Accuracy · 2022-11-30
    98.171
    best: 98.451 (DeBERTa)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Text Classification on Searchsnippets
    Accuracy · 2022-11-30
    88.2
    best: 89.69 (DistilBERT)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Text Classification on SST-2
    Accuracy · 2022-11-30
    91.37
    best: 94.78 (DeBERTa)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Text Classification on MR
    Accuracy · 2022-11-30
    86.94
    best: 93.3 (VLAWE)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Text Classification on Twitter
    Accuracy · 2022-11-30
    99.96
    best: 99.97 (ERNIE 2.0)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Semantic Textual Similarity on MTEB
    Spearman Correlation · 2022-10-13
    54.36
    best: 84.54 (AnglE-UAE)
    MTEB: Massive Text Embedding Benchmark · arXiv:2210.07316
  • Text Clustering on MTEB
    V-Measure · 2022-10-13
    30.12
    best: 43.71 (ST5-XXL)
    MTEB: Massive Text Embedding Benchmark · arXiv:2210.07316
  • Text Classification on MTEB
    Accuracy · 2022-10-13
    61.66
    best: 73.42 (ST5-XXL)
    MTEB: Massive Text Embedding Benchmark · arXiv:2210.07316
  • Relation Extraction on SemEval-2010 Task-8
    F1 · 2022-08-20
    89.4
    best: 91.9 (SP)
    SPOT: Knowledge-Enhanced Language Representations for Information Extraction · arXiv:2208.09625
  • Text Classification on GLUE SST2
    Accuracy · 2022-05-15
    92.0872
    best: 94.38 (TRANS-BLSTM)
    Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks · arXiv:2205.07260
  • Visual Question Answering (VQA) on QLEVR
    Overall Accuracy · 2022-05-06
    65.8
    best: 66.5 (MAC)
    QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning · arXiv:2205.03075
  • Question Answering on CronQuestions
    Hits@1 · 2021-12-10
    24.3
    best: 97.8 (GenTKGQA)
    TempoQR: Temporal Question Reasoning over Knowledge Graphs · arXiv:2112.05785
  • Natural Language Understanding on LexGLUE
    CaseHOLD · 2021-10-03
    70.7
    best: 75.6 (CaseLaw-BERT)
    LexGLUE: A Benchmark Dataset for Legal Language Understanding in English · arXiv:2110.00976
  • Question Answering on CaseHOLD
    Macro F1 (10-fold) · 2021-04-18
    61.3
    best: 69.5 (Custom Legal-BERT)
    When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset · arXiv:2104.08671
  • Text Classification on Terms of Service
    F1 (10-fold) · 2021-04-18
    72.2
    best: 78.7 (Custom Legal-BERT)
    When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset · arXiv:2104.08671
  • Text Classification on Overruling
    F1 (10-fold) · 2021-04-18
    95.8
    best: 97.4 (Custom Legal-BERT)
    When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset · arXiv:2104.08671
  • Abuse Detection on HatEval
    Macro F1 · 2020-10-23
    0.48
    best: 0.494 (HateBERT)
    HateBERT: Retraining BERT for Abusive Language Detection in English · arXiv:2010.12472
  • Abuse Detection on AbusEval
    Macro F1 · 2020-10-23
    0.724
    best: 0.742 (HateBERT)
    HateBERT: Retraining BERT for Abusive Language Detection in English · arXiv:2010.12472
  • Abuse Detection on OffensEval 2019
    Macro F1 · 2020-10-23
    0.803
    best: 0.805 (HateBERT)
    HateBERT: Retraining BERT for Abusive Language Detection in English · arXiv:2010.12472
  • Hate Speech Detection on HatEval
    Macro F1 · 2020-10-23
    0.48
    best: 0.494 (HateBERT)
    HateBERT: Retraining BERT for Abusive Language Detection in English · arXiv:2010.12472
  • Hate Speech Detection on AbusEval
    Macro F1 · 2020-10-23
    0.724
    best: 0.742 (HateBERT)
    HateBERT: Retraining BERT for Abusive Language Detection in English · arXiv:2010.12472
  • Hate Speech Detection on OffensEval 2019
    Macro F1 · 2020-10-23
    0.803
    best: 0.805 (HateBERT)
    HateBERT: Retraining BERT for Abusive Language Detection in English · arXiv:2010.12472
  • Abuse Detection on Ethos Binary
    Classification Accuracy · 2020-06-11
    0.7664
    best: 0.8015 (BiLSTM + static BE)
    ETHOS: an Online Hate Speech Detection Dataset · arXiv:2006.08328
  • Hate Speech Detection on Ethos Binary
    Classification Accuracy · 2020-06-11
    0.7664
    best: 0.8015 (BiLSTM + static BE)
    ETHOS: an Online Hate Speech Detection Dataset · arXiv:2006.08328
  • Question Answering on CrowdSource QA
    MSE · 2020-02-24
    0.046
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Conversational Response Selection on Douban
    MAP · 2019-08-13
    0.591
    best: 0.651 (SEMSOL(W/o utterances))
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on Douban
    MRR · 2019-08-13
    0.633
    best: 0.688 (Uni-Enc+BERT-FP)
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on Douban
    P@1 · 2019-08-13
    0.454
    best: 0.518 (Uni-Enc+BERT-FP)
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on Douban
    R10@1 · 2019-08-13
    0.28
    best: 0.33 (SEMSOL)
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on Douban
    R10@2 · 2019-08-13
    0.47
    best: 0.557 (Uni-Enc+BERT-FP)
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS Ranking Test
    NDCG@3 · 2019-08-13
    0.625
    best: 0.679 (Poly-encoder)
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Conversational Response Selection on RRS Ranking Test
    NDCG@5 · 2019-08-13
    0.714
    best: 0.765 (Poly-encoder)
    An Effective Domain Adaptive Post-Training Method for BERT in Response Selection · arXiv:1908.04812
  • Relation Extraction on TACRED
    F1 · 2019-05-17
    66
    best: 86.6 (RAG4RE)
    ERNIE: Enhanced Language Representation with Informative Entities · arXiv:1905.07129
  • Relation Classification on TACRED
    F1 · 2019-05-17
    66
    best: 76.8 (DeepStruct multi-task w/ finetune)
    ERNIE: Enhanced Language Representation with Informative Entities · arXiv:1905.07129
  • Question Answering on DROP Test
    F1 · 2019-03-01
    32.7
    best: 88.38 (QDGAT (ensemble))
    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs · arXiv:1903.00161
  • Sarcasm Detection on FigLang 2020 Twitter Dataset
    F1
    0.731
    best: 0.772 (RoBERTa_large (Context-Response))
  • Entity Resolution on WDC Computers-xlarge
    F1 (%) · uses extra data
    97.37
    best: 98.33 (RoBERTa-SupCon)
  • Entity Resolution on WDC Computers-small
    F1 (%) · uses extra data
    96.53
  • Text Classification on TRAC2-English. Task2.
    F1
    0.871585052
  • Text Classification on TRAC2-Benghali. Task 2.
    F1
    0.929702403
  • Abstractive Text Summarization on EDUsum
    ROUGE-1
    62.37
    best: 64.48 (GP_Step_Sim)
  • Abstractive Text Summarization on EDUsum
    ROUGE-2
    50.7
    best: 52.7 (GP_Step_Sim)
  • Abstractive Text Summarization on EDUsum
    ROUGE-L
    59.4
    best: 61.91 (GP_Step_Sim)
  • Spam detection on Traditional and Context-specific Spam Twitter
    Avg F1 · uses extra data
    0.8553
    best: 0.9079
  • Spam detection on Traditional and Context-specific Spam Twitter
    Avg F1 · uses extra data
    0.9079
  • Spam detection on Traditional and Context-specific Spam Twitter
    Avg F1 · uses extra data
    0.8408
    best: 0.9079

Methodology (25 results)

  • Classification on TREC-10
    Accuracy · 2022-11-30
    99.4
    SOTA
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Classification on NICE-45
    Accuracy · 2022-11-30
    72.79
    SOTA
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • How To Refund A Wrong Transaction In Phonepe on CrowdSource QA
    MSE · 2020-02-24
    0.046
    SOTA
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Classification on WeeBit (Readability Assessment)
    Accuracy (5-fold) · 2019-07-26
    0.857
    best: 0.927 (BERT-FP-LBL)
    SOTA
    Supervised and Unsupervised Neural Approaches to Text Readability · arXiv:1907.11779
  • Classification on UK Key Stage Readability
    F1 · 2024-11-26
    75
    best: 99.6 (ELECTRA + ANN)
    What Differentiates Educational Literature? A Multimodal Fusion Approach of Transformers and Computational Linguistics · arXiv:2411.17593
  • Data Mining on IMDb Movie Reviews
    Accuracy · 2023-08-07
    94
    best: 95.6 (ELECTRA)
    Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining · arXiv:2308.03235
  • Data Mining on IMDb Movie Reviews
    F1 · 2023-08-07
    94.1
    best: 95.6 (ELECTRA)
    Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining · arXiv:2308.03235
  • Interpretable Machine Learning on IMDb Movie Reviews
    Accuracy · 2023-08-07
    94
    best: 95.6 (ELECTRA)
    Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining · arXiv:2308.03235
  • Interpretable Machine Learning on IMDb Movie Reviews
    F1 · 2023-08-07
    94.1
    best: 95.6 (ELECTRA)
    Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining · arXiv:2308.03235
  • Classification on R8
    Accuracy · 2022-11-30
    98.171
    best: 98.451 (DeBERTa)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Classification on Searchsnippets
    Accuracy · 2022-11-30
    88.2
    best: 89.69 (DistilBERT)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Classification on SST-2
    Accuracy · 2022-11-30
    91.37
    best: 94.78 (DeBERTa)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Classification on MR
    Accuracy · 2022-11-30
    86.94
    best: 93.3 (VLAWE)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Classification on Twitter
    Accuracy · 2022-11-30
    99.96
    best: 99.97 (ERNIE 2.0)
    Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets · arXiv:2211.16878
  • Retrieval on MTEB
    nDCG@10 · 2022-10-13
    10.59
    best: 50.25 (SGPT-5.8B-msmarco)
    MTEB: Massive Text Embedding Benchmark · arXiv:2210.07316
  • Classification on MTEB
    Accuracy · 2022-10-13
    61.66
    best: 73.42 (ST5-XXL)
    MTEB: Massive Text Embedding Benchmark · arXiv:2210.07316
  • Anomaly Detection on AnoShift
    ROC-AUC FAR · 2022-06-30
    28.15
    best: 62.5 (ACR-NTL (zero-shot, test anomaly ratio=1%))
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Anomaly Detection on AnoShift
    ROC-AUC IID · 2022-06-30
    84.54
    best: 92.67 (deepSVDD)
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Anomaly Detection on AnoShift
    ROC-AUC NEAR · 2022-06-30
    86.05
    best: 87 (deepSVDD)
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Anomaly Detection on AnoShift
    ROC-AUC-ID (In-Distribution setup) · 2022-06-30
    79.62
    best: 88.24 (deepSVDD)
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Classification on GLUE SST2
    Accuracy · 2022-05-15
    92.0872
    best: 94.38 (TRANS-BLSTM)
    Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks · arXiv:2205.07260
  • Classification on Terms of Service
    F1 (10-fold) · 2021-04-18
    72.2
    best: 78.7 (Custom Legal-BERT)
    When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset · arXiv:2104.08671
  • Classification on Overruling
    F1 (10-fold) · 2021-04-18
    95.8
    best: 97.4 (Custom Legal-BERT)
    When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset · arXiv:2104.08671
  • Classification on TRAC2-English. Task2.
    F1
    0.871585052
  • Classification on TRAC2-Benghali. Task 2.
    F1
    0.929702403

Computer Code (8 results)

  • Program Synthesis on ManyTypes4TypeScript
    Average Accuracy · 2018-10-11
    57.52
    best: 71.27 (CodeTIDAL5)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Program Synthesis on ManyTypes4TypeScript
    Average F1 · 2018-10-11
    54.1
    best: 60.57 (GraphCodeBERT)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Program Synthesis on ManyTypes4TypeScript
    Average Precision · 2018-10-11
    54.18
    best: 60.06 (GraphCodeBERT)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Program Synthesis on ManyTypes4TypeScript
    Average Recall · 2018-10-11
    54.02
    best: 61.08 (GraphCodeBERT)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Type prediction on ManyTypes4TypeScript
    Average Accuracy · 2018-10-11
    57.52
    best: 71.27 (CodeTIDAL5)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Type prediction on ManyTypes4TypeScript
    Average F1 · 2018-10-11
    54.1
    best: 60.57 (GraphCodeBERT)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Type prediction on ManyTypes4TypeScript
    Average Precision · 2018-10-11
    54.18
    best: 60.06 (GraphCodeBERT)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Type prediction on ManyTypes4TypeScript
    Average Recall · 2018-10-11
    54.02
    best: 61.08 (GraphCodeBERT)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805

Knowledge Base (7 results)

  • 2D Human Pose Estimation on CrowdSource QA
    MSE · 2020-02-24
    0.046
    SOTA
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Text Summarization on MTEB
    Spearman Correlation · 2022-10-13
    29.82
    best: 31.57 (MPNet-multilingual)
    MTEB: Massive Text Embedding Benchmark · arXiv:2210.07316
  • Data Integration on WDC Computers-xlarge
    F1 (%) · uses extra data
    97.37
    best: 98.33 (RoBERTa-SupCon)
  • Data Integration on WDC Computers-small
    F1 (%) · uses extra data
    96.53
  • Text Summarization on EDUsum
    ROUGE-1
    62.37
    best: 64.48 (GP_Step_Sim)
  • Text Summarization on EDUsum
    ROUGE-2
    50.7
    best: 52.7 (GP_Step_Sim)
  • Text Summarization on EDUsum
    ROUGE-L
    59.4
    best: 61.91 (GP_Step_Sim)

Audio (6 results)

  • 10-shot image generation on CrowdSource QA
    MSE · 2020-02-24
    0.046
    SOTA
    Predicting Subjective Features of Questions of QA Websites using BERT · arXiv:2002.10107
  • Text-To-Speech Synthesis on Helsinki Prosody Corpus
    Accuracy · 2019-08-06
    83.2
    SOTA
    Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations · arXiv:1908.02262
  • Emotion Recognition on EmoWoz
    Macro F1 · 2021-09-10
    55.8
    best: 65.33 (CD-ERC)
    EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion Recognition in Task-Oriented Dialogue Systems · arXiv:2109.04919
  • Emotion Recognition on EmoWoz
    Macro F1 (w/o Neutral) · 2021-09-10
    50.14
    best: 56.34 (COSMIC)
    EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion Recognition in Task-Oriented Dialogue Systems · arXiv:2109.04919
  • Emotion Recognition on EmoWoz
    Weighted F1 · 2021-09-10
    84.83
    best: 88.33 (ContextBERT)
    EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion Recognition in Task-Oriented Dialogue Systems · arXiv:2109.04919
  • Emotion Recognition on EmoWoz
    Weighted F1 (w/o Neutral) · 2021-09-10
    73.55
    best: 79.67 (ContextBERT)
    EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion Recognition in Task-Oriented Dialogue Systems · arXiv:2109.04919

Graphs (4 results)

  • Unsupervised Anomaly Detection on AnoShift
    ROC-AUC FAR · 2022-06-30
    28.15
    best: 62.5 (ACR-NTL (zero-shot, test anomaly ratio=1%))
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Unsupervised Anomaly Detection on AnoShift
    ROC-AUC IID · 2022-06-30
    84.54
    best: 92.67 (deepSVDD)
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Unsupervised Anomaly Detection on AnoShift
    ROC-AUC NEAR · 2022-06-30
    86.05
    best: 87 (deepSVDD)
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476
  • Unsupervised Anomaly Detection on AnoShift
    ROC-AUC-ID (In-Distribution setup) · 2022-06-30
    79.62
    best: 88.24 (deepSVDD)
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection · arXiv:2206.15476

Miscellaneous (3 results)

  • Intent Recognition on PhotoChat
    F1 · 2018-10-11
    53.2
    best: 63.8 (PaCE)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Intent Recognition on PhotoChat
    Precision · 2018-10-11
    56.1
    best: 63.3 (PaCE)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805
  • Intent Recognition on PhotoChat
    Recall · 2018-10-11
    50.6
    best: 68 (PaCE)
    SOTA
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805

Other (1 result)

  • Polish Text Diacritization on Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems
    Alpha-Word accuracy · 2021-05-24
    99.66
    SOTA
    Diacritics Restoration using BERT with Analysis on Czech language · arXiv:2105.11408