Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GPT-2

Reported on 57 benchmarks across 12 tasks · 2 papers

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
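
The effect of exact-name matching can be illustrated with a minimal sketch. The record layout and the `group_by_exact_name` helper are hypothetical and are not the site's actual implementation; the two arXiv identifiers are the papers cited on this page, while the other rows are made-up placeholders.

```python
from collections import defaultdict

# Hypothetical rows of (model name, source paper). Only the two arXiv IDs
# come from this page; the remaining rows are illustrative placeholders.
records = [
    ("GPT-2", "arXiv:2109.14076"),
    ("GPT-2", "arXiv:2308.03235"),
    ("GPT-2 (fine-tuned)", "placeholder-paper-a"),
    ("gpt-2", "placeholder-paper-b"),
]


def group_by_exact_name(rows):
    """Group rows by model name using exact, case-sensitive string equality."""
    groups = defaultdict(list)
    for name, paper in rows:
        groups[name].append(paper)
    return dict(groups)


groups = group_by_exact_name(records)
print(groups["GPT-2"])
print(sorted(groups))
```

Under this assumption, results from both papers land under the single name "GPT-2" even if the papers evaluated different variants, while differently spelled names such as "gpt-2" are kept apart.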

Natural Language Processing · 31 results

Text Classification on RAFT
Over · 2021-09-28
0.498
best: 0.95 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
ADE · 2021-09-28
0.6
best: 0.83 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
Avg · 2021-09-28
0.458
best: 0.758 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
B77 · 2021-09-28
0.121
best: 0.695 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
NIS · 2021-09-28
0.561
best: 0.857 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
OSE · 2021-09-28
0.245
best: 0.676 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
SOT · 2021-09-28
0.38
best: 0.915 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
SRI · 2021-09-28
0.492
best: 0.516 (GPT-3)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
TAI · 2021-09-28
0.612
best: 0.736 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
TC · 2021-09-28
0.723
best: 0.897 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
TEH · 2021-09-28
0.311
best: 0.722 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Text Classification on RAFT
ToS · 2021-09-28
0.498
best: 0.75 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
Over · 2021-09-28
0.498
best: 0.95 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
ADE · 2021-09-28
0.6
best: 0.83 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
Avg · 2021-09-28
0.458
best: 0.758 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
B77 · 2021-09-28
0.121
best: 0.695 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
NIS · 2021-09-28
0.561
best: 0.857 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
OSE · 2021-09-28
0.245
best: 0.676 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
SOT · 2021-09-28
0.38
best: 0.915 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
SRI · 2021-09-28
0.492
best: 0.516 (GPT-3)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
TAI · 2021-09-28
0.612
best: 0.736 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
TC · 2021-09-28
0.723
best: 0.897 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
TEH · 2021-09-28
0.311
best: 0.722 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Few-Shot Text Classification on RAFT
ToS · 2021-09-28
0.498
best: 0.75 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Cross-Lingual on Reddit Ideological and Extreme Bias Dataset
weighted-F1 score
76.43
best: 79.1 (SVM)
Text Classification on ThreatGram 101 - Extreme Telegram Data
weighted-F1 score
66.2
Cross-Lingual Document Classification on Reddit Ideological and Extreme Bias Dataset
weighted-F1 score
76.43
best: 79.1 (SVM)
Document Summarization on CNN / Daily Mail
ROUGE-1 · uses extra data
29.34
best: 48.18 (Scrambled code + broken (alter))
Document Summarization on CNN / Daily Mail
ROUGE-2 · uses extra data
8.27
best: 22.55 (PEGASUS + SummaReranker)
Document Summarization on CNN / Daily Mail
ROUGE-L · uses extra data
26.58
best: 45.35 (Scrambled code + broken (alter))
Response Generation on SIMMC2.0
BLEU
19.2
best: 34.1 (PaCE)

Methodology · 17 results

Data Mining on IMDb Movie Reviews
Accuracy · 2023-08-07
54.5
best: 95.6 (ELECTRA)
Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining arXiv:2308.03235
Data Mining on IMDb Movie Reviews
F1 · 2023-08-07
52.9
best: 95.6 (ELECTRA)
Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining arXiv:2308.03235
Interpretable Machine Learning on IMDb Movie Reviews
Accuracy · 2023-08-07
54.5
best: 95.6 (ELECTRA)
Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining arXiv:2308.03235
Interpretable Machine Learning on IMDb Movie Reviews
F1 · 2023-08-07
52.9
best: 95.6 (ELECTRA)
Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining arXiv:2308.03235
Classification on RAFT
Over · 2021-09-28
0.498
best: 0.95 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
ADE · 2021-09-28
0.6
best: 0.83 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
Avg · 2021-09-28
0.458
best: 0.758 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
B77 · 2021-09-28
0.121
best: 0.695 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
NIS · 2021-09-28
0.561
best: 0.857 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
OSE · 2021-09-28
0.245
best: 0.676 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
SOT · 2021-09-28
0.38
best: 0.915 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
SRI · 2021-09-28
0.492
best: 0.516 (GPT-3)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
TAI · 2021-09-28
0.612
best: 0.736 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
TC · 2021-09-28
0.723
best: 0.897 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
TEH · 2021-09-28
0.311
best: 0.722 (Human (crowdsourced))
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on RAFT
ToS · 2021-09-28
0.498
best: 0.75 (T-Few)
RAFT: A Real-World Few-Shot Text Classification Benchmark arXiv:2109.14076
Classification on ThreatGram 101 - Extreme Telegram Data
weighted-F1 score
66.2

Medical · 4 results

Language Modelling on Penn Treebank (Word Level)
Test perplexity · uses extra data
35.76
best: 20.5 (GPT-3 (Zero-Shot))
Language Modelling on Text8
Bit per Character (BPC) · uses extra data
0.98
best: 1.63 (td-LSTM (Zhang et al., 2016))
Language Modelling on One Billion Word
PPL · uses extra data
42.16
best: 20.09 (MDLM (AR baseline))
Language Modelling on WikiText-2
Test perplexity · uses extra data
18.34
best: 8.21 (SparseGPT (175B, 50% Sparsity))

Knowledge Base · 3 results

Text Summarization on CNN / Daily Mail
ROUGE-1 · uses extra data
29.34
best: 48.18 (Scrambled code + broken (alter))
Text Summarization on CNN / Daily Mail
ROUGE-2 · uses extra data
8.27
best: 24.02 (Pegasus)
Text Summarization on CNN / Daily Mail
ROUGE-L · uses extra data
26.58
best: 45.35 (Scrambled code + broken (alter))

Speech · 2 results

Dialogue on SIMMC2.0
Act F1
94.5
best: 97.1 (PaCE)
Dialogue on SIMMC2.0
Slot F1
81.7
best: 88.3 (BART-large)