Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Chinchilla-70B (few-shot, k=5)

Reported on 37 benchmarks across 16 tasks · 1 paper · 31 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
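To illustrate what exact-name matching implies, here is a minimal sketch. It assumes results are stored as (model name, benchmark, score) records; that layout and the function name `group_by_exact_name` are hypothetical and not taken from the site's code. The scores and name strings are copied from this page, including the two spellings "PaLM 2 (few-shot, k=3, CoT)" and "PaLM 2(few-shot, k=3, CoT)", which an exact string match treats as separate models.

```python
from collections import defaultdict

# Records copied from this page. The tuple layout is an assumption for illustration.
RESULTS = [
    ("PaLM 2 (few-shot, k=3, CoT)", "BIG-bench (Movie Recommendation)", 94.4),
    ("PaLM 2(few-shot, k=3, CoT)", "BIG-bench (Sports Understanding)", 98.0),
    ("Chinchilla-70B (few-shot, k=5)", "BIG-bench (Movie Recommendation)", 75.6),
    ("Chinchilla-70B (few-shot, k=5)", "BIG-bench (Sports Understanding)", 71.0),
]


def group_by_exact_name(results):
    """Group records by the unmodified model-name string (hypothetical helper)."""
    grouped = defaultdict(list)
    for model, benchmark, score in results:
        # No normalization: a one-character difference yields a separate key.
        grouped[model].append((benchmark, score))
    return dict(grouped)


pages = group_by_exact_name(RESULTS)
for name, rows in pages.items():
    print(f"{name!r}: {len(rows)} result(s)")
```

Under these assumptions the four records fall into three groups rather than two, because the two PaLM 2 spellings differ by a single space.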

Natural Language Processing · 34 results

  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    78
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    94
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Question Answering on BIG-bench (Novel Concepts)
    Accuracy · 2022-03-29
    65.6
    best: 71.9 (PaLM-540B (few-shot, k=5))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Question Answering on BIG-bench (Movie Recommendation)
    Accuracy · 2022-03-29
    75.6
    best: 94.4 (PaLM 2 (few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Question Answering on BIG-bench (Navigate)
    Accuracy · 2022-03-29
    52.6
    best: 91.2 (PaLM 2 (few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Question Answering on BIG-bench (Ruin Names)
    Accuracy · 2022-03-29
    47.1
    best: 90 (PaLM 2 (few-shot, k=3, Direct))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Question Answering on BIG-bench (Hyperbaton)
    Accuracy · 2022-03-29
    54.2
    best: 92 (Bloomberg GPT (few-shot, k=3))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Causal Judgment)
    Accuracy · 2022-03-29
    57.4
    best: 62 (PaLM 2 (few-shot, k=3, Direct))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Disambiguation QA)
    Accuracy · 2022-03-29
    54.7
    best: 78.8 (PaLM 2 (few-shot, k=3, Direct))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Sports Understanding)
    Accuracy · 2022-03-29
    71
    best: 98 (PaLM 2(few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Winowhy)
    Accuracy · 2022-03-29
    62.5
    best: 65.9 (PaLM-540B (few-shot, k=5))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Known Unknowns)
    Accuracy · 2022-03-29
    65.2
    best: 73.9 (PaLM-540B (few-shot, k=5))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Date Understanding)
    Accuracy · 2022-03-29
    52.3
    best: 91.2 (PaLM 2 (few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench (Logical Sequence)
    Accuracy · 2022-03-29
    64.1
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    85.7
    best: 86.86 (Orca 2-13B)
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Word Sense Disambiguation on BIG-bench (Anachronisms)
    Accuracy · 2022-03-29
    69.1
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Sarcasm Detection on BIG-bench (SNARKS)
    Accuracy · 2022-03-29
    58.6
    best: 84.8 (PaLM 2(few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    92.8
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    49.4
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    52.6
    best: 78
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    75
    best: 78
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    82.4
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    69
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    63.3
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    53.1
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Reading Comprehension on BIG-bench
    Accuracy · 2022-03-29
    54.5
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    13.1
    best: 86.86 (Orca 2-13B)
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    67.7
    best: 86.86 (Orca 2-13B)
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    68.8
    best: 86.86 (Orca 2-13B)
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    47.6
    best: 63.6 (Gopher-280B (few-shot, k=5))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    75
    best: 86.86 (Orca 2-13B)
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    73
    best: 86.86 (Orca 2-13B)
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Common Sense Reasoning on BIG-bench
    Accuracy · 2022-03-29
    60.3
    best: 63.6 (Gopher-280B (few-shot, k=5))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Emotional Intelligence on BIG-bench
    Accuracy · 2022-03-29
    66.2
    best: 83.1 (Gopher-280B (few-shot, k=5))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556

Methodology · 15 results

  • Logical Reasoning on BIG-bench (Penguins In A Table)
    Accuracy · 2022-03-29
    48.7
    best: 84.9 (PaLM 2 (few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench (Logic Grid Puzzle)
    Accuracy · 2022-03-29
    44
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench (Temporal Sequences)
    Accuracy · 2022-03-29
    32
    best: 100 (PaLM 2 (few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench (Formal Fallacies Syllogisms Negation)
    Accuracy · 2022-03-29
    52.1
    best: 64.8 (PaLM 2 (few-shot, k=3, Direct))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench (Reasoning About Colored Objects)
    Accuracy · 2022-03-29
    59.7
    best: 91.2 (PaLM 2 (few-shot, k=3, CoT))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench (Logical Fallacy Detection)
    Accuracy · 2022-03-29
    72.1
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench (StrategyQA)
    Accuracy · 2022-03-29
    68.3
    best: 73.9 (PaLM-540B (few-shot, k=5))
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    94
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    79
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    60.6
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    93.1
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    67.1
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    17.6
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    56.2
    best: 59.1 (Gopher-280B (few-shot, k=5))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Logical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    49.9
    best: 94
    Training Compute-Optimal Large Language Models · arXiv:2203.15556

Miscellaneous · 7 results

  • General Knowledge on BIG-bench
    Accuracy · 2022-03-29
    94.3
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Human Organs Senses Multiple Choice on BIG-bench
    Accuracy · 2022-03-29
    85.7
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Intent Recognition on BIG-bench
    Accuracy · 2022-03-29
    92.8
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Ethics on BIG-bench
    Accuracy · 2022-03-29
    57.3
    best: 70 (Gopher-280B (few-shot, k=5))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Fact Checking on BIG-bench
    Accuracy · 2022-03-29
    65.3
    best: 77.5 (Gopher-280B (few-shot, k=10))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Fact Checking on BIG-bench
    Accuracy · 2022-03-29
    71.7
    best: 77.5 (Gopher-280B (few-shot, k=10))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • General Knowledge on BIG-bench
    Accuracy · 2022-03-29
    87
    best: 94.3
    Training Compute-Optimal Large Language Models · arXiv:2203.15556

Reasoning · 3 results

  • Analogical Similarity on BIG-bench
    Accuracy · 2022-03-29
    38.1
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Identify Odd Metaphor on BIG-bench
    Accuracy · 2022-03-29
    68.8
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556
  • Odd One Out on BIG-bench
    Accuracy · 2022-03-29
    70.9
    SOTA
    Training Compute-Optimal Large Language Models · arXiv:2203.15556

Knowledge Base · 1 result

  • Mathematical Reasoning on BIG-bench
    Accuracy · 2022-03-29
    47.3
    best: 57.6 (Gopher-280B (few-shot, k=5))
    Training Compute-Optimal Large Language Models · arXiv:2203.15556