Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


The Unreasonable Effectiveness of Eccentric Automatic Prompts

Rick Battle, Teja Gollapudi

2024-02-09 · GSM8K · Arithmetic Reasoning

Paper · PDF

Abstract

Large Language Models (LLMs) have demonstrated remarkable problem-solving and basic mathematics abilities. However, their efficacy is highly contingent on the formulation of the prompt. This study endeavors to quantify the influence of incorporating "positive thinking" into the system message of the prompt, then compare that to systematic prompt optimization. We assess the performance of 60 combinations of system message snippets, tested with and without Chain of Thought prompting, across three models with parameters ranging from 7 to 70 billion on the GSM8K dataset. Our findings reveal that results do not universally generalize across models. In most instances, the inclusion of "positive thinking" prompts positively affected model performance. Notably, however, Llama2-70B exhibited an exception when not utilizing Chain of Thought, as the optimal system message was found to be none at all. Given the combinatorial complexity, and thus computation time, of experimenting with hand-tuning prompts for large black-box models, we then compared the performance of the best "positive thinking" prompt against the output of systematic prompt optimization. We show that employing an automated prompt optimizer emerges as the most effective method for enhancing performance, even when working with smaller open-source models. Additionally, our findings reveal that the highest-scoring, automatically-optimized prompt exhibits a degree of peculiarity far beyond expectations.

Results

Task                 | Dataset | Metric               | Value | Model
Arithmetic Reasoning | GSM8K   | Accuracy             | 61    | Llama-2 70B (first 100 questions, 4-shot, auto-optimized prompting)
Arithmetic Reasoning | GSM8K   | Parameters (Billion) | 70    | Llama-2 70B (first 100 questions, 4-shot, auto-optimized prompting)
Arithmetic Reasoning | GSM8K   | Accuracy             | 43    | Llama-2 13B (first 100 questions, 4-shot, auto-optimized prompting)
Arithmetic Reasoning | GSM8K   | Parameters (Billion) | 13    | Llama-2 13B (first 100 questions, 4-shot, auto-optimized prompting)
Arithmetic Reasoning | GSM8K   | Accuracy             | 41    | Mistral 7B (first 100 questions, 4-shot, auto-optimized prompting)
Arithmetic Reasoning | GSM8K   | Parameters (Billion) | 7     | Mistral 7B (first 100 questions, 4-shot, auto-optimized prompting)

Related Papers

GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression (2025-07-16)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)
CoRE: Enhancing Metacognition with Label-free Self-evaluation in LRMs (2025-07-08)
DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification (2025-07-08)
any4: Learned 4-bit Numeric Representation for LLMs (2025-07-07)
Activation Steering for Chain-of-Thought Compression (2025-07-07)