Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Ask Me Anything: A simple strategy for prompting language models

Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré

Published: 2022-10-05
Tasks: Question Answering, Coreference Resolution, Natural Language Inference, Prompt Engineering

Abstract

Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt-design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?") tend to outperform those that restrict the model outputs ("John went to the park. Output True or False."). Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input's true label. We find that the prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs. We evaluate AMA across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code here: https://github.com/HazyResearch/ama_prompting
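The aggregation idea in the abstract — several imperfect QA-format prompts each cast a noisy vote for an input's label, and the votes are combined into one prediction — can be sketched as follows. This is a minimal illustration, not the paper's implementation: AMA combines votes with weak supervision (a learned label model over prompt accuracies and dependencies), whereas this sketch substitutes a simple majority vote, and `query_llm` is a hypothetical stand-in for a call to a real model such as GPT-J-6B.

```python
from collections import Counter

# Several imperfect prompt templates that recast the task in the open-ended
# QA format the paper finds effective (templates here are illustrative).
PROMPTS = [
    "Passage: {passage}\nQuestion: {question}\nAnswer:",
    "{passage}\nBased on the passage above, {question}",
    "Read the passage and answer the question.\n{passage}\n{question}\nThe answer is:",
]

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: a real implementation would query an
    # open-source LLM (e.g., GPT-J-6B) here.
    raise NotImplementedError

def ama_predict(passage: str, question: str, llm=query_llm) -> str:
    # Each prompt yields one noisy vote for the input's label.
    votes = [llm(t.format(passage=passage, question=question)) for t in PROMPTS]
    # Majority vote over the noisy votes; the paper instead learns per-prompt
    # accuracies and dependencies with weak supervision to weight the votes.
    return Counter(votes).most_common(1)[0][0]
```

The design point is that no single prompt needs to be "perfect": disagreements among cheap, imperfect prompts are resolved at aggregation time rather than by hand-tuning one prompt.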

Results

Task | Dataset | Metric | Value | Model
Question Answering | COPA | Accuracy | 84 | Neo-6B (QA + WS)
Question Answering | COPA | Accuracy | 77 | Neo-6B (few-shot)
Question Answering | COPA | Accuracy | 58.2 | Neo-6B (QA)
Question Answering | Natural Questions | EM | 19.7 | Neo-6B (QA)
Question Answering | Natural Questions | EM | 19.6 | Neo-6B (QA + WS)
Question Answering | Natural Questions | EM | 13.7 | Neo-6B (few-shot)
Question Answering | Story Cloze | Accuracy | 87.8 | Neo-6B (QA + WS)
Question Answering | Story Cloze | Accuracy | 76.3 | Neo-6B (QA)
Question Answering | Story Cloze | Accuracy | 51 | Neo-6B (few-shot)
Question Answering | MultiRC | F1 | 63.8 | Neo-6B (QA + WS)
Question Answering | MultiRC | F1 | 60.8 | Neo-6B (few-shot)
Question Answering | MultiRC | F1 | 58.8 | Neo-6B (QA)
Question Answering | BoolQ | Accuracy | 67.2 | Neo-6B (QA + WS)
Question Answering | BoolQ | Accuracy | 66.5 | Neo-6B (few-shot)
Question Answering | BoolQ | Accuracy | 64.9 | Neo-6B (QA)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 77.9 | Neo-6B (QA + WS)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 74.7 | Neo-6B (QA)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 36.5 | Neo-6B (few-shot)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)