Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

Published: 2023-09-11
Tasks: Question Answering, Multi-task Language Understanding, Common Sense Reasoning, Code Generation

Abstract

We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed using existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good -- such as the ability to "think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations. Encouragingly, though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.

Results

Task | Dataset | Metric | Value | Model
Transfer Learning | MMLU | Average (%) | 37.9 | phi-1.5-web 1.3B
Question Answering | SIQA | Accuracy | 53.0 | phi-1.5-web 1.3B (zero-shot)
Question Answering | SIQA | Accuracy | 52.6 | phi-1.5 1.3B (zero-shot)
Question Answering | PIQA | Accuracy | 77.0 | phi-1.5-web 1.3B (zero-shot)
Code Generation | MBPP | Accuracy | 43.5 | phi-1.5-web 1.3B
Common Sense Reasoning | WinoGrande | Accuracy | 74.0 | phi-1.5-web 1.3B (zero-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 44.9 | phi-1.5-web 1.3B (zero-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 76.1 | phi-1.5-web 1.3B (zero-shot)
Multi-Task Learning | MMLU | Average (%) | 37.9 | phi-1.5-web 1.3B
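The "zero-shot" accuracy figures above come from multiple-choice benchmarks (SIQA, PIQA, ARC, WinoGrande), which are typically scored by asking the language model to rank the candidate answers by likelihood rather than to generate free text. The sketch below illustrates that scoring scheme with a toy bigram "model" standing in for an LLM; the bigram table and example items are invented for illustration and are not from the paper or its evaluation code.

```python
# Toy stand-in for an LLM: log-probabilities for a handful of bigrams.
# In a real zero-shot evaluation these scores would be the model's
# token log-likelihoods; these numbers are illustrative assumptions.
TOY_BIGRAM_LOGPROBS = {
    ("ice", "melts"): -0.5,
    ("ice", "burns"): -4.0,
    ("sun", "rises"): -0.4,
    ("sun", "sinks"): -2.5,
}

def sequence_logprob(tokens):
    """Sum bigram log-probs over a token sequence; unseen bigrams
    receive a low default score."""
    return sum(
        TOY_BIGRAM_LOGPROBS.get((a, b), -6.0)
        for a, b in zip(tokens, tokens[1:])
    )

def pick_answer(context, choices):
    """Zero-shot multiple choice: return the index of the candidate
    completion the model assigns the highest log-likelihood, given
    the context. No task-specific examples or fine-tuning involved."""
    scores = [sequence_logprob(context + choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def accuracy(examples):
    """Fraction of items where the top-scored choice is the gold label,
    reported as a percentage like the table above."""
    correct = sum(
        pick_answer(ctx, choices) == gold for ctx, choices, gold in examples
    )
    return 100.0 * correct / len(examples)

# Two toy items in (context, choices, gold_index) form.
examples = [
    (["ice"], [["melts"], ["burns"]], 0),
    (["sun"], [["rises"], ["sinks"]], 0),
]
print(f"zero-shot accuracy: {accuracy(examples):.1f}%")  # → 100.0%
```

The key point is that the model is never shown labeled examples of the task: each benchmark item is scored purely by comparing the likelihoods the pretrained model assigns to the candidate answers.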

Related Papers

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)