Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

2023-11-27 · Question Answering · Few-Shot Learning · Zero-Shot Learning · Conditional Text Generation · Multiple Choice Question Answering (MCQA)

Paper · PDF · Code (official)

Abstract

Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.
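The "CoT + SC" results below refer to chain-of-thought prompting combined with self-consistency: several reasoning chains are sampled from the model at non-zero temperature, and the final answer is chosen by majority vote over the chains' conclusions. A minimal, model-agnostic sketch of the voting step (`sample_chain` is a hypothetical stand-in for one temperature-sampled model generation, not part of the MEDITRON release):

```python
from collections import Counter
import itertools

def self_consistency_answer(sample_chain, n_samples=5):
    """Sample n reasoning chains and return the majority-vote answer.

    `sample_chain` is any callable returning (reasoning_text, final_answer);
    in a real setup it would wrap a temperature-sampled LLM generation.
    """
    answers = [sample_chain()[1] for _ in range(n_samples)]
    # Majority vote over final answers; ties break by first-seen order.
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in: a "model" whose sampled chains mostly agree on option "C".
fake_chains = itertools.cycle([
    ("chain 1 ...", "C"), ("chain 2 ...", "C"),
    ("chain 3 ...", "B"), ("chain 4 ...", "C"), ("chain 5 ...", "D"),
])
print(self_consistency_answer(lambda: next(fake_chains)))  # prints "C"
```

For multiple-choice benchmarks like MedQA, voting over sampled chains typically beats a single greedy chain because independent reasoning errors rarely agree on the same wrong option.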

Results

Task               | Dataset       | Metric                 | Value  | Model
Few-Shot Learning  | MedConceptsQA | Accuracy               | 25.262 | epfl-llm/meditron-70b
Few-Shot Learning  | MedConceptsQA | Accuracy               | 23.787 | epfl-llm/meditron-7b
Zero-Shot Learning | MedConceptsQA | Accuracy               | 25.751 | epfl-llm/meditron-7b
Zero-Shot Learning | MedConceptsQA | Accuracy               | 25.36  | epfl-llm/meditron-70b
Question Answering | PubMedQA      | Accuracy               | 81.6   | Meditron-70B (CoT + SC)
Question Answering | MedQA         | Accuracy               | 70.2   | Meditron-70B (CoT + SC)
Question Answering | MedQA         | Accuracy               | 61.5   | LLAMA-2 (70B SC CoT)
Question Answering | MedQA         | Accuracy               | 59.2   | LLAMA-2 (70B)
Question Answering | MedMCQA       | Accuracy (dev set, %)  | 66     | Meditron-70B (CoT + SC)
Meta-Learning      | MedConceptsQA | Accuracy               | 25.262 | epfl-llm/meditron-70b
Meta-Learning      | MedConceptsQA | Accuracy               | 23.787 | epfl-llm/meditron-7b

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)