Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao

2023-06-01 · NeurIPS 2023

Tasks: Question Answering, Instruction Following, Image Classification, Referring Expression Generation, Referring Expression Comprehension, Language Modelling, Visual Question Answering

Paper · PDF · Code

Abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms the previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
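The abstract describes a two-stage curriculum: first align biomedical vocabulary by training only the vision-to-language projection on figure-caption pairs, then unfreeze more of the model for instruction tuning on GPT-4 self-instruct data. Below is a minimal sketch of that staged-freezing pattern in PyTorch. It is not the authors' released code: `ToyVLM`, its dimensions, and the toy batches are hypothetical stand-ins used only to show how the trainable-parameter set changes between stages.

```python
# Hedged sketch of two-stage curriculum fine-tuning (hypothetical model/data).
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Hypothetical stand-in for a LLaVA-style model: vision encoder,
    vision-to-language projector, and language-model head."""
    def __init__(self, vis_dim=32, txt_dim=64, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, txt_dim)  # kept frozen here
        self.projector = nn.Linear(txt_dim, txt_dim)       # trained in stage 1
        self.language_model = nn.Linear(txt_dim, vocab)    # unfrozen in stage 2

    def forward(self, image_feats):
        return self.language_model(self.projector(self.vision_encoder(image_feats)))

def run_stage(model, batches, trainable, lr=1e-4):
    """Train only the named submodules; freeze everything else."""
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(t) for t in trainable)
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for image_feats, target_tokens in batches:
        loss = loss_fn(model(image_feats), target_tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyVLM()
toy = [(torch.randn(8, 32), torch.randint(0, 1000, (8,))) for _ in range(4)]

# Stage 1: concept alignment on figure-caption pairs "as is".
run_stage(model, toy, trainable=["projector"])
# Stage 2: open-ended instruction tuning on GPT-4 self-instruct data.
run_stage(model, toy, trainable=["projector", "language_model"])
```

The design point the curriculum exploits: stage 1 cheaply grounds biomedical vocabulary without disturbing the pretrained language model, so stage 2 can focus its capacity on conversational instruction following.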

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.84 | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.62 | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.52 | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 87.22 | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 79.24 | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.04 | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 77.38 | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 66.51 | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 75.25 | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 75.07 | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 73.05 | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 70 | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 99.3 | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 97.74 | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 97.35 | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 90.4 | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) |
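For closed-set tasks like the ColonINST-v1 classification rows above, accuracy is typically exact match between the model's predicted label string and the ground truth. A minimal sketch follows; the JSONL file layout and the `{"id", "answer"}` record schema are assumptions for illustration, not the benchmark's actual format.

```python
# Hedged sketch of closed-set accuracy scoring (hypothetical JSONL schema).
import json

def load_jsonl(path):
    """Read one JSON record per line into an id -> answer map."""
    with open(path) as f:
        return {r["id"]: r["answer"] for r in map(json.loads, f)}

def classification_accuracy(pred_path, gold_path):
    """Percentage of examples whose normalized prediction matches the label."""
    preds, golds = load_jsonl(pred_path), load_jsonl(gold_path)
    hits = sum(preds.get(k, "").strip().lower() == v.strip().lower()
               for k, v in golds.items())
    return 100.0 * hits / len(golds)
```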

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)