Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

2023-04-28

Tasks: Zero-Shot Video Question Answer · Instruction Following · Video-based Generative Performance Benchmarking · Video-based Generative Performance Benchmarking (Contextual Understanding) · Video-based Generative Performance Benchmarking (Correctness of Information) · Video Question Answering · Video-based Generative Performance Benchmarking (Consistency) · Video-based Generative Performance Benchmarking (Temporal Understanding) · Video-based Generative Performance Benchmarking (Detail Orientation) · Visual Question Answering (VQA) · Visual Question Answering · Optical Character Recognition (OCR)

Paper · PDF · Code (official)

Abstract

How to efficiently transform large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norms, biases, and scales), which distributes the instruction-following ability across the entire LLaMA model beyond the adapters. Second, we propose an early fusion strategy that feeds visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Third, we introduce a joint training paradigm on image-text pairs and instruction-following data that optimizes disjoint groups of learnable parameters. This strategy effectively alleviates interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g., captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, LLaMA-Adapter V2 can perform open-ended multi-modal instructions by introducing only 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
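Two of the ideas in the abstract can be illustrated with a minimal sketch (this is not the official implementation; the parameter names and the `run_layers` stand-in below are hypothetical): selecting only norm/bias/scale/adapter parameters for training while freezing the rest of the backbone, and injecting visual tokens into only the first few layers.

```python
# Illustrative sketch of two ideas from the LLaMA-Adapter V2 abstract,
# using plain-Python stand-ins for model components. Not the official code;
# parameter names are hypothetical, not real checkpoint keys.

# 1) "Unlocking" extra learnable parameters: besides the adapter weights,
#    only normalization weights, biases, and scale factors are trained;
#    everything else in the LLaMA backbone stays frozen.
TRAINABLE_MARKERS = ("norm", "bias", "scale", "adapter")

def trainable_subset(param_names):
    """Return the parameter names that would be unfrozen for tuning."""
    return [n for n in param_names
            if any(marker in n for marker in TRAINABLE_MARKERS)]

# 2) Early fusion: visual tokens are injected only into the first
#    `early_layers` blocks; later blocks process the text state alone.
#    Adding the visual tokens to the state is a crude stand-in for the
#    paper's actual mechanism of attending to visual features.
def run_layers(layers, text_state, visual_tokens, early_layers=8):
    state = text_state
    for i, layer in enumerate(layers):
        if i < early_layers:
            state = layer(state + visual_tokens)  # early: sees visual input
        else:
            state = layer(state)                  # late: text-only
    return state
```

The same split of parameters into disjoint groups (adapter/visual-projection weights vs. unlocked norm/bias/scale weights) is what the joint training paradigm optimizes on the image-text and instruction-following data respectively.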

Results

Task | Dataset | Metric | Value | Model
Question Answering | MSVD-QA | Accuracy | 54.9 | LLaMA Adapter-7B
Question Answering | MSVD-QA | Confidence Score | 3.1 | LLaMA Adapter-7B
Question Answering | MSRVTT-QA | Accuracy | 43.8 | LLaMA Adapter-7B
Question Answering | MSRVTT-QA | Confidence Score | 2.7 | LLaMA Adapter-7B
Question Answering | ActivityNet-QA | Accuracy | 34.2 | LLaMA Adapter
Question Answering | ActivityNet-QA | Confidence Score | 2.7 | LLaMA Adapter
Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 46.12 | LLaMA-Adapter V2
Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 22.08 | LLaMA-Adapter V2
Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 28.7 | LLaMA-Adapter V2
Visual Question Answering (VQA) | InfiMM-Eval | Overall Score | 30.46 | LLaMA-Adapter V2
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.15 | LLaMA Adapter
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.3 | LLaMA Adapter
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.03 | LLaMA Adapter
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.32 | LLaMA Adapter
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 1.98 | LLaMA Adapter
Visual Question Answering (VQA) | VideoInstruct | Mean | 2.16 | LLaMA Adapter
Video Question Answering | ActivityNet-QA | Accuracy | 34.2 | LLaMA Adapter V2
Video Question Answering | ActivityNet-QA | Confidence Score | 2.7 | LLaMA Adapter V2
Video Question Answering | MSVD-QA | Accuracy | 54.9 | LLaMA Adapter-7B
Video Question Answering | MSVD-QA | Confidence Score | 3.1 | LLaMA Adapter-7B
Video Question Answering | MSRVTT-QA | Accuracy | 43.8 | LLaMA Adapter-7B
Video Question Answering | MSRVTT-QA | Confidence Score | 2.7 | LLaMA Adapter-7B
Video Question Answering | ActivityNet-QA | Accuracy | 34.2 | LLaMA Adapter
Video Question Answering | ActivityNet-QA | Confidence Score | 2.7 | LLaMA Adapter
Generative Visual Question Answering | VideoInstruct | Consistency | 2.15 | LLaMA Adapter
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 2.3 | LLaMA Adapter
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.03 | LLaMA Adapter
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.32 | LLaMA Adapter
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 1.98 | LLaMA Adapter
Generative Visual Question Answering | VideoInstruct | Mean | 2.16 | LLaMA Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.15 | LLaMA Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.3 | LLaMA Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.03 | LLaMA Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.32 | LLaMA Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 1.98 | LLaMA Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 2.16 | LLaMA Adapter
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.03 | LLaMA Adapter

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)