Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

Published: 2023-11-06
Tasks: Long-Context Understanding · Visual Question Answering (VQA) · 1 Image, 2*2 Stitching · FS-MEVQA · Language Modelling · Image Retrieval
Links: Paper · PDF · Code (official)

Abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.
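The key architectural idea in the abstract is that image tokens and text tokens pass through different projection weights inside each attention layer: text tokens use the frozen language model's QKV/output matrices, while image tokens use a parallel, trainable "visual expert" copy, and all tokens then attend to each other jointly. A minimal single-head numpy sketch of this routing (shapes, names, and the single-head simplification are illustrative assumptions, not the official implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_expert_attention(h, is_image, W_text, W_img):
    """Single-head attention with per-modality weight routing.

    h        : (T, d) hidden states for a mixed image+text sequence
    is_image : (T,) bool mask, True where the token is an image token
    W_text   : dict with 'q','k','v','o' matrices (d, d) -- frozen LM weights
    W_img    : dict with 'q','k','v','o' matrices (d, d) -- trainable expert
    """
    T, d = h.shape
    q, k, v = (np.empty_like(h) for _ in range(3))
    for name, buf in (("q", q), ("k", k), ("v", v)):
        buf[~is_image] = h[~is_image] @ W_text[name]  # frozen LM path
        buf[is_image] = h[is_image] @ W_img[name]     # visual expert path
    # Joint attention: every token attends over the full mixed sequence,
    # which is what enables "deep fusion" of the two modalities.
    attn = softmax(q @ k.T / np.sqrt(d))
    mixed = attn @ v
    out = np.empty_like(h)
    out[~is_image] = mixed[~is_image] @ W_text["o"]
    out[is_image] = mixed[is_image] @ W_img["o"]
    return out
```

Because text tokens never touch the expert weights, setting `W_img` equal to `W_text` collapses this to ordinary attention, which is why the frozen LM's NLP behavior is preserved at initialization; the same routing idea applies to the FFN layers.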

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 47.88 | CogVLM-Chat
Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 28.75 | CogVLM-Chat
Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 36.75 | CogVLM-Chat
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 37.16 | CogVLM-Chat
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 63.9 | GLM4 Vision
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 52.8 | CogVLM (Vicuna-7B)
VQA / Explanatory VQA | SME | #Learning Samples (N) | 16 | GLM-4V
VQA / Explanatory VQA | SME | ACC | 34.23 | GLM-4V
VQA / Explanatory VQA | SME | BLEU-4 | 14.45 | GLM-4V
VQA / Explanatory VQA | SME | CIDEr | 127.37 | GLM-4V
VQA / Explanatory VQA | SME | Detection | 0.89 | GLM-4V
VQA / Explanatory VQA | SME | METEOR | 17.53 | GLM-4V
VQA / Explanatory VQA | SME | ROUGE-L | 24.28 | GLM-4V
VQA / Explanatory VQA | SME | SPICE | 17.7 | GLM-4V
Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 7.3 | CogVLM2-Llama-3
Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 0.9 | CogVLM2-Llama-3
Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 0.1 | CogVLM2-Llama-3
Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 0.1 | CogVLM-17B
Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 0.3 | CogVLM-17B

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)