InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

2023-12-21 | CVPR 2024

Tasks: Zero-Shot Cross-Modal Retrieval, MMR total, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Zero-Shot Transfer Image Classification, Large Language Model, Image-to-Text Retrieval, Video Classification, Retrieval, Visual Question Answering (VQA), Zero-shot Image Retrieval, Language Modelling, Visual Question Answering, Image Retrieval
Paper | PDF | Code (official)

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multimodal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. The model can be broadly applied to 32 generic visual-linguistic benchmarks and achieves state-of-the-art performance on them, spanning visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multimodal dialogue systems. It has powerful visual capabilities and is a good alternative to ViT-22B. We hope that our research contributes to the development of multimodal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
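
For readers unfamiliar with the contrastive image-text alignment the abstract describes, the sketch below shows the standard CLIP-style symmetric InfoNCE objective used for this kind of vision-language alignment. It is a minimal illustration, not the paper's actual training code: the encoders, batch contents, and embedding dimensions are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the two encoders.
    Matching pairs share the same row index; every other row in the
    batch serves as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) cosine-similarity logits, sharpened by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

The symmetric form matters: optimizing both directions at once is what makes the same embedding space usable for image-to-text and text-to-image retrieval, the two directions reported in the results below.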

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.2 | InternVL-C
Image Retrieval | Flickr30k-CN | R@1 | 85.9 | InternVL-G-FT
Image Retrieval | Flickr30k-CN | R@10 | 97.1 | InternVL-G-FT
Image Retrieval | Flickr30k-CN | R@5 | 98.7 | InternVL-G-FT
Image Retrieval | Flickr30k-CN | R@1 | 85.2 | InternVL-C-FT
Image Retrieval | Flickr30k-CN | R@10 | 97.0 | InternVL-C-FT
Image Retrieval | Flickr30k-CN | R@5 | 98.5 | InternVL-C-FT
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 95.7 | InternVL-G
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.9 | InternVL-G
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.7 | InternVL-G
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 85.0 | InternVL-G
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 98.6 | InternVL-G
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 97.0 | InternVL-G
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 94.7 | InternVL-C
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.9 | InternVL-C
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.6 | InternVL-C
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 81.7 | InternVL-C
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 98.2 | InternVL-C
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 96.0 | InternVL-C
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 74.9 | InternVL-G
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 95.2 | InternVL-G
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 91.3 | InternVL-G
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 58.6 | InternVL-G
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 88.0 | InternVL-G
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 81.3 | InternVL-G
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 70.6 | InternVL-C
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 93.5 | InternVL-C
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 89.0 | InternVL-C
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 54.1 | InternVL-C
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 84.6 | InternVL-C
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 77.3 | InternVL-C
Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 77.3 | InternVL-C
Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 83.8 | InternVL-C
Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 83.2 | InternVL-C
Zero-Shot Transfer Image Classification | CN-ImageNet | Accuracy (Private) | 64.5 | InternVL-C
Zero-Shot Transfer Image Classification | Food-101 | Top 1 Accuracy | 95.3 | InternVL-C
Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 80.6 | InternVL-C
Zero-Shot Transfer Image Classification | ImageNet-Sketch | Accuracy (Private) | 73.9 | InternVL-C
Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.9 | InternVL-G-FT (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | InternVL-G-FT (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | InternVL-G-FT (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.2 | InternVL-C-FT (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | InternVL-C-FT (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | InternVL-C-FT (finetuned, w/o ranking)
MMR total | MMR-Benchmark | Total Column Score | 368 | InternVL2-8B
MMR total | MMR-Benchmark | Total Column Score | 237 | InternVL2-1B
Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@1 | 46.3 | InternVL-G
Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@10 | 79.6 | InternVL-G
Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@5 | 70.5 | InternVL-G
Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@1 | 42.4 | InternVL-G
Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@10 | 75.4 | InternVL-G
Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@5 | 65.9 | InternVL-G
Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@1 | 44.7 | InternVL-C
Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@10 | 78.4 | InternVL-C
Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@5 | 68.2 | InternVL-C
Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@1 | 40.2 | InternVL-C
Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@10 | 74.1 | InternVL-C
Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@5 | 63.1 | InternVL-C
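
The R@K numbers above follow the standard retrieval protocol: each query ranks all gallery items by embedding similarity, and R@K is the percentage of queries whose ground-truth match lands in the top K. Below is a minimal sketch of that computation, assuming one ground-truth gallery item per query at the matching index; real benchmarks such as Flickr30k pair each image with several captions, so the actual protocol counts any of them as a hit.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Recall@K for embedding-based retrieval.

    Assumes query i's ground-truth match is gallery item i
    (one correct match per query). Embeddings: (n, dim) tensors.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    sim = query_emb @ gallery_emb.t()                 # (n_q, n_g) similarities
    ranks = sim.argsort(dim=-1, descending=True)      # best match first
    target = torch.arange(len(query_emb), device=query_emb.device).unsqueeze(1)
    # Position of the correct gallery item within each query's ranking.
    positions = (ranks == target).float().argmax(dim=-1)
    return {f"R@{k}": (positions < k).float().mean().item() * 100 for k in ks}
```

Running both directions of this function (captions as queries against images, and vice versa) yields the paired text-to-image and image-to-text rows reported in the table.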

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)