Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.




The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

2023-09-29 · MMR total
Paper · PDF · Code

Abstract

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, including test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and a better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf
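The two prompting patterns highlighted in the abstract, arbitrarily interleaved image-text inputs and visual referring prompting (markers drawn directly on the input image), can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' code: it assumes the OpenAI Python SDK's chat-completions image-input format, a vision-capable model name ("gpt-4o" is used here only as a placeholder), and a hypothetical local image "scene.png" with a made-up region to circle.

```python
# Sketch of (1) interleaved image+text prompting and (2) visual referring
# prompting: a red marker drawn on the image points the model at a region.
# Model name, image path, and region coordinates are placeholders.
import base64
from io import BytesIO

from PIL import Image, ImageDraw   # pip install pillow
from openai import OpenAI          # pip install openai


def draw_marker(path: str, box: tuple[int, int, int, int]) -> str:
    """Draw a red ellipse around `box` and return the image as base64 PNG."""
    img = Image.open(path).convert("RGB")
    ImageDraw.Draw(img).ellipse(box, outline="red", width=5)
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


client = OpenAI()  # reads OPENAI_API_KEY from the environment
marked = draw_marker("scene.png", (120, 80, 360, 240))  # hypothetical image/region

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [  # interleaved text and image parts
            {"type": "text", "text": "What is the object inside the red circle?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{marked}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Drawing the marker before encoding keeps the pipeline model-agnostic: any chat model that accepts data-URL image inputs can then be asked about "the object inside the red circle", which is the essence of the visual referring prompting described above.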

Results

Task      | Dataset       | Metric             | Value | Model
MMR total | MMR-Benchmark | Total Column Score | 415   | GPT-4V

Related Papers

MMR: Evaluating Reading Ability of Large Multimodal Models (2024-08-26)
Claude 3.5 Sonnet Model Card Addendum (2024-06-24)
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding (2024-06-14)
What matters when building vision-language models? (2024-05-03)
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (2024-04-22)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (2023-12-21)
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (2023-11-11)
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023-08-24)