
Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu

2023-05-05 · Instruction Following · Visual Reasoning · Visual Question Answering (VQA) · Visual Question Answering

Paper · PDF · Code (official)

Abstract

Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners on various tasks, owing to their pre-training on vast amounts of text data. GPT-3 exemplifies this, and its successors InstructGPT and ChatGPT effectively follow natural language instructions to accomplish real-world tasks. In this paper, we propose introducing instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved-format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1× A100 GPU to 4× RTX-3090 GPUs, and we integrate both OpenFlamingo and Otter into Hugging Face Transformers so that more researchers can incorporate the models into their customized training and inference pipelines.
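To illustrate the Hugging Face Transformers integration the abstract mentions, here is a minimal inference sketch. The import path, checkpoint id, prompt template, and generate() arguments follow the public Otter repository (https://github.com/Luodian/Otter) as best we can tell, but all of them should be treated as assumptions and verified against the official code rather than read as the confirmed API.

```python
# Hedged inference sketch for Otter. Identifiers below (otter_ai import,
# "luodian/OTTER-Image-MPT7B" checkpoint, <image>/<answer> prompt tokens,
# generate() keywords) are assumptions based on the public Otter repo.
import requests
import torch
import transformers
from PIL import Image
from otter_ai import OtterForConditionalGeneration  # assumed import path

# Load an assumed released checkpoint from the Hugging Face Hub.
model = OtterForConditionalGeneration.from_pretrained(
    "luodian/OTTER-Image-MPT7B",  # assumed checkpoint id
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.eval()
tokenizer = model.text_tokenizer  # assumed attribute, per the repo's demo
image_processor = transformers.CLIPImageProcessor()

# One query image; placeholder URL. Flamingo-style models expect vision
# input shaped (batch, n_media, n_frames, C, H, W).
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
vision_x = image_processor.preprocess([image], return_tensors="pt")["pixel_values"]
vision_x = vision_x.unsqueeze(1).unsqueeze(0)  # (1, 1, 1, C, H, W)

# Interleaved instruction-style prompt with assumed special tokens.
prompt = "<image>User: What is unusual about this image? GPT:<answer>"
lang_x = tokenizer([prompt], return_tensors="pt")

output_ids = model.generate(
    vision_x=vision_x.to(model.device, dtype=model.dtype),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=256,
    num_beams=3,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```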

Results

Task                            | Dataset     | Metric        | Value | Model
Visual Question Answering (VQA) | InfiMM-Eval | Abductive     | 33.64 | Otter
Visual Question Answering (VQA) | InfiMM-Eval | Analogical    | 13.33 | Otter
Visual Question Answering (VQA) | InfiMM-Eval | Deductive     | 22.49 | Otter
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 22.69 | Otter
Visual Question Answering (VQA) | BenchLMM    | GPT-3.5 score | 39.13 | Otter-7B
Visual Question Answering       | BenchLMM    | GPT-3.5 score | 39.13 | Otter-7B

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)