


Emu: Generative Pretraining in Multimodality

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

2023-07-11 · Question Answering · Text-to-Image Generation · Text Generation · Temporal/Causal QA · Image-to-Text · Video Question Answering · Image Captioning · Image Generation · Visual Question Answering (VQA)

Paper · PDF · Code · Code (official)

Abstract

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
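To make the unified objective concrete, here is a minimal PyTorch-style sketch of a loss that classifies the next text token at text positions and regresses the next visual embedding at visual positions, as the abstract describes. The class name, tensor shapes, equal loss weighting, and the choice of an l2 regression loss are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class UnifiedAutoregressiveLoss(nn.Module):
    """Sketch of Emu-style training: classify the next text token,
    regress the next visual embedding (details are assumptions)."""

    def __init__(self, hidden_dim: int, vocab_size: int, visual_dim: int):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, vocab_size)    # next-token classifier
        self.visual_head = nn.Linear(hidden_dim, visual_dim)  # next-embedding regressor
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()  # assumed l2-style regression loss

    def forward(self, hidden, text_targets, visual_targets, is_text):
        # hidden:         (seq, hidden_dim) transformer outputs per position
        # text_targets:   (seq,) ids of the next text token (used where is_text)
        # visual_targets: (seq, visual_dim) next visual embeddings (used elsewhere)
        # is_text:        (seq,) bool mask, True where the next element is text
        loss = hidden.new_zeros(())
        if is_text.any():
            loss = loss + self.ce(self.text_head(hidden[is_text]),
                                  text_targets[is_text])
        if (~is_text).any():
            loss = loss + self.mse(self.visual_head(hidden[~is_text]),
                                   visual_targets[~is_text])
        return loss


if __name__ == "__main__":
    seq, hid, vocab, vis = 16, 512, 32000, 1024  # hypothetical sizes
    loss_fn = UnifiedAutoregressiveLoss(hid, vocab, vis)
    hidden = torch.randn(seq, hid)
    text_targets = torch.randint(0, vocab, (seq,))
    visual_targets = torch.randn(seq, vis)
    is_text = torch.rand(seq) < 0.7  # interleaved text/visual positions
    print(loss_fn(hidden, text_targets, visual_targets, is_text).item())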

Results

Task                            | Dataset     | Metric        | Value | Model
Question Answering              | NExT-QA     | WUPS          | 23.4  | Emu (0-shot)
Visual Question Answering (VQA) | InfiMM-Eval | Abductive     | 36.57 | Emu
Visual Question Answering (VQA) | InfiMM-Eval | Analogical    | 18.19 | Emu
Visual Question Answering (VQA) | InfiMM-Eval | Deductive     | 28.9  | Emu
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 28.24 | Emu
Visual Question Answering (VQA) | VQA v2      | Accuracy      | 57.5  | Emu-I *
Visual Question Answering (VQA) | VizWiz      | Accuracy      | 38.1  | Emu-I *

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)