Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

2023-10-09
Tasks: Video Compression, Video Prediction, Action Recognition, Image Generation, Language Modelling, Video Generation

Abstract

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
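
To make the idea of a visual tokenizer concrete, here is a minimal sketch of mapping pixel-space input to discrete token ids. It is not MAGVIT-v2's method: the abstract does not describe the quantizer, so this toy uses plain nearest-neighbour vector quantization on raw grayscale patches, whereas real tokenizers quantize learned encoder features. The function names, the patch size, and the two-entry codebook are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def to_patches(image: List[List[float]], size: int) -> List[List[float]]:
    """Split a 2-D grayscale image into non-overlapping size x size patches.

    Patches are returned in row-major order, each flattened to a list.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append(
                [image[top + dy][left + dx] for dy in range(size) for dx in range(size)]
            )
    return patches


def nearest_code(patch: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to the patch (squared L2)."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))


def tokenize(image: List[List[float]], codebook: Sequence[Sequence[float]], size: int) -> List[int]:
    """Map an image to a sequence of discrete token ids, one per patch."""
    return [nearest_code(patch, codebook) for patch in to_patches(image, size)]


def detokenize(tokens: Sequence[int], codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look each token id back up in the codebook (a lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 2x4 image with a dark left half and a bright right half.
image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
]
# Hypothetical two-entry codebook: token 0 = "dark patch", token 1 = "bright patch".
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]

tokens = tokenize(image, codebook, size=2)
print(tokens)  # [0, 1]
```

The resulting integer sequence is the kind of input a language model can be trained on with next-token or masked-token objectives. The vocabulary size here is simply the number of codebook entries.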

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | ImageNet 512x512 | FID | 1.91 | MAGVIT-v2 |
| Image Generation | ImageNet 512x512 | Inception score | 324.3 | MAGVIT-v2 |
| Image Generation | ImageNet 512x512 | FID | 3.07 | MAGVIT-v2 (w/o guidance) |
| Image Generation | ImageNet 512x512 | Inception score | 213.1 | MAGVIT-v2 (w/o guidance) |
| Image Generation | ImageNet 256x256 | FID | 1.78 | MAGVIT-v2 |
| Image Generation | ImageNet 256x256 | FID | 3.65 | MAGVIT-v2 (w/o guidance) |
| Video | UCF-101 | FVD16 | 109 | MAGVIT-v2 (AR) |
| Video Generation | UCF-101 | FVD16 | 109 | MAGVIT-v2 (AR) |
