Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan

Date: 2025-06-03
Tasks: Text-to-Image Generation, Image Editing, Image Generation, Image Manipulation

Abstract

Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.

Results

Task: Image Generation
Dataset: WISE
Model: UniWorld-V1

Metric      Value
Biology     0.45
Chemistry   0.41
Cultural    0.53
Physics     0.59
Space       0.73
Time        0.55
Overall     0.55

Tasks: Image Generation, Text-to-Image Generation, 10-shot image generation, 1 Image, 2*2 Stitchi
(the same GenEval values are listed under each of these four task labels)
Dataset: GenEval

Metric         UniWorld-V1   UniWorld-V1 (Rewrite)
Single Obj.    0.99          0.98
Two Obj.       0.93          0.93
Counting       0.79          0.81
Colors         0.89          0.9
Position       0.49          0.74
Color Attri.   0.7           0.71
Overall        0.8           0.84

Task: Image Editing
Dataset: ImgEdit-Data
Model: UniWorld-V1

Metric       Value
Action       2.74
Add          3.82
Adjust       3.64
Background   2.99
Extract      2.27
Hybrid       2.96
Remove       3.24
Replace      3.47
Style        4.21
Overall      3.26
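
The page does not say how the Overall rows are derived. As a quick consistency check, the sketch below assumes that the GenEval and ImgEdit-Data Overall values are plain unweighted means of the listed sub-scores and compares that mean with the reported number. The averaging rule is an assumption made for illustration, not something stated here, and the helper name `unweighted_overall` is hypothetical. WISE is left out because its Overall does not match a plain mean of the six listed categories.

```python
from statistics import mean

# Sub-scores copied from the results listed above (Overall rows excluded).
sub_scores = {
    "GenEval / UniWorld-V1": [0.99, 0.93, 0.79, 0.89, 0.49, 0.70],
    "GenEval / UniWorld-V1 (Rewrite)": [0.98, 0.93, 0.81, 0.90, 0.74, 0.71],
    "ImgEdit-Data / UniWorld-V1": [3.82, 3.64, 2.27, 3.47, 3.24, 2.99, 4.21, 2.96, 2.74],
}

# Overall values as reported on the page.
reported_overall = {
    "GenEval / UniWorld-V1": 0.80,
    "GenEval / UniWorld-V1 (Rewrite)": 0.84,
    "ImgEdit-Data / UniWorld-V1": 3.26,
}


def unweighted_overall(scores):
    """Assumed aggregation rule: arithmetic mean of the sub-scores."""
    return mean(scores)


for name, scores in sub_scores.items():
    computed = unweighted_overall(scores)
    gap = abs(computed - reported_overall[name])
    # Inputs are rounded to two decimals, so a small gap is expected.
    print(f"{name}: computed {computed:.3f}, reported {reported_overall[name]}, gap {gap:.3f}")
```

Under this assumption every computed mean falls within 0.01 of the reported Overall, which is the gap one would expect from two-decimal rounding of the inputs.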

Related Papers

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining (2025-07-18)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)