Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Yicheng Xiao, Lin Song, Yukang Chen, Yingmin Luo, Yuxin Chen, Yukang Gan, Wei Huang, Xiu Li, Xiaojuan Qi, Ying Shan

2025-05-19
Tasks: Text-to-Image Generation, Mathematical Reasoning, Multimodal Large Language Model, Large Language Model, Image Generation, Language Modelling

Abstract

Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni uses a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, which uses multimodal feedback to guide policy updates. Experimental results show that MindOmni outperforms existing models on both understanding and generation benchmarks, while also demonstrating fine-grained reasoning generation capabilities, especially with mathematical reasoning instructions. The code will be made public at https://github.com/EasonXiao-888/MindOmni.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | WISE | Biology | 0.76 | MindOmni (w/ cot) |
| Image Generation | WISE | Chemistry | 0.52 | MindOmni (w/ cot) |
| Image Generation | WISE | Cultural | 0.75 | MindOmni (w/ cot) |
| Image Generation | WISE | Overall | 0.71 | MindOmni (w/ cot) |
| Image Generation | WISE | Physics | 0.72 | MindOmni (w/ cot) |
| Image Generation | WISE | Space | 0.76 | MindOmni (w/ cot) |
| Image Generation | WISE | Time | 0.7 | MindOmni (w/ cot) |
| Image Generation | WISE | Biology | 0.36 | MindOmni (w/o cot) |
| Image Generation | WISE | Chemistry | 0.32 | MindOmni (w/o cot) |
| Image Generation | WISE | Cultural | 0.4 | MindOmni (w/o cot) |
| Image Generation | WISE | Overall | 0.43 | MindOmni (w/o cot) |
| Image Generation | WISE | Physics | 0.52 | MindOmni (w/o cot) |
| Image Generation | WISE | Space | 0.62 | MindOmni (w/o cot) |
| Image Generation | WISE | Time | 0.38 | MindOmni (w/o cot) |
| Image Generation | GenEval | Color Attri. | 0.71 | MindOmni |
| Image Generation | GenEval | Colors | 0.9 | MindOmni |
| Image Generation | GenEval | Counting | 0.71 | MindOmni |
| Image Generation | GenEval | Overall | 0.83 | MindOmni |
| Image Generation | GenEval | Position | 0.71 | MindOmni |
| Image Generation | GenEval | Single Obj. | 0.99 | MindOmni |
| Image Generation | GenEval | Two Obj. | 0.94 | MindOmni |
| Text-to-Image Generation | GenEval | Color Attri. | 0.71 | MindOmni |
| Text-to-Image Generation | GenEval | Colors | 0.9 | MindOmni |
| Text-to-Image Generation | GenEval | Counting | 0.71 | MindOmni |
| Text-to-Image Generation | GenEval | Overall | 0.83 | MindOmni |
| Text-to-Image Generation | GenEval | Position | 0.71 | MindOmni |
| Text-to-Image Generation | GenEval | Single Obj. | 0.99 | MindOmni |
| Text-to-Image Generation | GenEval | Two Obj. | 0.94 | MindOmni |
| 10-shot image generation | GenEval | Color Attri. | 0.71 | MindOmni |
| 10-shot image generation | GenEval | Colors | 0.9 | MindOmni |
| 10-shot image generation | GenEval | Counting | 0.71 | MindOmni |
| 10-shot image generation | GenEval | Overall | 0.83 | MindOmni |
| 10-shot image generation | GenEval | Position | 0.71 | MindOmni |
| 10-shot image generation | GenEval | Single Obj. | 0.99 | MindOmni |
| 10-shot image generation | GenEval | Two Obj. | 0.94 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Color Attri. | 0.71 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Colors | 0.9 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Counting | 0.71 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Overall | 0.83 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Position | 0.71 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Single Obj. | 0.99 | MindOmni |
| 1 Image, 2*2 Stitchi | GenEval | Two Obj. | 0.94 | MindOmni |
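
The WISE rows above report each category twice, once with CoT and once without. As a reading aid, the following minimal Python sketch computes the absolute gain per category from those listed values. The dictionary names and the `cot_gains` helper are illustrative assumptions, not part of the MindOmni codebase; the numbers are copied from the table.

```python
# WISE scores copied from the results table (MindOmni with and without CoT).
wise_with_cot = {
    "Biology": 0.76, "Chemistry": 0.52, "Cultural": 0.75,
    "Physics": 0.72, "Space": 0.76, "Time": 0.70, "Overall": 0.71,
}
wise_without_cot = {
    "Biology": 0.36, "Chemistry": 0.32, "Cultural": 0.40,
    "Physics": 0.52, "Space": 0.62, "Time": 0.38, "Overall": 0.43,
}


def cot_gains(with_cot, without_cot):
    """Return the absolute score gain per category, rounded to 2 decimals.

    Hypothetical helper for illustration only.
    """
    return {k: round(with_cot[k] - without_cot[k], 2) for k in with_cot}


gains = cot_gains(wise_with_cot, wise_without_cot)

# Print categories from largest to smallest gain.
for category, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<10} +{gain:.2f}")
```

On these figures, Biology shows the largest gain from CoT (+0.40), Space the smallest (+0.14), and the Overall score rises by 0.28.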

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)