Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

2025-01-23 | Text-to-Image Generation, Image Generation

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
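
To make "scaling test-time computation for verification" concrete, the following is a minimal best-of-N sketch: sample several candidates, score each with a reward model, and keep the highest-scoring one. It is an illustration under stated assumptions, not the released implementation. The names `best_of_n`, `toy_generate`, and `toy_score` are hypothetical, the toy generator emits random token ids rather than images, and the toy scorer merely stands in for an outcome-style reward model; PARM's step-wise potential assessment is not modeled here.

```python
import random
from typing import Callable, List, Tuple

Tokens = List[int]


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], Tokens],
    score: Callable[[str, Tokens], float],
    n: int = 4,
) -> Tuple[Tokens, float]:
    """Sample n candidates and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One candidate per seed; more candidates means more test-time compute.
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(score(prompt, cand), cand) for cand in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(prompt: str, seed: int) -> Tokens:
    """Hypothetical stand-in for an autoregressive image generator."""
    rng = random.Random(seed)
    return [rng.randrange(16) for _ in range(8)]


def toy_score(prompt: str, tokens: Tokens) -> float:
    """Hypothetical stand-in for a reward model; returns a value in [0, 1]."""
    return sum(tokens) / (15 * len(tokens))


if __name__ == "__main__":
    image_tokens, reward = best_of_n("a red cube", toy_generate, toy_score, n=4)
    print(image_tokens, round(reward, 3))
```

In this sketch, raising `n` trades extra generation cost for a better chance that at least one candidate satisfies the prompt. The same scores could in principle be used to rank candidates into preferred and rejected pairs for DPO training, though that step is not shown.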

Results

Task	Dataset	Metric	Value	Model
Image Generation	GenEval	Overall	0.77	Show-o [xie2024show] PARM It. DPO PARM
Image Generation	GenEval	Overall	0.75	Show-o [xie2024show] Ft. ORM It. DPO Ft. ORM
Text-to-Image Generation	GenEval	Overall	0.77	Show-o [xie2024show] PARM It. DPO PARM
Text-to-Image Generation	GenEval	Overall	0.75	Show-o [xie2024show] Ft. ORM It. DPO Ft. ORM
10-shot image generation	GenEval	Overall	0.77	Show-o [xie2024show] PARM It. DPO PARM
10-shot image generation	GenEval	Overall	0.75	Show-o [xie2024show] Ft. ORM It. DPO Ft. ORM
1 Image, 2*2 Stitchi	GenEval	Overall	0.77	Show-o [xie2024show] PARM It. DPO PARM
1 Image, 2*2 Stitchi	GenEval	Overall	0.75	Show-o [xie2024show] Ft. ORM It. DPO Ft. ORM
