Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie
Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
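The interface described above can be illustrated with a toy sketch. This is not the authors' implementation: `FrozenBackbone` is a hypothetical stand-in for the MLLM (a single fixed linear map with mean-based token mixing), the linear `connector` is an assumption about how the query latents are mapped to the diffusion decoder's conditioning space, and all dimensions are arbitrary. It only shows the data flow: learnable queries are appended to the prompt, passed through the frozen backbone, and the latents at the query positions become the conditioning signal.

```python
import random


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Small random matrix used for toy weights and embeddings."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class FrozenBackbone:
    """Hypothetical stand-in for a frozen MLLM; its weight is never updated."""

    def __init__(self, dim, rng):
        self.weight = rand_matrix(dim, dim, rng)

    def __call__(self, tokens):
        # Crude token mixing: add the sequence mean to every position so the
        # query positions can "read" the prompt, then apply the fixed map.
        n = len(tokens)
        mean = [sum(col) / n for col in zip(*tokens)]
        mixed = [[t + m for t, m in zip(tok, mean)] for tok in tokens]
        return matmul(mixed, self.weight)


class MetaQueryInterface:
    """Trainable parts in this sketch: the queries and an assumed linear connector."""

    def __init__(self, num_queries, mllm_dim, cond_dim, rng):
        self.queries = rand_matrix(num_queries, mllm_dim, rng)
        self.connector = rand_matrix(mllm_dim, cond_dim, rng)

    def condition(self, backbone, prompt_embeddings):
        # Append the learnable queries after the prompt embeddings.
        sequence = prompt_embeddings + self.queries
        hidden = backbone(sequence)
        # Keep only the latents at the query positions.
        query_hidden = hidden[-len(self.queries):]
        # Map them into the (assumed) conditioning space of the diffusion decoder.
        return matmul(query_hidden, self.connector)


rng = random.Random(0)
backbone = FrozenBackbone(dim=8, rng=rng)
interface = MetaQueryInterface(num_queries=4, mllm_dim=8, cond_dim=16, rng=rng)
prompt = rand_matrix(5, 8, rng)          # 5 toy prompt-token embeddings
cond = interface.condition(backbone, prompt)
print(len(cond), len(cond[0]))           # one conditioning vector per query
```

In a training loop under the recipe the abstract describes, the conditioning produced here would feed a diffusion decoder trained with a standard diffusion loss on image-caption pairs, while the backbone weights receive no updates.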
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | WISE | Biology | 0.49 | MetaQuery-XL |
| Image Generation | WISE | Chemistry | 0.41 | MetaQuery-XL |
| Image Generation | WISE | Cultural | 0.56 | MetaQuery-XL |
| Image Generation | WISE | Overall | 0.55 | MetaQuery-XL |
| Image Generation | WISE | Physics | 0.63 | MetaQuery-XL |
| Image Generation | WISE | Space | 0.62 | MetaQuery-XL |
| Image Generation | WISE | Time | 0.55 | MetaQuery-XL |
| Image Generation | DPG | Overall | 82.05 | MetaQuery-XL |
| Image Generation | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
| Text-to-Image Generation | DPG | Overall | 82.05 | MetaQuery-XL |
| Text-to-Image Generation | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
| 10-shot image generation | DPG | Overall | 82.05 | MetaQuery-XL |
| 10-shot image generation | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
| 1 Image, 2*2 Stitchi | DPG | Overall | 82.05 | MetaQuery-XL |
| 1 Image, 2*2 Stitchi | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |