
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie

2025-04-08 | Text-to-Image Generation | Image Generation

Abstract

Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
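The data flow described in the abstract can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the architecture, so every name, dimension, and layer here (`frozen_mllm`, `W_connector`, `W_denoise`, mean pooling of the query latents, the uniform token mixing) is hypothetical. It only shows the idea that learnable queries are appended to the prompt, their output latents are read from a frozen backbone, and the result conditions a denoiser trained with a standard noise-prediction loss on an image-caption pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the abstract does not specify any of these.
D_MLLM, D_COND, N_QUERIES, N_TEXT, D_IMG = 16, 8, 4, 6, 12

# Frozen stand-in for the MLLM backbone: a fixed random token-mixing layer.
W_frozen = rng.standard_normal((D_MLLM, D_MLLM)) / np.sqrt(D_MLLM)

def frozen_mllm(tokens):
    """Toy backbone: each position sees the mean of the whole sequence."""
    context = tokens.mean(axis=0, keepdims=True)
    return np.tanh((tokens + context) @ W_frozen)

# Trainable pieces in this sketch: the query embeddings and a linear connector.
queries = rng.standard_normal((N_QUERIES, D_MLLM)) * 0.02
W_connector = rng.standard_normal((D_MLLM, D_COND)) / np.sqrt(D_MLLM)

def encode_condition(text_tokens):
    """Append the learnable queries to the prompt and read back their latents."""
    sequence = np.concatenate([text_tokens, queries], axis=0)
    hidden = frozen_mllm(sequence)
    query_latents = hidden[-N_QUERIES:]      # latents at the query positions
    return query_latents @ W_connector       # shape (N_QUERIES, D_COND)

# Toy linear denoiser standing in for the diffusion decoder (an assumption).
W_denoise = rng.standard_normal((D_IMG + D_COND, D_IMG)) / np.sqrt(D_IMG + D_COND)

def diffusion_loss(image, text_tokens, alpha_bar=0.5):
    """Noise-prediction MSE for one (image, caption) pair at one noise level."""
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * noise
    cond = encode_condition(text_tokens).mean(axis=0)   # pool the query latents
    pred = np.concatenate([noisy, cond]) @ W_denoise
    return float(np.mean((pred - noise) ** 2))

caption = rng.standard_normal((N_TEXT, D_MLLM))   # stand-in caption embeddings
image = rng.standard_normal(D_IMG)                # stand-in image latent
cond = encode_condition(caption)
loss = diffusion_loss(image, caption)
```

In a real training loop under this reading, gradients of `loss` would update only `queries`, `W_connector`, and the denoiser, while `W_frozen` stays fixed, which is what would leave the backbone's understanding behaviour unchanged.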

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | WISE | Biology | 0.49 | MetaQuery-XL |
| Image Generation | WISE | Chemistry | 0.41 | MetaQuery-XL |
| Image Generation | WISE | Cultural | 0.56 | MetaQuery-XL |
| Image Generation | WISE | Overall | 0.55 | MetaQuery-XL |
| Image Generation | WISE | Physics | 0.63 | MetaQuery-XL |
| Image Generation | WISE | Space | 0.62 | MetaQuery-XL |
| Image Generation | WISE | Time | 0.55 | MetaQuery-XL |
| Image Generation | DPG | Overall | 82.05 | MetaQuery-XL |
| Image Generation | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
| Text-to-Image Generation | DPG | Overall | 82.05 | MetaQuery-XL |
| Text-to-Image Generation | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
| 10-shot image generation | DPG | Overall | 82.05 | MetaQuery-XL |
| 10-shot image generation | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
| 1 Image, 2*2 Stitchi | DPG | Overall | 82.05 | MetaQuery-XL |
| 1 Image, 2*2 Stitchi | GenEval | Overall | 0.8 | MetaQuery-XL (Rewrite) |
