
Composing Ensembles of Pre-trained Models via Iterative Consensus

Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Igor Mordatch

2022-10-20 · Question Answering · Mathematical Reasoning · Math · Video Question Answering · Arithmetic Reasoning · Image Generation

Abstract

Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic images but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g., improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general-purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. Project page: https://energy-based-model.github.io/composing-pretrained-models.
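The closed-loop generator/scorer procedure the abstract describes can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: `iterative_consensus`, `toy_generator`, and the toy scorers are hypothetical stand-ins for the pre-trained models (e.g., GLIDE as generator, CLIP and classifiers as scorers).

```python
import random

def iterative_consensus(generate, scorers, rounds=10, candidates=8):
    """Keep the highest-consensus proposal seen so far, and feed it back
    to the generator as guidance for the next round (the closed loop)."""
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        proposals = [generate(best) for _ in range(candidates)]
        for p in proposals:
            # Consensus: average the feedback of the whole scorer ensemble,
            # rather than trusting any single scorer.
            score = sum(s(p) for s in scorers) / len(scorers)
            if score > best_score:
                best, best_score = p, score
    return best

# Toy stand-ins (purely illustrative): the "generator" samples near the
# current best proposal; both "scorers" prefer values close to 3.0.
def toy_generator(guidance):
    center = 0.0 if guidance is None else guidance
    return center + random.uniform(-1.0, 1.0)

toy_scorers = [lambda x: -(x - 3.0) ** 2, lambda x: -abs(x - 3.0)]

random.seed(0)
result = iterative_consensus(toy_generator, toy_scorers, rounds=20, candidates=16)
```

Because the best proposal is fed back as guidance, each round's candidates cluster around the current consensus, so the loop refines the result without any gradient updates to the underlying models, mirroring the paper's no-finetuning claim.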

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Image Generation | ImageNet 64x64 | FID | 29.184 | GLIDE + CLIP + CLS + CLS-FREE
Image Generation | ImageNet 64x64 | Inception Score | 34.952 | GLIDE + CLIP + CLS + CLS-FREE
Image Generation | ImageNet 64x64 | KID | 3.766 | GLIDE + CLIP + CLS + CLS-FREE
Image Generation | ImageNet 64x64 | FID | 29.219 | GLIDE + CLS-FREE
Image Generation | ImageNet 64x64 | Inception Score | 25.926 | GLIDE + CLS-FREE
Image Generation | ImageNet 64x64 | KID | 5.325 | GLIDE + CLS-FREE
Image Generation | ImageNet 64x64 | FID | 30.462 | GLIDE + CLIP
Image Generation | ImageNet 64x64 | Inception Score | 25.017 | GLIDE + CLIP
Image Generation | ImageNet 64x64 | KID | 6.174 | GLIDE + CLIP
Image Generation | ImageNet 64x64 | FID | 30.871 | GLIDE + CLS
Image Generation | ImageNet 64x64 | Inception Score | 22.077 | GLIDE + CLS
Image Generation | ImageNet 64x64 | KID | 7.952 | GLIDE + CLS
Video Question Answering | ActivityNet-QA | Accuracy | 61.2 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)
Video Question Answering | ActivityNet-QA | Accuracy | 58.4 | GPT-2 + CLIP-32 (Zero-Shot)
Arithmetic Reasoning | GSM8K | Accuracy | 20.8 | GPT-2-Medium 355M + question-solution classifier (BS=5)
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 0.355 | GPT-2-Medium 355M + question-solution classifier (BS=5)
Arithmetic Reasoning | GSM8K | Accuracy | 18.3 | GPT-2-Medium 355M (fine-tuned, BS=5)
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 0.355 | GPT-2-Medium 355M (fine-tuned, BS=5)
Arithmetic Reasoning | GSM8K | Accuracy | 16.8 | GPT-2-Medium 355M + question-solution classifier (BS=1)
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 0.355 | GPT-2-Medium 355M + question-solution classifier (BS=1)
Arithmetic Reasoning | GSM8K | Accuracy | 12.2 | GPT-2-Medium 355M (BS=5)
Arithmetic Reasoning | GSM8K | Parameters (Billion) | 0.355 | GPT-2-Medium 355M (BS=5)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)