Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, Shafiq Joty

2023-10-13 | Code Generation | HumanEval

Abstract

Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in the HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models, possibly because they tend to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. Experienced programmers, on the other hand, instinctively write modularized code with abstraction when solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel inference framework that elicits modularized code generation through a chain of self-revisions, each guided by representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized code through chain-of-thought prompting. It then applies a chain of self-revisions by iterating two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and reusable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module implementations and instructing the LLM to regenerate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both the modularity and the correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs and open-source LLMs such as WizardCoder. We also conduct comprehensive ablation studies with different prompting methods, numbers of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.
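
The two-step revision loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for an LLM call, sub-modules are taken to be top-level Python functions, and a simple token-overlap grouping stands in for whatever clustering the paper actually uses. All function names, the prompt wording, and the threshold are invented for this example.

```python
import ast
from typing import Callable, List


def extract_functions(source: str) -> List[str]:
    """Return the source text of each top-level function in a program."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # skip generated samples that do not parse
    return [ast.get_source_segment(source, node)
            for node in tree.body if isinstance(node, ast.FunctionDef)]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a learned embedding distance."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def cluster(modules: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters: List[List[str]] = []
    for module in modules:
        for group in clusters:
            if jaccard(module, group[0]) >= threshold:
                group.append(module)
                break
        else:
            clusters.append([module])
    return clusters


def representative(group: List[str]) -> str:
    """Pick the member most similar, in total, to the rest of its cluster."""
    return max(group, key=lambda m: sum(jaccard(m, other) for other in group))


def codechain_sketch(generate: Callable[[str], str], task_prompt: str,
                     rounds: int = 2, samples: int = 4) -> List[str]:
    """Initial generation followed by `rounds` self-revision passes."""
    solutions = [generate(task_prompt) for _ in range(samples)]
    for _ in range(rounds):
        # Step 1: extract sub-modules, cluster them, keep one per cluster.
        modules = [f for s in solutions for f in extract_functions(s)]
        reps = [representative(group) for group in cluster(modules)]
        # Step 2: augment the original prompt with the representatives
        # and ask for new solutions (prompt wording is hypothetical).
        prompt = (task_prompt
                  + "\n\n# Reusable modules from earlier attempts:\n"
                  + "\n\n".join(reps))
        solutions = [generate(prompt) for _ in range(samples)]
    return solutions
```

In practice `generate` would wrap a model API call; passing a stub that returns a fixed program is enough to exercise the control flow.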

Results

Task | Dataset | Metric | Value | Model
Code Generation | APPS | Competition Pass@1 | 3.75 | WizardCoder-15b
Code Generation | APPS | Interview Pass@1 | 7.49 | WizardCoder-15b
Code Generation | APPS | Introductory Pass@1 | 26.29 | WizardCoder-15b
Code Generation | CodeContests | Test Set pass@1 | 2.35 | CodeChain + WizardCoder-15B
Code Generation | CodeContests | Test Set pass@5 | 3.29 | CodeChain + WizardCoder-15B
Code Generation | CodeContests | Val Set pass@1 | 2.48 | CodeChain + WizardCoder-15B
Code Generation | CodeContests | Val Set pass@5 | 3.3 | CodeChain + WizardCoder-15B
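
This page does not say how the pass@k values above were computed. For orientation, a commonly used definition is the unbiased estimator introduced alongside HumanEval: with n samples per problem, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes that definition; whether the reported numbers follow it exactly, and whether they are percentages, is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of correct samples for a problem.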

Related Papers

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding (2025-07-14)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks (2025-07-14)
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance (2025-07-14)