Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

Martin Weyssow, Aton Kamanda, Xin Zhou, Houari Sahraoui

2024-03-14 · HumanEval
Paper · PDF · Code (official) · Code

Abstract

Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences. Based on this approach, we present CodeUltraFeedback, a comprehensive dataset designed to facilitate the evaluation and improvement of LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are ranked based on five distinct coding preferences using GPT-3.5 as a judge, providing both numerical scores and detailed textual feedback. Our analysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are generally preferred over those from open-weight LLMs, highlighting significant differences in alignment between closed and open-weight models. In turn, we explore the usage of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO). The resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in terms of alignment with coding preferences and shows improved functional correctness on the HumanEval+ benchmark compared to the original instruct model. Therefore, our contributions bridge the gap in preference tuning of LLMs for code and set the stage for further advancements in model alignment and RLAIF in automated software engineering.
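The preference-tuning pipeline the abstract outlines (several judged responses per coding instruction, turned into preference data for DPO) can be sketched roughly as follows. This is a minimal illustration, not the authors' exact recipe: the function name, the dictionary keys, the score scale, and the tie handling are all illustrative assumptions.

```python
def build_dpo_pairs(instruction, responses):
    """Turn judge-annotated responses into DPO-style preference pairs.

    responses: list of (text, judge_score) tuples for one instruction,
    e.g. four responses scored by an LLM judge on a numeric scale.
    Returns pairs of the top-scored response (chosen) against each
    strictly lower-scored response (rejected).
    """
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    chosen_text, chosen_score = ranked[0]
    pairs = []
    for text, score in ranked[1:]:
        if score < chosen_score:  # skip ties: they carry no preference signal
            pairs.append({
                "prompt": instruction,
                "chosen": chosen_text,
                "rejected": text,
            })
    return pairs

# Toy example with four responses, mirroring the dataset's 4-per-instruction setup
pairs = build_dpo_pairs(
    "Write a function that reverses a string.",
    [("resp_a", 9.0), ("resp_b", 6.5), ("resp_c", 9.0), ("resp_d", 3.0)],
)
# → two pairs: resp_a preferred over resp_b and over resp_d; the resp_c tie is dropped
```

A real pipeline would also carry the judge's textual feedback alongside the numeric score and would emit pairs per preference dimension (the paper uses five), but the chosen/rejected structure shown here is the shape DPO training consumes.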

Related Papers

Turning the Tide: Repository-based Code Reflection (2025-07-14)
Rethinking Verification for LLM Code Generation: From Generation to Testing (2025-07-09)
any4: Learned 4-bit Numeric Representation for LLMs (2025-07-07)
SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization (2025-06-25)
Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models (2025-06-23)
AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need (2025-06-18)
Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees (2025-06-17)
LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing (2025-06-17)