Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, Jian Guo

Published: 2024-11-23 · Task: Models Alignment
Paper · PDF · Code (official)

Abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discuss practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

Related Papers

Into the Unknown: From Structure to Disorder in Protein Function Prediction (2025-06-06)
Large Means Left: Political Bias in Large Language Models Increases with Their Number of Parameters (2025-05-07)
AKD: Adversarial Knowledge Distillation For Large Language Models Alignment on Coding Tasks (2025-05-05)
Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models (2025-02-25)
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges (2025-01-04)
InfAlign: Inference-aware Language Model Alignment (2024-12-27)
Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment (2024-10-22)
Negative-Prompt-driven Alignment for Generative Language Model (2024-10-16)