EvoEval
EvoEval is a holistic benchmark suite created by evolving HumanEval problems¹. It contains 828 new problems spread across 5 semantic-altering and 2 semantic-preserving benchmarks¹. EvoEval enables evaluation and comparison of models across different dimensions and problem types, such as Difficult, Creative, or Tool Use problems¹.
The goal of EvoEval is to provide a comprehensive evaluation of the coding abilities of Large Language Models (LLMs)². It was introduced to address the limitations of existing benchmarks, which contain only a very limited set of problems in both quantity and variety².
The EvoEval approach can also be used to further evolve arbitrary problems, keeping pace with advances in the ever-changing landscape of LLMs for code². The suite comes complete with a leaderboard, ground-truth solutions, robust test cases, and evaluation scripts that fit easily into an existing evaluation pipeline¹, as sketched below.
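As a rough illustration, the snippet below sketches how EvoEval problems might be fed to a model and collected for scoring. It is a minimal sketch, assuming the `evoeval` Python package exposes an EvalPlus-style `get_evo_eval(benchmark_name)` loader and that samples are written in a HumanEval-style JSONL format; `generate_completion` is a hypothetical placeholder for whatever model is being benchmarked, and the exact field names and invocation should be checked against the repository README.

```python
# Minimal sketch of plugging EvoEval into a generation pipeline.
# Assumptions (not verified against the repo): the `evoeval` package provides
# an EvalPlus-style `get_evo_eval(benchmark_name)` loader, and the evaluation
# scripts consume HumanEval-style JSONL samples. `generate_completion` is a
# hypothetical stand-in for the LLM under test.
import json

from evoeval.data import get_evo_eval


def generate_completion(prompt: str) -> str:
    """Hypothetical placeholder: call your LLM and return the generated code."""
    raise NotImplementedError


def main() -> None:
    benchmark = "EvoEval_difficult"  # one of the evolved benchmarks, e.g. Difficult
    problems = get_evo_eval(benchmark)  # mapping: task_id -> problem fields

    # One JSON object per line, pairing each task with the model's solution;
    # check the repository for the exact keys its evaluation script expects.
    with open(f"{benchmark}_samples.jsonl", "w") as f:
        for task_id, problem in problems.items():
            solution = generate_completion(problem["prompt"])
            f.write(json.dumps({"task_id": task_id, "solution": solution}) + "\n")


if __name__ == "__main__":
    main()
```

The resulting samples file can then be scored against the ground-truth solutions and test cases with the suite's evaluation scripts; the repository documents the exact command-line invocation.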
(1) evo-eval/evoeval: EvoEval: Evolving Coding Benchmarks via LLM - GitHub. https://github.com/evo-eval/evoeval.
(2) Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM - arXiv. https://arxiv.org/abs/2403.19114.
(3) evoeval (EvoEval) - Hugging Face. https://huggingface.co/evoeval.
(4) EvoEval: Evolving Coding Benchmarks via LLM - Project page. https://evo-eval.github.io/.
(5) Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM - Papers with Code. https://paperswithcode.com/paper/top-leaderboard-ranking-top-coding.
(6) DOI for the EvoEval paper. https://doi.org/10.48550/arXiv.2403.19114.