BigCodeBench
Apache License 2.0 · Introduced 2024-06-22
BigCodeBench is an easy-to-use benchmark for code generation with practical and challenging programming tasks¹. It aims to evaluate the programming capabilities of large language models (LLMs) in a more realistic setting¹. Its tasks are HumanEval-like function-level code generation problems, but with much more complex instructions and more diverse function calls¹.
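To make the contrast with HumanEval concrete, the sketch below shows what a task in this style can look like. It is a hypothetical example rather than an actual benchmark item; the entry-point name `task_func` and the exact specification are illustrative. The point is that a single function-level prompt can require combining several libraries under a precise output contract, rather than performing a short string or arithmetic manipulation.

```python
# Hypothetical BigCodeBench-style task (illustrative, not from the benchmark):
# one function-level prompt whose instruction requires combining several
# libraries (re, collections, pandas) with a precise output contract.

import re
from collections import Counter

import pandas as pd


def task_func(text: str) -> pd.DataFrame:
    """Count word frequencies in `text`, case-insensitively and ignoring
    punctuation, and return a DataFrame with columns ['word', 'count']
    sorted by count in descending order."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return (
        pd.DataFrame(list(counts.items()), columns=["word", "count"])
        .sort_values("count", ascending=False)
        .reset_index(drop=True)
    )
```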
Here are some key features of BigCodeBench:
- Precise evaluation & ranking: BigCodeBench maintains a leaderboard with the latest LLM rankings, before and after rigorous evaluation¹.
- Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing the samples generated by various LLMs¹.
- Execution environment: The execution environment in BigCodeBench is less constrained than EvalPlus's, in order to support tasks with diverse library dependencies¹.
- Test evaluation: BigCodeBench relies on Python's unittest framework to evaluate generated code¹ (see the sketch after this list).
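As a minimal sketch of what unittest-based checking looks like, a task's test suite resembles the class below. It exercises the hypothetical `task_func` from the earlier example (assumed to be defined in the same module), and the specific assertions are assumptions made for illustration, not tests drawn from the benchmark.

```python
# Minimal sketch of unittest-style evaluation: each task ships test cases
# that are executed against the generated solution. `task_func` refers to
# the hypothetical example shown earlier in this section.

import unittest


class TestTaskFunc(unittest.TestCase):
    def test_counts_and_order(self):
        df = task_func("the cat saw the dog and the cat")
        # 'the' occurs three times and must be ranked first.
        self.assertEqual(df.iloc[0]["word"], "the")
        self.assertEqual(int(df.iloc[0]["count"]), 3)

    def test_case_insensitive(self):
        df = task_func("Apple apple APPLE")
        # All three variants should collapse into one word with count 3.
        self.assertEqual(len(df), 1)
        self.assertEqual(int(df.iloc[0]["count"]), 3)


if __name__ == "__main__":
    unittest.main()
```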
¹ GitHub, bigcode-project/bigcodebench: BigCodeBench: The Next .... https://github.com/bigcode-project/bigcodebench/