COFFE: A Code Efficiency Benchmark for Code Generation
COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released with the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.
COFFE is designed for evaluating both function-level code and file-level code. It contains selected instances from HumanEval, MBPP, APPS and Code Contests. COFFE keeps the original test cases in these benchmarks as correctness test cases and adds new test cases designed for time efficiency evaluation as stressful test cases.
Statistics:
|Category|#Instance|#Solution/Instance|#Correctness/Instance|#Stressful/Instance|
|----|----|----|----|----|
|Function-Level|398|1.00|5.72|4.99|
|File-Level|358|66.93|43.68|4.95|
Data Files:
All instances in COFFE are in Coffe/datasets, where Coffe/datasets/function contains all function-level instances and Coffe/datasets/file contains all file-level instances. In each directory:
- `best_solutions.json` contains the best ground truth solution COFFE uses to calculate efficient@1 and speedup.
- `stressful_testcases.json` contains all stressful test cases COFFE adds.
- `solutions.json` contains all ground truth solutions from the original benchmarks.
- `testcases.json` contains all correctness test cases from the original benchmarks.

Requirements:
We suggest you create a virtual environment before installing COFFE. Place the COFFE source code under workspace/ and execute:

```bash
cd Coffe && pip install .
```
To build the Docker image used for evaluation, run:

```bash
docker build . -t coffe
```
Note that if your network requires a proxy, please modify the Dockerfile in Coffe/ accordingly; otherwise the Docker image build could fail.
Then, from the parent directory, initialize COFFE:

```bash
cd .. && coffe init
```
If your installation succeeds, you should see the statistics of COFFE.
Once you have prepared the predictions from LLMs, COFFE provides a pipeline to calculate the efficient@1 and speedup metrics defined in the paper:
```bash
coffe pipe <benchmark_name> <output_repo> \
    -p <pred_file_1>,...,<pred_file_n> \
    -f <efficient_at_1/speedup> \
    -n <number_of_workers>
```
This command runs four phases in sequence, corresponding to the four single `coffe eval` commands described below.
For example:
```bash
coffe pipe function Coffe/examples/function -p Coffe/examples/function/GPT-4o.json -f efficient_at_1 -n 8
```
This command evaluates the predictions from GPT-4o on the function-level instances of COFFE. If you want to evaluate other LLMs, please prepare a JSON file with the same format as Coffe/examples/function/GPT-4o.json.
Prediction File Format:
In the JSON file, each key is the prompt used to query the LLM for the results; you can find the prompts in datasets/function/prompts.json and datasets/file/prompts.json. Each value contains two objects: the first is a list containing the raw outputs from the LLM, and the second is an indicator of whether the raw output is valid.
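As an illustration only, the layout described above can be sketched in Python. The prompt string and the exact shape of the validity indicator are assumptions here (one flag per raw output); refer to Coffe/examples/function/GPT-4o.json for the authoritative format.

```python
import json

# Hypothetical sketch of a prediction file: {prompt: [raw_outputs, validity_indicator]}.
# The prompt key would come from datasets/function/prompts.json; the indicator
# shape (a list of booleans, one per output) is an assumption for illustration.
predictions = {
    "Write a function that returns the sum of a list.": [
        ["def solve(nums):\n    return sum(nums)"],  # raw LLM outputs
        [True],                                      # validity flags (assumed shape)
    ],
}

# Serialize the way a real prediction file (e.g. GPT-4o.json) would be written.
serialized = json.dumps(predictions, indent=2)
```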
Note:
By default, COFFE runs all predictions in Docker. However, if you cannot install Docker or want to run the predictions on the host machine, you can add the -x option.
The pipe command provides an entire pipeline for calculating the final metrics. This pipeline can also be completed by executing the following four single evaluation commands.
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <pred_file_1>,...,<pred_file_n> \
    -m compilable_rate
```
This command outputs a file ending with SOLUTIONS.json that contains the predictions without syntax errors.
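COFFE's actual filtering logic lives in the tool itself; conceptually, a compilable check for a Python prediction boils down to whether the source parses, which can be sketched with the built-in compile():

```python
def is_compilable(source: str) -> bool:
    """Return True if the source parses as valid Python (no syntax errors)."""
    try:
        compile(source, "<prediction>", "exec")
        return True
    except SyntaxError:
        return False

# A well-formed prediction passes; one with a syntax error is filtered out.
assert is_compilable("def add(a, b):\n    return a + b")
assert not is_compilable("def add(a, b):\n    return a +")
```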
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <pred_file_1>,...,<pred_file_n> \
    -m correctness \
    -n <number_of_workers>
```
This command accepts prediction files ending with SOLUTIONS.json and outputs a file ending with PASSED_SOLUTIONS.json that contains the predictions that pass all correctness test cases.
Note:
This command combines all correct solutions and ground truth solutions into the files <dataset_name>_all_indexes.json (used in step 4) and <dataset_name>_all_PASSED_SOLUTIONS.json for the next step.
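COFFE runs the real correctness test cases inside its sandbox; conceptually, the check amounts to running each prediction against every correctness test case and keeping only predictions that pass all of them. A minimal sketch, with a hypothetical (input_args, expected_output) representation of test cases:

```python
def passes_all(fn, testcases):
    """Check a candidate function against (input_args, expected_output) pairs.

    This representation of test cases is an assumption for illustration; see
    testcases.json in the datasets for COFFE's actual format.
    """
    return all(fn(*args) == expected for args, expected in testcases)

testcases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
assert passes_all(lambda a, b: a + b, testcases)       # correct prediction kept
assert not passes_all(lambda a, b: a * b, testcases)   # wrong prediction filtered
```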
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <pred_file> \
    -m instr_count \
    -n <number_of_workers>
```
This command evaluates the CPU instruction count each prediction consumes and outputs a file ending with STRESSFUL_INSTRUCTION.json.
Note:
This command accepts only a single prediction file ending with PASSED_SOLUTIONS.json.
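COFFE measures CPU instruction counts for stability rather than wall-clock time. As an illustration of the underlying idea only (not COFFE's actual measurement), a wall-clock proxy with time.perf_counter shows how a stressful input separates an efficient solution from an inefficient one:

```python
import time

def cost(fn, arg, repeats=5):
    """Rough wall-clock cost of fn(arg); COFFE itself counts CPU instructions instead."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

# A stressful input (large n) exposes the gap between O(n) and O(1) solutions.
slow_sum = lambda n: sum(range(n + 1))
fast_sum = lambda n: n * (n + 1) // 2
print(cost(slow_sum, 10**6), cost(fast_sum, 10**6))
```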
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <index_file>,<pred_file> \
    -m instr_count \
    -f <efficient_at_1/speedup>
```
This command calculates the efficient@1 or speedup metric.
Note: This command requires the index file and the instruction file, as COFFE compares the performance of predictions with ground truth solutions to calculate the metrics.
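The exact definitions of efficient@1 and speedup are given in the paper. As a rough sketch only, assuming cost means the measured instruction count on the stressful test cases, the comparison against ground truth could look like this (both function bodies are assumptions, not COFFE's actual formulas):

```python
def speedup(gt_cost: float, pred_cost: float) -> float:
    """Ratio of ground-truth cost to prediction cost; >1 means the prediction is faster.
    This mirrors the conventional definition of speedup; see the paper for COFFE's exact formula."""
    return gt_cost / pred_cost

def efficient_at_1(gt_costs, pred_costs):
    """Hypothetical sketch: fraction of instances whose (correct) prediction
    is at least as efficient as the best ground-truth solution."""
    pairs = list(zip(gt_costs, pred_costs))
    return sum(p <= g for g, p in pairs) / len(pairs)

assert speedup(2.0, 1.0) == 2.0
assert efficient_at_1([10, 10], [5, 20]) == 0.5
```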
For details about the stressful test case generation approach STGen, please see stgen/.
If you use COFFE, please cite us:
```
@misc{peng2025coffe,
  title={COFFE: A Code Efficiency Benchmark for Code Generation},
  author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
  year={2025},
  eprint={2502.02827},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2502.02827},
}
```