COFFE: A Code Efficiency Benchmark for Code Generation
COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released with the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.
COFFE is designed for evaluating both function-level code and file-level code. It contains selected instances from HumanEval, MBPP, APPS and Code Contests. COFFE keeps the original test cases in these benchmarks as correctness test cases and adds new test cases designed for time efficiency evaluation as stressful test cases.
Statistics:
|Category|#Instance|#Solution/Instance|#Correctness/Instance|#Stressful/Instance|
|----|----|----|----|----|
|Function-Level|398|1.00|5.72|4.99|
|File-Level|358|66.93|43.68|4.95|
Data Files:
All instances in COFFE are in Coffe/datasets, where Coffe/datasets/function contains all function-level instances and Coffe/datasets/file contains all file-level instances. In each directory:
- `best_solutions.json` contains the best ground truth solution COFFE uses to calculate efficient@1 and speedup.
- `stressful_testcases.json` contains all stressful test cases COFFE adds.
- `solutions.json` contains all ground truth solutions from the original benchmarks.
- `testcases.json` contains all correctness test cases from the original benchmarks.

Requirements:
We suggest you create a virtual environment before installing COFFE. Place the COFFE source code under workspace/ and execute:

```bash
cd Coffe && pip install .
```
To build the Docker image used for evaluation, run:

```bash
docker build . -t coffe
```
Note that if your network requires a proxy, please modify the Dockerfile in Coffe/ accordingly; otherwise the Docker image build could fail.
Then, from the parent directory, initialize COFFE:

```bash
cd .. && coffe init
```
If your installation succeeds, you should see the statistics of COFFE.
Once you have prepared the predictions from LLMs, COFFE provides a pipeline to calculate the efficient@1 and speedup metrics defined in the paper:
```bash
coffe pipe <benchmark_name> <output_repo> \
    -p <pred_file_1>,...,<pred_file_n> \
    -f <efficient_at_1/speedup> \
    -n <number_of_workers>
```
This command runs four phases in sequence, corresponding to the four single `coffe eval` commands described below.
For example:
```bash
coffe pipe function Coffe/examples/function -p Coffe/examples/function/GPT-4o.json -f efficient_at_1 -n 8
```
This command evaluates the predictions from GPT-4o on the function-level instances of COFFE. If you want to evaluate other LLMs, please prepare a JSON file with the same format as Coffe/examples/function/GPT-4o.json.
Prediction File Format:
In the JSON file, each key is the prompt used to query the LLM for the results; you can find the prompts in datasets/function/prompts.json and datasets/file/prompts.json. Each value contains two objects: the first is a list containing the raw outputs from the LLM, and the second is an indicator of whether the raw output is valid.
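As an illustration only, the layout described above can be sketched in Python. The prompt string and the exact shape of the validity indicator are assumptions here (one flag per raw output); refer to Coffe/examples/function/GPT-4o.json for the authoritative format.

```python
import json

# Hypothetical sketch of a prediction file: {prompt: [raw_outputs, validity_indicator]}.
# The prompt key would come from datasets/function/prompts.json; the indicator
# shape (a list of booleans, one per output) is an assumption for illustration.
predictions = {
    "Write a function that returns the sum of a list.": [
        ["def solve(nums):\n    return sum(nums)"],  # raw LLM outputs
        [True],                                      # validity flags (assumed shape)
    ],
}

# Serialize the way a real prediction file (e.g. GPT-4o.json) would be written.
serialized = json.dumps(predictions, indent=2)
```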
Note:
By default, COFFE runs all predictions in Docker. However, if you cannot install Docker or want to run the predictions on the host machine, you can add the -x option.
The pipe command provides an entire pipeline for calculating the final metrics. This pipeline can also be completed by executing the following four single evaluation commands.
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <pred_file_1>,...,<pred_file_n> \
    -m compilable_rate
```
This command outputs a file ending with SOLUTIONS.json that contains the predictions without syntax errors.
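COFFE's actual filtering logic lives in the tool itself; conceptually, a compilable check for a Python prediction boils down to whether the source parses, which can be sketched with the built-in compile():

```python
def is_compilable(source: str) -> bool:
    """Return True if the source parses as valid Python (no syntax errors)."""
    try:
        compile(source, "<prediction>", "exec")
        return True
    except SyntaxError:
        return False

# A well-formed prediction passes; one with a syntax error is filtered out.
assert is_compilable("def add(a, b):\n    return a + b")
assert not is_compilable("def add(a, b):\n    return a +")
```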
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <pred_file_1>,...,<pred_file_n> \
    -m correctness \
    -n <number_of_workers>
```
This command accepts prediction files ending with SOLUTIONS.json and outputs a file ending with PASSED_SOLUTIONS.json that contains the predictions that pass all correctness test cases.
Note:
This command combines all correct solutions and ground truth solutions into the files <dataset_name>_all_indexes.json (used in step 4) and <dataset_name>_all_PASSED_SOLUTIONS.json for the next step.
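COFFE runs the real correctness test cases inside its sandbox; conceptually, the check amounts to running each prediction against every correctness test case and keeping only predictions that pass all of them. A minimal sketch, with a hypothetical (input_args, expected_output) representation of test cases:

```python
def passes_all(fn, testcases):
    """Check a candidate function against (input_args, expected_output) pairs.

    This representation of test cases is an assumption for illustration; see
    testcases.json in the datasets for COFFE's actual format.
    """
    return all(fn(*args) == expected for args, expected in testcases)

testcases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
assert passes_all(lambda a, b: a + b, testcases)       # correct prediction kept
assert not passes_all(lambda a, b: a * b, testcases)   # wrong prediction filtered
```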
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <pred_file> \
    -m instr_count \
    -n <number_of_workers>
```
This command evaluates the CPU instruction count each prediction consumes and outputs a file ending with STRESSFUL_INSTRUCTION.json.
Note:
This command accepts only a single prediction file ending with PASSED_SOLUTIONS.json.
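COFFE measures CPU instruction counts for stability rather than wall-clock time. As an illustration of the underlying idea only (not COFFE's actual measurement), a wall-clock proxy with time.perf_counter shows how a stressful input separates an efficient solution from an inefficient one:

```python
import time

def cost(fn, arg, repeats=5):
    """Rough wall-clock cost of fn(arg); COFFE itself counts CPU instructions instead."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

# A stressful input (large n) exposes the gap between O(n) and O(1) solutions.
slow_sum = lambda n: sum(range(n + 1))
fast_sum = lambda n: n * (n + 1) // 2
print(cost(slow_sum, 10**6), cost(fast_sum, 10**6))
```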
```bash
coffe eval <benchmark_name> <output_repo> \
    -p <index_file>,<pred_file> \
    -m instr_count \
    -f <efficient_at_1/speedup>
```
This command calculates the efficient@1 or speedup metric.
Note: This command requires the index file and the instruction file, as COFFE compares the performance of predictions with ground truth solutions to calculate the metrics.
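The exact definitions of efficient@1 and speedup are given in the paper. As a rough sketch only, assuming cost means the measured instruction count on the stressful test cases, the comparison against ground truth could look like this (both function bodies are assumptions, not COFFE's actual formulas):

```python
def speedup(gt_cost: float, pred_cost: float) -> float:
    """Ratio of ground-truth cost to prediction cost; >1 means the prediction is faster.
    This mirrors the conventional definition of speedup; see the paper for COFFE's exact formula."""
    return gt_cost / pred_cost

def efficient_at_1(gt_costs, pred_costs):
    """Hypothetical sketch: fraction of instances whose (correct) prediction
    is at least as efficient as the best ground-truth solution."""
    pairs = list(zip(gt_costs, pred_costs))
    return sum(p <= g for g, p in pairs) / len(pairs)

assert speedup(2.0, 1.0) == 2.0
assert efficient_at_1([10, 10], [5, 20]) == 0.5
```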
For details about the stressful test case generation approach STGen, please see stgen/.
If you use COFFE, please cite us:
```
@misc{peng2025coffe,
  title={COFFE: A Code Efficiency Benchmark for Code Generation},
  author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
  year={2025},
  eprint={2502.02827},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2502.02827},
}
```