
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

2024-03-07 · Code Completion

Abstract

We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference using LLMs. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.
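To make the Fill-in-the-Middle setup concrete, here is a minimal sketch of one possible prompt layout (prefix, then suffix, then the middle to be generated) for a conditional-expression completion. The sentinel strings, the helper names, and the check by execution are illustrative assumptions, not the prompt designs or post-processing used by SAFIM; real models define their own special tokens.

```python
# Hypothetical sentinel strings, chosen for illustration only.
PREFIX_TOK = "<PRE>"
SUFFIX_TOK = "<SUF>"
MIDDLE_TOK = "<MID>"


def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Lay out the context as prefix, suffix, then a marker for the middle."""
    return f"{PREFIX_TOK}{prefix}{SUFFIX_TOK}{suffix}{MIDDLE_TOK}"


def splice(prefix: str, completion: str, suffix: str) -> str:
    """Rebuild the full program from the context and a candidate middle."""
    return prefix + completion + suffix


# The masked span is the condition of the `if` statement.
prefix = "def absolute(x):\n    if "
suffix = ":\n        return -x\n    return x\n"

prompt = build_psm_prompt(prefix, suffix)

# Stand-in for a model's output; no model is called in this sketch.
candidate = "x < 0"

# Run the spliced program and test its behavior (an assumed check).
program = splice(prefix, candidate, suffix)
namespace = {}
exec(program, namespace)
assert namespace["absolute"](-3) == 3
assert namespace["absolute"](4) == 4
```

A Left-to-Right prompt would instead pass only `prefix` to the model, so the model never sees the code that follows the masked span.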

Results

Task: Code Completion · Dataset: SAFIM

| Model | API | Algorithmic | Average | Control |
|---|---|---|---|---|
| deepseek-coder-33b-base | 75.16 | 60.78 | 69.01 | 71.10 |
| deepseek-coder-6.7b-base | 69.68 | 54.74 | 63.40 | 65.79 |
| starcoderbase | 68.06 | 44.11 | 55.54 | 54.46 |
| gpt-4-1106-preview | 62.58 | 42.11 | 53.28 | 55.15 |
| CodeLlama-13b-hf | 59.68 | 41.41 | 52.78 | 57.25 |
| deepseek-coder-1.3b-base | 62.58 | 41.20 | 52.63 | 54.10 |
| CodeLlama-34b-hf | 56.45 | 38.55 | 49.66 | 53.98 |
| CodeLlama-7b-hf | 46.77 | 34.68 | 45.00 | 53.56 |
| gpt-3.5-turbo-0301 | 53.87 | 31.24 | 40.86 | 37.48 |
| incoder-6B | 48.06 | 25.16 | 33.79 | 28.16 |
| codegen-16B-multi | 31.29 | 25.94 | 30.99 | 35.74 |
| codegen-2B-multi | 32.26 | 23.49 | 29.55 | 32.89 |
| incoder-1B | 43.87 | 21.06 | 29.27 | 22.89 |
| codegen-6B-multi | 27.74 | 23.60 | 28.71 | 34.80 |
| codegen-350M-multi | 26.45 | 16.30 | 22.94 | 26.06 |
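The Average values listed above appear to be the unweighted mean of the API, Algorithmic, and Control scores. This is an observation from the numbers rather than something the page states, so the sketch below only checks it for three rows copied from the results; the dictionary layout and function name are assumptions made for illustration.

```python
# Three rows copied from the results above.
scores = {
    "deepseek-coder-33b-base": {
        "API": 75.16, "Algorithmic": 60.78, "Control": 71.10, "Average": 69.01,
    },
    "gpt-4-1106-preview": {
        "API": 62.58, "Algorithmic": 42.11, "Control": 55.15, "Average": 53.28,
    },
    "CodeLlama-7b-hf": {
        "API": 46.77, "Algorithmic": 34.68, "Control": 53.56, "Average": 45.00,
    },
}


def mean_of_categories(row: dict) -> float:
    """Unweighted mean of the three per-category scores."""
    return (row["API"] + row["Algorithmic"] + row["Control"]) / 3


for model, row in scores.items():
    recomputed = mean_of_categories(row)
    # Allow for rounding to two decimals in the reported value.
    assert abs(recomputed - row["Average"]) < 0.01, model
    print(f"{model}: reported {row['Average']:.2f}, recomputed {recomputed:.2f}")
```

If the assumption holds, a model's ranking by Average weights the three categories equally, regardless of how many examples each category contains.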

Related Papers

Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents (2025-06-24)
Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models (2025-06-23)
Seed-Coder: Let the Code Model Curate Data for Itself (2025-06-04)
HiLDe: Intentional Code Generation via Human-in-the-Loop Decoding (2025-05-28)
SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development (2025-05-22)
Structure-Aware Corpus Construction and User-Perception-Aligned Metrics for Large-Language-Model Code Completion (2025-05-19)
Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification (2025-05-19)
Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective (2025-05-15)