Enhancing Code LLM Training with Programmer Attention

Yifan Zhang, Chen Huang, Zachary Karas, Dung Thuy Nguyen, Kevin Leach, Yu Huang

2025-03-19 · Code Summarization

Abstract

Human attention provides valuable yet underexploited signals for code LLM training, offering a perspective beyond purely machine-driven attention. Eye-tracking data is complex and costly to collect, and there has been limited progress in systematically using these signals for code LLM training. To address both issues, we propose a cohesive pipeline spanning augmentation and reward-based fine-tuning. Specifically, we introduce (1) an eye-tracking path augmentation method to expand programmer attention datasets, (2) a pattern abstraction step that refines raw fixations into learnable attention motifs, and (3) a reward-guided strategy for integrating these insights directly into a CodeT5 supervised fine-tuning process. Our experiments yield a +7.16 CodeBLEU improvement on the CodeXGLUE benchmark for code summarization, underscoring how uniting human and machine attention can boost code intelligence. We hope this work encourages broader exploration of human-centric methods in next-generation AI4SE.
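To make the three stages named in the abstract more concrete, the following toy sketch shows one way such a pipeline could be wired together. It is not the authors' implementation: the function names, the adjacent-swap augmentation, the n-gram motif definition, and the overlap-based reward are all assumptions chosen for illustration, and the scanpaths are invented labels rather than real eye-tracking data.

```python
import random
from collections import Counter


def augment_scanpath(path, swap_prob=0.2, seed=0):
    """Return a perturbed copy of a fixation sequence.

    Assumed augmentation: occasionally swap two adjacent fixations,
    which keeps the same set of fixated elements but varies their order.
    """
    rng = random.Random(seed)
    out = list(path)
    i = 0
    while i < len(out) - 1:
        if rng.random() < swap_prob:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out


def extract_motifs(paths, n=2, top_k=3):
    """Abstract raw scanpaths into the top_k most frequent n-grams.

    Assumed motif definition: a contiguous run of n fixated categories.
    """
    counts = Counter()
    for path in paths:
        for i in range(len(path) - n + 1):
            counts[tuple(path[i:i + n])] += 1
    return [motif for motif, _ in counts.most_common(top_k)]


def motif_reward(model_order, motifs):
    """Fraction of motifs found as contiguous runs in the model's attention order.

    Assumed reward: a scalar in [0, 1] that a fine-tuning loop could
    combine with its usual training objective. All motifs are expected
    to have the same length.
    """
    if not motifs:
        return 0.0
    n = len(motifs[0])
    grams = {tuple(model_order[i:i + n]) for i in range(len(model_order) - n + 1)}
    return sum(motif in grams for motif in motifs) / len(motifs)


if __name__ == "__main__":
    # Hypothetical scanpaths over coarse code-element categories.
    human_paths = [
        ["name", "params", "body", "return"],
        ["name", "params", "return"],
    ]
    expanded = human_paths + [augment_scanpath(p, seed=1) for p in human_paths]
    motifs = extract_motifs(expanded, n=2, top_k=2)
    print(motifs, motif_reward(["name", "params", "body"], motifs))
```

In this sketch the reward is kept separate from any model code, so it could in principle be attached to a CodeT5 fine-tuning loop as an additional scalar signal; how the paper actually integrates the reward is not specified in the abstract.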

Related Papers

Rethinking the effects of data contamination in Code Intelligence (2025-06-03)
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks (2025-05-27)
LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models (2025-05-20)
EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective (2025-05-18)
Variational Prefix Tuning for Diverse and Accurate Code Summarization Using Pre-trained Language Models (2025-05-14)
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks (2025-04-28)
Code-Craft: Hierarchical Graph-Based Code Summarization for Enhanced Context Retrieval (2025-04-11)
Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization (2025-03-13)