Learning Code Preference via Synthetic Evolution

Jiawei Liu, Thanh Nguyen, Mingyue Shang, Hantian Ding, Xiaopeng Li, Yu Yu, Varun Kumar, Zijian Wang

Work done during a research internship at AWS AI Labs.


Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) how do we train models to predict meaningful preferences for code? and (ii) how well do human and LLM preferences align with verifiable code properties and developer code tastes? Furthermore, we uncover the prohibitive cost and limitations of human-based code preference: despite annotators spending 23.4 person-minutes on each task, 15.1–40.3% of tasks remain unsolved. Compared to model-based preference, human preference tends to be more accurate under the objective of code correctness, while being sub-optimal for non-functional objectives.

Large Language Models (LLMs) for code (Chen et al., 2021; GitHub, 2023; Amazon Web Services, 2023) have become instrumental in modern software development. Code LLMs assist developers in various scenarios, from suggesting code completions and generating functional code based on user instructions to proposing complex code changes that resolve bug reports and feature requests. Instruction-tuned LLMs (Luo et al., 2024; Wei et al., 2024) are increasingly adept at generating functional code from natural language instructions. However, evaluating the quality of LLM-generated code remains challenging, particularly regarding correctness, efficiency, security, adherence to best practices, and alignment with developer preferences. Effectively and efficiently assessing LLM-generated code against these properties is crucial both for evaluation (Liu et al., 2023b) and for preference optimization of code LLMs (Weyssow et al., 2024). Nevertheless, learning code preferences has been largely under-explored, motivating us to study code preferences systematically and to train code preference models with new data and modeling methods.

Following the established format of LLM-as-a-judge (Chiang et al., 2024), we define the code preference task as follows: given a user query, a pair of candidate code responses, and optionally a preference criterion, code preference is demonstrated by choosing one response over the other. A minimal sketch of this pairwise interface appears below.

Code execution: Alternatively, code preference can be confidently determined by execution statuses (Liu et al., 2023a). However, applying code execution to arbitrary programs poses challenges due to (i) setup complexity, (ii) code incompleteness, and (iii) execution overhead. A corresponding execution-based sketch also follows below.
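To make the task definition concrete, here is a minimal sketch of the pairwise LLM-as-a-judge interface, assuming only a generic prompt-to-completion callable. The prompt wording, the `judge_fn` parameter, and the `judge_preference` helper are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable

# Hypothetical prompt template: query, two candidates, optional criterion.
JUDGE_TEMPLATE = """\
You are comparing two candidate code responses to the same user query.

User query:
{query}

Response A:
{response_a}

Response B:
{response_b}

Preference criterion: {criterion}

Reply with a single letter, "A" or "B", for the preferred response."""


def judge_preference(judge_fn: Callable[[str], str],
                     query: str,
                     response_a: str,
                     response_b: str,
                     criterion: str = "overall code quality") -> str:
    """Ask an LLM judge to pick between two code responses.

    `judge_fn` is any prompt -> completion callable (e.g., a thin
    wrapper around a chat-completion API).
    """
    prompt = JUDGE_TEMPLATE.format(query=query, response_a=response_a,
                                   response_b=response_b, criterion=criterion)
    verdict = judge_fn(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```

Swapping the criterion string lets the same interface probe correctness, efficiency, or security preferences separately, matching the optional criterion in the task definition above.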
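For the execution route, the following is a minimal sketch under the assumption that each candidate is a self-contained Python snippet paired with assertion-style test snippets. The `passes` and `execution_preference` helpers are hypothetical names; a real harness would additionally handle dependency setup and sandboxing, which is exactly the setup complexity noted above.

```python
import os
import subprocess
import sys
import tempfile


def passes(code: str, test: str, timeout: float = 5.0) -> bool:
    """Run candidate code plus one test snippet in a subprocess.

    Returns True iff the process exits cleanly within the time budget.
    Incomplete code simply fails to run; the timeout bounds overhead.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)


def execution_preference(code_a: str, code_b: str,
                         tests: list[str]) -> str:
    """Prefer the candidate that passes more tests; 'tie' otherwise."""
    score_a = sum(passes(code_a, t) for t in tests)
    score_b = sum(passes(code_b, t) for t in tests)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"
```

The tie case and per-test subprocess isolation reflect the limitations listed above: execution gives a confident signal only when programs are complete and runnable, and each run adds wall-clock cost.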