Hajipour, Hossein
SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models
Hajipour, Hossein, Yu, Ning, Staicu, Cristian-Alexandru, Fritz, Mario
Large code datasets have become increasingly accessible for pre-training source code models. However, for the fine-tuning phase, obtaining representative training data that fully covers the code distribution for specific downstream tasks remains challenging due to the task-specific nature and limited labeling resources. Moreover, fine-tuning pretrained models can result in forgetting previously acquired pre-training knowledge. These issues lead to out-of-distribution (OOD) generalization problems with unexpected model inference behaviors that have not yet been systematically studied. In this paper, we contribute the first systematic approach that simulates various OOD scenarios along different dimensions of source code data properties and studies the fine-tuned model behaviors in such scenarios. We investigate the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning. Our comprehensive analysis, conducted on four state-of-the-art pretrained models and applied to two code generation tasks, exposes multiple failure modes attributed to OOD generalization issues. Additionally, our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.

There has been increasing success in applying Large Language Models (LLMs) to various source code understanding and generation tasks. LLMs for code such as CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), CodeT5+ (Wang et al., 2023), CodeGen (Nijkamp et al., 2023), and Code Llama (Rozière et al., 2023) are pretrained on large-scale source code datasets and serve as universal initialization for a variety of downstream tasks. The emerging abilities of LLMs, such as in-context learning, demonstrate their potential to handle a wide range of tasks (Wei et al., 2022; Brown et al., 2020). However, it has been shown that not all tasks can be effectively addressed by relying only on pretrained LLMs (Anil et al., 2022). To adapt pretrained models to specific tasks, they can be fine-tuned with specific datasets for each downstream task. This fine-tuning process can involve optimizing all parameters or adopting a parameter-efficient approach (Houlsby et al., 2019; Hu et al., 2022), such as Low-Rank Adaptation (LoRA) (Hu et al., 2022).

[Figure 1: Our approach simulates out-of-distribution (OOD) scenarios and analyzes the corresponding behaviors of models.]
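For readers unfamiliar with LoRA, the sketch below illustrates parameter-efficient fine-tuning of a pretrained code model using the Hugging Face transformers and peft libraries. The model name, target modules, and hyperparameters are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch of LoRA fine-tuning a pretrained code model with the
# Hugging Face `peft` library (model and hyperparameters are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Salesforce/codegen-350M-mono"  # any causal code LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA injects small low-rank adapter matrices into selected projection
# layers and freezes the original weights, so far fewer parameters are
# updated than in full fine-tuning.
lora_config = LoraConfig(
    r=16,                         # rank of the low-rank update
    lora_alpha=32,                # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # module names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

The wrapped model can then be trained on the downstream fine-tuning dataset with a standard training loop or trainer, while the base weights stay frozen.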
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models
Hajipour, Hossein, Hassler, Keno, Holz, Thorsten, Schönherr, Lea, Fritz, Mario
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair programming, and tools such as GitHub Copilot have emerged as part of the daily programming workflow used by millions of developers. The training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure. While these models have been extensively assessed for their ability to produce functionally correct programs, there remains a lack of comprehensive investigations and benchmarks addressing the security aspects of these models. In this work, we propose a method to systematically study the security issues of code language models to assess their susceptibility to generating vulnerable code. To this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. To achieve this, we present an approach to approximate inversion of the black-box code generation models based on few-shot prompting. We evaluate the effectiveness of our approach by examining code language models in generating high-risk security weaknesses. Furthermore, we establish a collection of diverse non-secure prompts for various vulnerability scenarios using our method. This dataset forms a benchmark for evaluating and comparing the security weaknesses in code language models.
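The sketch below illustrates the general shape of few-shot prompting against a black-box code generation model, in the spirit of the approach described above. The helper names and the callable wrapping the model's API are hypothetical; the paper's actual prompt construction and security evaluation (e.g., with a static analyzer) are more involved.

```python
# Minimal sketch of few-shot prompting a black-box code generation model.
# `model_generate` is a hypothetical callable wrapping whatever API the
# black-box model exposes; the demonstration pool and the downstream
# vulnerability check are omitted here.
def build_few_shot_prompt(demonstrations, query_prefix):
    """Concatenate a few demonstration snippets before the target prompt."""
    return "\n\n".join(demonstrations) + "\n\n" + query_prefix

def sample_completions(model_generate, prompt, n=5):
    """Sample several completions from the black-box model for one prompt."""
    return [model_generate(prompt) for _ in range(n)]

# Usage (hypothetical):
# prompt = build_few_shot_prompt(example_snippets, "def parse_user_input(data):")
# candidates = sample_completions(model_generate, prompt)
# The candidates would then be checked by a static analyzer for weaknesses.
```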
IReEn: Iterative Reverse-Engineering of Black-Box Functions via Neural Program Synthesis
Hajipour, Hossein, Malinowski, Mateusz, Fritz, Mario
In this work, we investigate the problem of revealing the functionality of a black-box agent. Notably, we are interested in an interpretable and formal description of the behavior of such an agent. Ideally, this description would take the form of a program written in a high-level language. This task is also known as reverse engineering and plays a pivotal role in software engineering and computer security, and most recently also in interpretability. In contrast to prior work, we do not rely on privileged information about the black box, but rather investigate the problem under the weaker assumption of having access only to the inputs and outputs of the program. We approach this problem by iteratively refining a candidate set using a generative neural program synthesis approach until we arrive at a functionally equivalent program. We assess the performance of our approach on the Karel dataset. Our results show that the proposed approach outperforms the state-of-the-art on this challenge by finding a functionally equivalent program in 78% of cases -- even exceeding prior work that had privileged information about the black box.
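The sketch below illustrates the iterative refinement loop at a high level: sample candidate programs conditioned on observed input/output pairs, keep the best-scoring ones, and gather further observations until a candidate is consistent with everything seen so far. The synthesizer interface and the input-proposal helper are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of iterative black-box reverse engineering via neural
# program synthesis. `black_box`, `synthesizer`, and `propose_inputs` are
# hypothetical stand-ins for the trained components.
def iterative_reverse_engineer(black_box, synthesizer, initial_inputs,
                               propose_inputs, rounds=5, beam=64):
    # Observe the black box on the initial inputs.
    io_pairs = [(x, black_box(x)) for x in initial_inputs]
    best = None
    for _ in range(rounds):
        # Propose candidate programs conditioned on the current I/O spec.
        candidates = synthesizer.sample(io_pairs, n=beam)
        # Rank candidates by how many observed pairs they reproduce.
        scored = sorted(
            candidates,
            key=lambda prog: sum(prog(x) == y for x, y in io_pairs),
            reverse=True,
        )
        best = scored[0]
        if all(best(x) == y for x, y in io_pairs):
            return best  # consistent with every observation so far
        # Otherwise gather more observations and refine the candidate set.
        io_pairs += [(x, black_box(x)) for x in propose_inputs(io_pairs)]
    return best
```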
SampleFix: Learning to Correct Programs by Sampling Diverse Fixes
Hajipour, Hossein, Bhattacharya, Apratim, Fritz, Mario
Automatic program correction is an active topic of research, which holds the potential of dramatically improving the productivity of programmers during the software development process and the correctness of software in general. Recent advances in machine learning, deep learning, and NLP have rekindled the hope to eventually fully automate the process of repairing programs. A key challenge is ambiguity, as multiple code fragments -- or fixes -- can implement the same functionality. In addition, datasets by nature fail to capture the variance introduced by such ambiguities. Therefore, we propose a deep generative model to automatically correct programming errors by learning a distribution over potential fixes. Our model is formulated as a deep conditional variational autoencoder that samples diverse fixes for a given erroneous program. In order to account for ambiguity and the inherent lack of representative datasets, we propose a novel regularizer to encourage the model to generate diverse fixes. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over state-of-the-art approaches by fixing up to $61\%$ of the mistakes.
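The sketch below illustrates how a trained conditional variational autoencoder can be sampled to obtain diverse candidate fixes: the erroneous program is encoded once as the conditioning input, and each draw of the latent variable is decoded into a different fix. The encoder/decoder interface is a hypothetical stand-in for the model described in the paper.

```python
import torch

# Minimal sketch of drawing diverse fixes from a conditional VAE.
# `encoder` and `decoder` are hypothetical trained modules; the token-level
# interface is assumed for illustration only.
def sample_diverse_fixes(encoder, decoder, broken_program_tokens,
                         n_samples=10, latent_dim=64):
    condition = encoder(broken_program_tokens)   # encode the buggy program once
    fixes = []
    for _ in range(n_samples):
        z = torch.randn(1, latent_dim)           # one latent draw per candidate fix
        fixes.append(decoder(z, condition))      # decode a candidate fix
    return fixes
```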