analysis tool



Ensuring Functional Correctness of Large Code Models with Selective Generation

Jeong, Jaewoo, Kim, Taesoo, Park, Sangdon

arXiv.org Artificial Intelligence

The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the executable nature of code. Accordingly, we propose a selective code generator that abstains from uncertain generations -- based on the functional correctness evaluated by generated unit tests -- to theoretically control the correctness among non-abstained answers, i.e., the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm FuzzEval. We demonstrate the efficacy of our method along with the controllability of code hallucination and reasonable selection efficiency.
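The selection rule can be illustrated with a minimal sketch: abstain unless a candidate passes at least a threshold fraction of the generated unit tests. The function name, the pass-rate threshold `tau`, and the toy tests below are hypothetical illustrations, not the paper's actual method.

```python
def selective_generate(candidates, unit_tests, tau=0.9):
    """Return the candidate with the highest unit-test pass rate,
    or None (abstain) if no candidate reaches the threshold tau."""
    best = None
    for code in candidates:
        passed = sum(1 for t in unit_tests if t(code))
        rate = passed / len(unit_tests)
        if best is None or rate > best[1]:
            best = (code, rate)
    return best if best is not None and best[1] >= tau else None

# Toy example: generated "unit tests" probe a candidate square function.
tests = [lambda f: f(2) == 4, lambda f: f(3) == 9, lambda f: f(-1) == 1]
square = lambda x: x * x
double = lambda x: x + x
picked = selective_generate([double, square], tests, tau=1.0)
# picked[0] is `square`; given only `double`, the generator abstains
```

Raising `tau` trades selection efficiency (fewer answered queries) for a lower false discovery rate among the non-abstained answers, which is the trade-off the paper controls theoretically.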


Learning to Triage Taint Flows Reported by Dynamic Program Analysis in Node.js Packages

Ni, Ronghao, Yang, Aidan Z. H., Hsu, Min-Chien, Sabino, Nuno, Jia, Limin, Martins, Ruben, Cassel, Darion, Cheang, Kevin

arXiv.org Artificial Intelligence

Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritizing vulnerabilities reported by program analysis tools. We focus on Node.js packages and collect a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNNs and LLMs, trained on data based on a dynamic program analysis tool's output. The top LLM achieves F1 = 0.915, while the best GNN and classical ML models reach F1 = 0.904. At a false-negative rate below 7%, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. If the best model is tuned to operate at a precision level of 0.8 (i.e., allowing 20% false positives among all warnings), our approach can detect 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.
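The "tuned to operate at a precision level of 0.8" step amounts to sweeping a score threshold over ranked reports. A minimal sketch, with hypothetical names and toy data (not the paper's models or benchmark):

```python
def triage_threshold(scores, labels, target_precision=0.8):
    """Flag all reports scoring at or above a threshold; return the
    lowest threshold (and its precision/recall) that still meets the
    target precision, i.e. the largest flagged set that qualifies."""
    pairs = sorted(zip(scores, labels), reverse=True)
    best = None
    tp = fp = 0
    total_pos = sum(labels)
    for score, label in pairs:
        tp += label
        fp += 1 - label
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= target_precision:
            best = (score, precision, recall)
    return best

# Toy scores for five reports; 1 = true vulnerability, 0 = benign.
th, prec, rec = triage_threshold([0.9, 0.8, 0.7, 0.6, 0.2],
                                 [1, 1, 0, 1, 0], 0.8)
```

Lowering the target precision admits more warnings into review but raises recall, which is exactly the trade-off behind the reported 99.2% detection figure.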



Analysing Python Machine Learning Notebooks with Moose

Mignard, Marius, Costiou, Steven, Anquetil, Nicolas, Etien, Anne

arXiv.org Artificial Intelligence

Machine Learning (ML) code, particularly within notebooks, often exhibits lower quality compared to traditional software. Bad practices arise at three distinct levels: general Python coding conventions, the organizational structure of the notebook itself, and ML-specific aspects such as reproducibility and correct API usage. However, existing analysis tools typically focus on only one of these levels and struggle to capture ML-specific semantics, limiting their ability to detect issues. This paper introduces Vespucci Linter, a static analysis tool with multi-level capabilities, built on Moose and designed to address this challenge. Leveraging a metamodeling approach that unifies the notebook's structural elements with Python code entities, our linter enables a more contextualized analysis to identify issues across all three levels. We implemented 22 linting rules derived from the literature and applied our tool to a corpus of 5,000 notebooks from the Kaggle platform. The results reveal violations at all levels, validating the relevance of our multi-level approach and demonstrating Vespucci Linter's potential to improve the quality and reliability of ML development in notebook environments.
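An ML-specific lint rule of the kind described can be sketched as a check over notebook code cells. The rule below (unseeded randomness harms reproducibility) is a hypothetical illustration, not one of Vespucci Linter's 22 actual rules, and the regexes are deliberately simplistic:

```python
import re

def lint_reproducibility(cells):
    """Flag code cells that call randomness-dependent ML APIs without
    fixing a seed -- a toy, hypothetical reproducibility rule."""
    seeded = re.compile(r"random_state\s*=|np\.random\.seed|torch\.manual_seed")
    risky = re.compile(r"train_test_split|RandomForest|shuffle\s*=\s*True")
    warnings = []
    for i, src in enumerate(cells):
        if risky.search(src) and not seeded.search(src):
            warnings.append((i, "unseeded randomness; results may not reproduce"))
    return warnings

cells = ["X_tr, X_te = train_test_split(X, y)",
         "X_tr, X_te = train_test_split(X, y, random_state=0)"]
# Only the first cell is flagged.
```

A real multi-level linter would operate on a unified metamodel of cells and Python entities rather than raw regexes, which is what lets it also check notebook structure and API usage in context.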


QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

Hu, Junze, Jin, Xiangyu, Zeng, Yizhe, Liu, Yuling, Li, Yunpeng, Du, Dan, Xie, Kaiyu, Zhu, Hongsong

arXiv.org Artificial Intelligence

Code auditing, a method where security researchers review source code to identify vulnerabilities, has become increasingly impractical for large-scale open-source projects. While Large Language Models (LLMs) demonstrate impressive code generation capabilities, they are constrained by limitations in context window size, memory capacity, and complex reasoning abilities, making direct vulnerability detection across entire projects infeasible. Static code analysis tools, though effective to a degree, are heavily reliant on their predefined scanning rules. To address these challenges, we present QLPro, a vulnerability detection framework that systematically integrates LLMs with static code analysis tools. QLPro introduces both a triple-voting mechanism and a three-role mechanism to enable fully automated vulnerability detection across entire open-source projects without human intervention. Specifically, QLPro first utilizes static analysis tools to extract all taint specifications from a project, then employs LLMs and the triple-voting mechanism to classify and match these taint specifications, thereby enhancing both the accuracy and appropriateness of taint specification classification.
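A triple-voting mechanism over stochastic LLM classifications can be sketched as a simple majority vote. The function names and abstention behavior below are assumptions for illustration, not QLPro's documented interface:

```python
from collections import Counter

def triple_vote(classify, spec, votes=3):
    """Query a (possibly stochastic) classifier several times on one
    taint specification and keep the strict-majority label, or return
    None when no label wins a majority (abstain)."""
    tally = Counter(classify(spec) for _ in range(votes))
    label, count = tally.most_common(1)[0]
    return label if count > votes // 2 else None

# Mock classifier returning pre-scripted answers for demonstration.
answers = iter(["sink", "source", "sink"])
result = triple_vote(lambda spec: next(answers), "child_process.exec")
# result == "sink" (2 of 3 votes)
```

Voting smooths out single-query LLM variance; a strict-majority requirement lets the pipeline fall back (e.g., re-query or escalate) when the model is inconsistent.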


AI Software Engineer: Programming with Trust

Roychoudhury, Abhik, Pasareanu, Corina, Pradel, Michael, Ray, Baishakhi

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human-driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust. Software engineering is undergoing a significant phase of greater automation owing to the emergence of Large Language Models (LLMs) for code.


Multi-Programming Language Sandbox for LLMs

Dou, Shihan, Zhang, Jiazheng, Zang, Jianxiang, Tao, Yunbo, Zhou, Weikang, Jia, Haoxiang, Liu, Shichun, Yang, Yuming, Xi, Zhiheng, Wu, Shenxi, Zhang, Shaoqing, Wu, Muling, Lv, Changze, Xiong, Limao, Zhan, Wenyu, Zhang, Lin, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang, Wu, Yueming, Wen, Ming, Zheng, Rui, Ji, Tao, Cao, Yixin, Gui, Tao, Qiu, Xipeng, Zhang, Qi, Huang, Xuanjing

arXiv.org Artificial Intelligence

We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compilers and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, then compile and execute it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing development cost. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox.
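The core execute-in-isolation loop can be sketched with the standard library: write the code to a temporary file, dispatch to an interpreter by language, and enforce a timeout. This is a toy stand-in for MPLSandbox's isolated sub-sandboxes, not its actual API; real isolation would add containers or seccomp, which a subprocess alone does not provide.

```python
import os
import subprocess
import tempfile

# Hypothetical language-to-interpreter table; real dispatch would use
# language identification, not just a caller-supplied extension.
LANG = {"py": ["python3"], "js": ["node"], "sh": ["bash"]}

def run_in_sandbox(code, ext, timeout=5):
    """Execute a code snippet in a subprocess with a timeout and
    return (returncode, stdout, stderr) as unified feedback."""
    with tempfile.NamedTemporaryFile("w", suffix="." + ext,
                                     delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(LANG[ext] + [path], capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode, proc.stdout, proc.stderr
    finally:
        os.unlink(path)

rc, out, err = run_in_sandbox("print(1 + 1)", "py")
```

Returning the compiler/runtime output as structured feedback is what lets a training loop score or repair LLM-generated code automatically.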


Unintentional Security Flaws in Code: Automated Defense via Root Cause Analysis

Islam, Nafis Tanveer, Bethany, Mazal, Manuel, Dylan, Jadliwala, Murtuza, Najafirad, Peyman

arXiv.org Artificial Intelligence

Software security remains a critical concern, particularly as junior developers, often lacking comprehensive knowledge of security practices, contribute to codebases. While there are tools to help developers proactively write secure code, their actual effectiveness in helping developers fix their vulnerable code remains largely unmeasured. Moreover, these approaches typically focus on classifying and localizing vulnerabilities without highlighting the specific code segments that are the root cause of the issues, a crucial aspect for developers seeking to fix their vulnerable code. To address these challenges, we conducted a comprehensive study evaluating the efficacy of existing methods in helping junior developers secure their code. Our findings across five types of security vulnerabilities revealed that current tools enabled developers to secure only 36.2% of vulnerable code. Questionnaire results from these participants further indicated that not knowing the code that was the root cause of the vulnerability was one of their primary challenges in repairing the vulnerable code. Informed by these insights, we developed an automated vulnerability root cause (RC) toolkit called T5-RCGCN, which combines T5 language model embeddings with a graph convolutional network (GCN) for vulnerability classification and localization. Additionally, we integrated DeepLiftSHAP to identify the code segments that were the root cause of the vulnerability. We tested T5-RCGCN with 56 junior developers across three datasets, showing a 28.9% improvement in code security compared to previous methods. Developers using the tool also gained a deeper understanding of vulnerability root causes, resulting in a 17.0% improvement in their ability to secure code independently. These results demonstrate the tool's potential for both immediate security enhancement and long-term developer skill growth.
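Once per-line attribution scores exist (e.g., from DeepLiftSHAP), surfacing root-cause segments reduces to ranking lines by attribution magnitude. A minimal sketch with a hypothetical function name and made-up scores, not T5-RCGCN's actual pipeline:

```python
def root_cause_lines(lines, attributions, top_k=2):
    """Rank source lines by absolute attribution score and return the
    top-k as candidate root-cause segments to show the developer."""
    ranked = sorted(range(len(lines)),
                    key=lambda i: abs(attributions[i]), reverse=True)
    return [(i, lines[i]) for i in ranked[:top_k]]

lines = ["int n;", "strcpy(buf, src);", "return 0;"]
# Hypothetical attribution scores for a flagged buffer-overflow sample.
highlights = root_cause_lines(lines, [0.1, 0.9, 0.3], top_k=2)
```

Pointing at the highest-attribution lines, rather than only labeling the file as vulnerable, is the gap the study found developers struggling with.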


Understanding Misconfigurations in ROS: An Empirical Study and Current Approaches

Canelas, Paulo, Schmerl, Bradley, Fonseca, Alcides, Timperley, Christopher S.

arXiv.org Artificial Intelligence

The Robot Operating System (ROS) is a popular framework and ecosystem that allows developers to build robot software systems from reusable, off-the-shelf components. Systems are often built by customizing and connecting components via configuration files. While reusable components theoretically allow rapid prototyping, ensuring proper configuration and connection is challenging, as evidenced by numerous questions on developer forums. Developers must abide by the often unchecked and unstated assumptions of individual components. Failure to do so can result in misconfigurations that are only discovered during field deployment, at which point errors may lead to unpredictable and dangerous behavior. Despite misconfigurations having been studied in the broader context of software engineering, robotics software (and ROS in particular) poses domain-specific challenges with potentially disastrous consequences. To understand and improve the reliability of ROS projects, it is critical to identify the types of misconfigurations faced by developers. To that end, we perform a study of ROS Answers, a Q&A platform, to identify and categorize misconfigurations that occur during ROS development. We then conduct a literature review to assess the coverage of these misconfigurations by existing detection techniques. In total, we find 12 high-level categories and 50 sub-categories of misconfigurations. Of these categories, 27 are not covered by existing techniques. To conclude, we discuss how to tackle those misconfigurations in future work.
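One classic connection misconfiguration is a node subscribing to a topic that nothing publishes (e.g., a remapping typo). A minimal, hypothetical detector over a simplified node description, not a tool from the surveyed literature:

```python
def unmatched_subscriptions(nodes):
    """Report (node, topic) pairs where a node subscribes to a topic
    no node in the system publishes -- a simple connection check."""
    published = {t for n in nodes for t in n.get("publishes", [])}
    issues = []
    for n in nodes:
        for t in n.get("subscribes", []):
            if t not in published:
                issues.append((n["name"], t))
    return issues

nodes = [
    {"name": "camera", "publishes": ["/image_raw"]},
    {"name": "viewer", "subscribes": ["/image_raw", "/depth"]},
]
# Flags ("viewer", "/depth"): no publisher exists for /depth.
```

Catching such mismatches at configuration time, rather than in the field, is precisely the gap the paper's 27 uncovered sub-categories point to.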