David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation

Bauerfeind, Philipp, Salarpour, Amir, Fernandez, David, MohajerAnsari, Pedram, Reschke, Johannes, Pesé, Mert D.

arXiv.org Artificial Intelligence

Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them with an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of its expert score on local hardware. Retrieval-augmented prompting, Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
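The abstract defines EDIT-COMP as the F1 (harmonic mean) of EDIT-SIM and compilation. A minimal sketch of that combination, assuming both scores are normalized to [0, 1] and using difflib's ratio as a rough stand-in for EDIT-SIM (the paper's exact edit-similarity formulation may differ):

```python
from difflib import SequenceMatcher

def edit_sim(generated: str, reference: str) -> float:
    # Rough stand-in for EDIT-SIM: normalized similarity in [0, 1].
    return SequenceMatcher(None, generated, reference).ratio()

def edit_comp(mean_edit_sim: float, compile_rate: float) -> float:
    # EDIT-COMP as described in the abstract: F1 (harmonic mean) of the
    # dataset-level EDIT-SIM score and the compilation rate.
    if mean_edit_sim + compile_rate == 0:
        return 0.0
    return 2 * mean_edit_sim * compile_rate / (mean_edit_sim + compile_rate)
```

The harmonic mean rewards models that score well on both axes: a model that compiles nothing scores zero regardless of textual similarity, which is what makes it a robust dataset-level proxy.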


L3M+P: Lifelong Planning with Large Language Models

Agarwal, Krish, Jiang, Yuqian, Hu, Jiaheng, Liu, Bo, Stone, Peter

arXiv.org Artificial Intelligence

By combining classical planning methods with large language models (LLMs), recent research such as LLM+P has enabled agents to plan for general tasks given in natural language. However, scaling these methods to general-purpose service robots remains challenging: (1) classical planning algorithms generally require a detailed and consistent specification of the environment, which is not always readily available; and (2) existing frameworks mainly focus on isolated planning tasks, whereas robots are often meant to serve in long-term continuous deployments, and therefore must maintain a dynamic memory of the environment which can be updated with multi-modal inputs and extracted as planning knowledge for future tasks. To address these two issues, this paper introduces L3M+P (Lifelong LLM+P), a framework that uses an external knowledge graph as a representation of the world state. The graph can be updated from multiple sources of information, including sensory input and natural language interactions with humans. L3M+P enforces rules for the expected format of the absolute world state graph to maintain consistency between graph updates. At planning time, given a natural language description of a task, L3M+P retrieves context from the knowledge graph and generates a problem definition for classical planners. Evaluated on household robot simulators and on a real-world service robot, L3M+P achieves significant improvement over baseline methods both on accurately registering natural language state changes and on correctly generating plans, thanks to the knowledge graph retrieval and verification.
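The abstract does not spell out the format of L3M+P's absolute world-state graph or its consistency rules. A toy sketch of the general idea, with a hypothetical triple representation and one invented consistency rule (an object is "in" at most one location at a time):

```python
class WorldState:
    """Minimal world-state graph: (subject, relation, object) triples."""

    def __init__(self):
        self.edges = set()

    def update(self, triple, present=True):
        # Hypothetical consistency rule, standing in for L3M+P's format
        # rules: adding a new "in" edge retracts any previous location.
        subj, rel, obj = triple
        if present:
            if rel == "in":
                self.edges = {t for t in self.edges
                              if not (t[0] == subj and t[1] == "in")}
            self.edges.add(triple)
        else:
            self.edges.discard(triple)

    def query(self, subj, rel):
        # Retrieve objects for a subject/relation pair, e.g. for building
        # a planning problem definition.
        return [o for s, r, o in self.edges if s == subj and r == rel]
```

This is only an illustration of graph updates with enforced consistency; the actual framework also handles multi-modal input and retrieval for classical planners.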


Large Language Models for Combinatorial Optimization: A Systematic Review

Da Ros, Francesca, Soprano, Michael, Di Gaspero, Luca, Roitero, Kevin

arXiv.org Artificial Intelligence

This systematic review explores the application of Large Language Models (LLMs) in Combinatorial Optimization (CO). We report our findings using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We conduct a literature search via Scopus and Google Scholar, examining over 2,000 publications. We assess publications against four inclusion and four exclusion criteria related to their language, research focus, publication year, and type. Eventually, we select 103 studies. We classify these studies into semantic categories and topics to provide a comprehensive overview of the field, including the tasks performed by LLMs, the architectures of LLMs, the existing datasets specifically designed for evaluating LLMs in CO, and the field of application. Finally, we identify future directions for leveraging LLMs in this field.


Bootstrapping Human-Like Planning via LLMs

Porfirio, David, Hsiao, Vincent, Fine-Morris, Morgan, Smith, Leslie, Hiatt, Laura M.

arXiv.org Artificial Intelligence

Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.
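The abstract does not name the metric used to compare generated action sequences against the hand-specified dataset; one plausible sketch is a normalized Levenshtein similarity computed over whole actions rather than characters:

```python
def action_seq_similarity(generated, reference):
    # Levenshtein distance over whole actions, normalized to [0, 1].
    m, n = len(generated), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if generated[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete an action
                           dp[i][j - 1] + 1,      # insert an action
                           dp[i - 1][j - 1] + cost)  # match / substitute
    return 1.0 - dp[m][n] / max(m, n, 1)
```

A sequence that skips one step of a three-step reference task (e.g. omitting "move") would score about 0.67 under this metric; the granularity comparison in the paper is about whether such steps appear at all.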


PyResBugs: A Dataset of Residual Python Bugs for Natural Language-Driven Fault Injection

Cotroneo, Domenico, De Rosa, Giuseppe, Liguori, Pietro

arXiv.org Artificial Intelligence

It mentions modifying the put method and altering the release mechanism, leading to potential issues such as deadlocks or inconsistent states, but avoids specifying exact code lines. This level provides testers with a broader understanding of the fault's behavior and consequences. In the High-Level Description (bottom right), we make the description entirely abstract and omit technical or contextual details about the specific fault. Modifying the put method introduces a "wrong algorithm small sparse modifications fault" in the fault-free function. This description suits scenarios where a conceptual understanding of the fault type is sufficient without providing implementation specifics. A team of six researchers specialized in computer engineering and cybersecurity created and validated the fault descriptions, under the coordination of a full professor with extensive expertise in software testing and fault injection. The professor established the description style, while the postdoctoral researcher, with a PhD in information technologies and a background in AI and fault injection, provided ongoing reviews and feedback. The team, which also included a PhD student in cybersecurity and four M.Sc.


Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations

Renggli, Cedric, Ilyas, Ihab F., Rekatsinas, Theodoros

arXiv.org Artificial Intelligence

In this work, we dive into the fundamental challenges of evaluating Text2SQL solutions and highlight potential failure causes and the potential risks of relying on aggregate metrics in existing benchmarks. We identify two largely unaddressed limitations in current open benchmarks: (1) data quality issues in the evaluation data, mainly attributed to the lack of capturing the probabilistic nature of translating a natural language description into a structured query (e.g., NL ambiguity), and (2) the bias introduced by using different match functions as approximations for SQL equivalence. To put both limitations into context, we propose a unified taxonomy of all Text2SQL limitations that can lead to both prediction and evaluation errors. We then motivate the taxonomy by providing a survey of Text2SQL limitations using state-of-the-art Text2SQL solutions and benchmarks. We describe the causes of limitations with real-world examples and propose potential mitigation solutions for each category in the taxonomy. We conclude by highlighting the open challenges encountered when deploying such mitigation strategies or attempting to automatically apply the taxonomy.
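The match-function bias noted in (2) is easy to see in miniature. The sketch below, on a toy SQLite table, contrasts exact string match with execution match; both only approximate true SQL equivalence and fail in opposite directions:

```python
import sqlite3

def exact_match(pred: str, gold: str) -> bool:
    # Brittle: penalizes harmless formatting and alias differences.
    return pred.strip().lower() == gold.strip().lower()

def execution_match(pred: str, gold: str, conn) -> bool:
    # Compares result sets; can reward queries that agree only
    # coincidentally on this particular database instance.
    try:
        p = conn.execute(pred).fetchall()
        g = conn.execute(gold).fetchall()
    except sqlite3.Error:
        return False
    return sorted(p) == sorted(g)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

gold = "SELECT name FROM users ORDER BY id"
pred = "SELECT u.name FROM users u ORDER BY u.id"
# Same semantics, different surface form:
# execution match succeeds while exact match fails.
```

The table and queries here are invented for illustration; the paper's point is that any single match function introduces systematic evaluation bias.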


Enhancing AI-based Generation of Software Exploits with Contextual Information

Liguori, Pietro, Improta, Cristina, Natella, Roberto, Cukic, Bojan, Cotroneo, Domenico

arXiv.org Artificial Intelligence

This practical experience report explores Neural Machine Translation (NMT) models' capability to generate offensive security code from natural language (NL) descriptions, highlighting the significance of contextual understanding and its impact on model performance. Our study employs a dataset comprising real shellcodes to evaluate the models across various scenarios, including missing information, necessary context, and unnecessary context. The experiments are designed to assess the models' resilience against incomplete descriptions, their proficiency in leveraging context for enhanced accuracy, and their ability to discern irrelevant information. The findings reveal that the introduction of contextual data significantly improves performance. However, the benefits of additional context diminish beyond a certain point, indicating an optimal level of contextual information for model training. Moreover, the models demonstrate an ability to filter out unnecessary context, maintaining high levels of accuracy in the generation of offensive security code. This study paves the way for future research on optimizing context use in AI-driven code generation, particularly for applications requiring a high degree of technical precision such as the generation of offensive code.


Can LLMs Converse Formally? Automatically Assessing LLMs in Translating and Interpreting Formal Specifications

Karia, Rushang, Dobhal, Daksh, Bramblett, Daniel, Verma, Pulkit, Srivastava, Siddharth

arXiv.org Artificial Intelligence

Automatic system synthesis and verification often require specifications to be provided in a formal language such as propositional logic [Haubelt and Feldmann, 2003, Scholl and Becker, 2001]. Typically, human experts serve as middlemen who can (a) translate natural language (NL) specifications of stakeholders into formal syntax, or (b) explain or interpret the system's functionality by translating the system manual into NL. Given the success of Large Language Models (LLMs) in translation tasks [Xue et al., 2021], utilizing LLMs as middlemen can help reduce overall system design costs. Thus, it is vital to develop an evaluation methodology that can assess the capabilities of LLMs in such settings. However, developing such a methodology is quite difficult. First, obtaining high-quality datasets - such as those containing ground-truth data that LLMs have not been trained on - is difficult. As LLMs evolve, the dataset would need to evolve as well, since it would likely be included as part of the next-generation LLMs' training process. Scaling up existing datasets is challenging since they require human annotators to pair NL text with its formal specifications. Finally, the assessment task must consider both directions of translation: formal-to-natural and natural-to-formal.
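For propositional logic specifically, one assessment that can be fully automated is semantic equivalence between a candidate formalization and a gold formula. A minimal truth-table check (illustrative only, not the paper's evaluation harness; the example sentences and formulas are invented):

```python
from itertools import product

def equivalent(f, g, variables):
    # Exhaustive truth-table check: equivalent iff the two formulas
    # agree on every assignment of the variables.
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if f(env) != g(env):
            return False
    return True

# NL spec: "If smoke is detected, the alarm sounds."
gold = lambda e: (not e["smoke"]) or e["alarm"]   # smoke -> alarm
pred = lambda e: (not e["alarm"]) or e["smoke"]   # alarm -> smoke: wrong direction
```

A candidate that reverses the implication, a common NL-to-formal error, is caught immediately (the assignment smoke=True, alarm=False separates the two), which is the kind of automatic check that sidesteps the need for human annotators at assessment time.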


Automating the Correctness Assessment of AI-generated Code for Security Contexts

Cotroneo, Domenico, Foggia, Alessio, Improta, Cristina, Liguori, Pietro, Natella, Roberto

arXiv.org Artificial Intelligence

In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves like a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results with different baseline solutions, including output similarity metrics, widely used in the field, and the well-known ChatGPT, the AI-powered language model developed by OpenAI. Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to human-based evaluation, which is considered the ground truth in the field. Moreover, ACCA correlates strongly with human evaluation (Pearson's correlation coefficient r=0.84 on average). Finally, since it is a fully automated solution that requires no human intervention, the proposed method assesses each code snippet in ~0.17s on average, far below the time human analysts need to manually inspect the code, based on our experience.
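ACCA's core idea, checking that generated code behaves like a reference implementation, can be caricatured in a few lines. The sketch below substitutes exhaustive testing over a set of concrete inputs for symbolic execution, which instead reasons over all feasible paths; the example functions are invented:

```python
def behaviorally_equivalent(candidate, reference, inputs):
    # Simplified stand-in for a symbolic-execution equivalence check:
    # compare observable behavior of the generated code against the
    # reference over concrete inputs. (Symbolic execution generalizes
    # this to all feasible execution paths, not just sampled inputs.)
    for x in inputs:
        try:
            if candidate(x) != reference(x):
                return False
        except Exception:
            return False
    return True

reference = lambda x: x * 2
candidate = lambda x: x << 1   # different syntax, same behavior on ints
```

A similarity metric over the source text would penalize `x << 1` for looking nothing like `x * 2`; a behavioral check accepts it, which is why ACCA tracks human judgment more closely than output-similarity baselines.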


Enhancing Robustness of AI Offensive Code Generators via Data Augmentation

Improta, Cristina, Liguori, Pietro, Natella, Roberto, Cukic, Bojan, Cotroneo, Domenico

arXiv.org Artificial Intelligence

In this work, we present a method that adds perturbations to code descriptions, creating new natural language (NL) inputs that mimic well-intentioned developers whose phrasing diverges from the original through new words or omitted details. The goal is to analyze how and to what extent perturbations affect the performance of AI code generators in the context of security-oriented code. First, we show that the perturbed descriptions preserve the semantics of the original, non-perturbed ones. Then, we use the method to assess the robustness of three state-of-the-art code generators against the newly perturbed inputs, showing that the performance of these AI-based solutions is highly affected by perturbations in the NL descriptions. To enhance their robustness, we use the method to perform data augmentation, i.e., to increase the variability and diversity of the NL descriptions in the training data, proving its effectiveness against both perturbed and non-perturbed code descriptions.
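The perturbation method itself is not specified in the abstract. As one hedged example, a word-deletion transform that drops words from a description, mimicking a developer who omits part of it, could be sketched as follows (the function name, drop probability, and fallback are all illustrative choices, not the paper's):

```python
import random

def perturb(description: str, drop_prob: float = 0.15, seed: int = 0) -> str:
    # Word-deletion perturbation: each word is independently dropped
    # with probability drop_prob. Seeded for reproducible augmentation.
    rng = random.Random(seed)
    words = description.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Fall back to the original if everything was dropped.
    return " ".join(kept) if kept else description
```

Applied to the training set, each description yields several variants, which is the data-augmentation step the abstract describes; the paper additionally requires that perturbed descriptions preserve the original semantics, a property a blind word-dropper does not guarantee on its own.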