AITopics | test suite

Collaborating Authors

test suite

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Neural Attribution for Semantic Bug-Localization in Student Programs

Rahul Gupta, Aditya Kanade, Shirish Shevade

Neural Information Processing SystemsFeb-15-2026, 01:38:50 GMT

Most open online courses on programming makeuse ofautomated grading systems tosupport programming assignments and give real-time feedback. These systems usually rely on test results toquantify theprograms' functional correctness.

artificial intelligence, buggy line, machine learning, (18 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > India > Karnataka > Bengaluru (0.04)

Industry: Education > Educational Setting > Online (0.90)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Add feedback

RankingPolicyDecisions

Neural Information Processing SystemsFeb-8-2026, 12:15:46 GMT

Inarunwith ntimesteps,apolicy will makendecisions on actions totake; we conjecture that only asmall subset of these decisions delivers value over selecting a simple default action. Given atrained policy,we propose anovel black-box method based on statistical fault localisation that ranks thestates oftheenvironment according totheimportance ofdecisions made inthose states. Weargue that among other things, theranked list ofstates can help explain and understand the policy. As the ranking method is statistical, a direct evaluation of its quality is hard.

artificial intelligence, execution, machine learning, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Add feedback

The Geometry of Benchmarks: A New Path Toward AGI

Chojecki, Przemyslaw

arXiv.org Artificial IntelligenceDec-5-2025

Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient $κ$ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for $κ> 0$. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.

artificial intelligence, battery, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2512.04276

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.91)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)

Add feedback

Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

Zhao, Songwen, Wang, Danqing, Zhang, Kexun, Luo, Jiaxuan, Li, Zhuo, Li, Lei

arXiv.org Artificial IntelligenceDec-4-2025

Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SU S VI B E S, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.

large language model, machine learning, programming language, (19 more...)

arXiv.org Artificial Intelligence

2512.03262

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(2 more...)

Add feedback

DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

Pathak, Abhijeet, Barua, Suvadra, Gudimetla, Dinesh, Patir, Rupam, Guo, Jiawei, Hu, Hongxin, Cai, Haipeng

arXiv.org Artificial IntelligenceNov-27-2025

Large language models (LLMs) and autonomous coding agents are increasingly used to generate software across a wide range of domains. Yet a core requirement remains unmet: ensuring that generated code is secure without compromising its functional correctness. Existing benchmarks and evaluations for secure code generation fall short-many measure only vulnerability reduction, disregard correctness preservation, or evaluate security and functionality on separate datasets, violating the fundamental need for simultaneous joint evaluation. We present DUALGAUGE, the first fully automated benchmarking framework designed to rigorously evaluate the security and correctness of LLM-generated code in unison. Given the lack of datasets enabling joint evaluation of secure code generation, we also present DUALGAUGE-BENCH, a curated benchmark suite of diverse coding tasks, each paired with manually validated test suites for both security and functionality, designed for full coverage of specification requirements. At the core of DUALGAUGE is an agentic program executor, which runs a program against given tests in sandboxed environments, and an LLM-based evaluator, which assesses both correctness and vulnerability behavior against expected outcomes. We rigorously evaluated and ensured the quality of DUALGAUGE-BENCH and the accuracy of DUALGAUGE, and applied DUALGAUGE to benchmarking ten leading LLMs on DUALGAUGE-BENCH across thousands of test scenarios. Our results reveal critical gaps in correct and secure code generation by these LLMs, for which our open-source system and datasets help accelerate progress via reproducible, scalable, and rigorous evaluation.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.20709

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

Lops, Andrea, Narducci, Fedelucio, Ragone, Azzurra, Trizio, Michelantonio, Bartolini, Claudio

arXiv.org Artificial IntelligenceNov-27-2025

Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model-generated (LLM) unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.20403

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models

Zadorojniy, Alexander, Wasserkrug, Segev, Farchi, Eitan

arXiv.org Artificial IntelligenceNov-21-2025

Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has became increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling . This method consists of several agents that initially generate a problem-level testing API, then generate tests utilizing this API, and, lastly, generate mutations specific to the optimization model (a well-known software testing technique assessing the fault detection power of the test suite). In this work, we detail this validation framework and show, through experiments, the high quality of validation provided by this agent ensemble in terms of the well-known software testing measure called mutation coverage.

artificial intelligence, optimization model, optimization problem, (17 more...)

arXiv.org Artificial Intelligence

2511.16383

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

Mutation Testing for Industrial Robotic Systems

Santos, Marcela Gonçalves dos, Hallé, Sylvain, Petrillo, Fábio

arXiv.org Artificial IntelligenceNov-19-2025

Industrial robotic systems (IRS) are increasingly deployed in diverse environments, where failures can result in severe accidents and costly downtime. Ensuring the reliability of the software controlling these systems is therefore critical. Mutation testing, a technique widely used in software engineering, evaluates the effectiveness of test suites by introducing small faults, or mutants, into the code. However, traditional mutation operators are poorly suited to robotic programs, which involve message-based commands and interactions with the physical world. This paper explores the adaptation of mutation testing to IRS by defining domain-specific mutation operators that capture the semantics of robot actions and sensor readings. We propose a methodology for generating meaningful mutants at the level of high-level read and write operations, including movement, gripper actions, and sensor noise injection. An empirical study on a pick-and-place scenario demonstrates that our approach produces more informative mutants and reduces the number of invalid or equivalent cases compared to conventional operators. Results highlight the potential of mutation testing to enhance test suite quality and contribute to safer, more reliable industrial robotic systems.

artificial intelligence, mutation operator, operator, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.4204/EPTCS.436.5

2511.14432

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (0.82)

Industry: Government (0.98)

Technology: Information Technology > Artificial Intelligence > Robots > Robots in the Workplace (1.00)

Add feedback

Bridging the Skills Gap: A Course Model for Modern Generative AI Education

Bardach, Anya, Murrah, Hamilton

arXiv.org Artificial IntelligenceNov-18-2025

Research on how the popularization of generative Artificial Intelligence (AI) tools impacts learning environments has led to hesitancy among educators to teach these tools in classrooms, creating two observed disconnects. Generative AI competency is increasingly valued in industry but not in higher education, and students are experimenting with generative AI without formal guidance. The authors argue students across fields must be taught to responsibly and expertly harness the potential of AI tools to ensure job market readiness and positive outcomes. Computer Science trajectories are particularly impacted, and while consistently top ranked U.S. Computer Science departments teach the mechanisms and frameworks underlying AI, few appear to offer courses on applications for existing generative AI tools. A course was developed at a private research university to teach undergraduate and graduate Computer Science students applications for generative AI tools in software development. Two mixed method surveys indicated students overwhelmingly found the course valuable and effective. Co-authored by the instructor and one of the graduate students, this paper explores the context, implementation, and impact of the course through data analysis and reflections from both perspectives. It additionally offers recommendations for replication in and beyond Computer Science departments. This is the extended version of this paper to include technical appendices.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.11757

Genre:

Instructional Material > Course Syllabus & Notes (1.00)
Questionnaire & Opinion Survey (0.93)
Research Report (0.82)

Industry: Education > Educational Setting > Higher Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Add feedback

Filters

Collaborating Authors

test suite

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

48db71587df6c7c442e5b76cc723169a-Paper.pdf

Neural Attribution for Semantic Bug-Localization in Student Programs

RankingPolicyDecisions

The Geometry of Benchmarks: A New Path Toward AGI

Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models

Mutation Testing for Industrial Robotic Systems

Bridging the Skills Gap: A Course Model for Modern Generative AI Education