Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
Hariharan, Mohanakrishnan, Arvapalli, Satish, Barma, Seshu, Sheela, Evangeline
We present an approach to software testing automation using Agentic Retrieval-Augmented Generation (RAG) systems for Quality Engineering (QE) artifact creation. We combine autonomous AI agents with hybrid vector-graph knowledge systems to automate test plan, test case, and QE metric generation. The system improves accuracy from 65% to 94.8% while ensuring comprehensive document traceability throughout the quality engineering lifecycle. Experimental validation on enterprise Corporate Systems Engineering and SAP migration projects demonstrates an 85% reduction in testing timeline, an 85% improvement in test suite efficiency, and projected 35% cost savings, resulting in a 2-month acceleration of go-live. Index Terms: agentic systems, retrieval-augmented generation, software testing, quality engineering, multi-agent orchestration, hybrid vector-graph, test automation, SAP testing, enterprise systems. These limitations become particularly pronounced in enterprise software testing, where maintaining traceability between requirements, test cases, and business logic is paramount for regulatory compliance and quality assurance.
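The hybrid vector-graph retrieval idea can be sketched in a few lines: rank artifacts by embedding similarity, then expand each hit along traceability edges so linked requirements and test cases travel together. This is a minimal illustration under assumed data, not the authors' system; the artifact IDs, the 2-d toy embeddings, and the graph are all hypothetical.

```python
import math

# Toy corpus: QE artifacts with hypothetical 2-d embeddings.
artifacts = {
    "REQ-1": [0.9, 0.1],  # a requirement
    "TC-1":  [0.8, 0.2],  # a test case derived from REQ-1
    "TC-2":  [0.1, 0.9],  # an unrelated test case
}

# Graph edges capture requirement -> test-case traceability.
trace_graph = {"REQ-1": ["TC-1"]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_retrieve(query_vec, k=2):
    """Rank by vector similarity, then expand each hit with its graph
    neighbours so traceability-linked artifacts ride along."""
    ranked = sorted(artifacts, key=lambda a: cosine(query_vec, artifacts[a]),
                    reverse=True)
    hits = ranked[:k]
    expanded = list(hits)
    for h in hits:
        for neighbour in trace_graph.get(h, []):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded
```

With `k=1`, a query near REQ-1 returns both REQ-1 (vector hit) and TC-1 (graph expansion), which is the traceability property the abstract emphasizes.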
Large Language Models for Software Testing: A Research Roadmap
Augusto, Cristian, Bertolino, Antonia, De Angelis, Guglielmo, Lonetti, Francesca, Morán, Jesús
Large Language Models (LLMs) are starting to be profiled as one of the most significant disruptions in the Software Testing field. Specifically, they have been successfully applied in software testing tasks such as generating test code or summarizing documentation. This potential has attracted hundreds of researchers, resulting in dozens of new contributions every month and making it hard for researchers to stay at the forefront of the wave. Still, to the best of our knowledge, no prior work has provided a structured vision of the progress and most relevant research trends in LLM-based testing. In this article, we aim to provide a roadmap that illustrates the field's current state, grouping the contributions into different categories, and also sketching the most promising and active research directions for the field. To achieve this objective, we have conducted a semi-systematic literature review, collecting articles and mapping them into the most prominent categories, reviewing the current and ongoing status, and analyzing the open challenges of LLM-based software testing. Lastly, we have outlined several expected long-term impacts of LLMs over the whole software testing field.
Navigating the growing field of research on AI for software testing -- the taxonomy for AI-augmented software testing and an ontology-driven literature survey
In industry, software testing is the primary method to verify and validate the functionality, performance, security, usability, and so on, of software-based systems. Test automation has gained increasing attention in industry over the last decade, following decades of intense research into test automation and model-based testing. However, designing, developing, maintaining and evolving test automation is a considerable effort. Meanwhile, AI's breakthroughs in many engineering fields are opening up new perspectives for software testing, for both manual and automated testing. This paper reviews recent research on AI augmentation in software test automation, from no automation to full automation. It also discusses new forms of testing made possible by AI. Based on this, the newly developed taxonomy, ai4st, is presented and used to classify recent research and identify open research questions.
Breaking Barriers in Software Testing: The Power of AI-Driven Automation
Software testing remains critical for ensuring reliability, yet traditional approaches are slow, costly, and prone to gaps in coverage. This paper presents an AI-driven framework that automates test case generation and validation using natural language processing (NLP), reinforcement learning (RL), and predictive models, embedded within a policy-driven trust and fairness model. The approach translates natural language requirements into executable tests, continuously optimizes them through learning, and validates outcomes with real-time analysis while mitigating bias. Case studies demonstrate measurable gains in defect detection, reduced testing effort, and faster release cycles, showing that AI-enhanced testing improves both efficiency and reliability. By addressing integration and scalability challenges, the framework illustrates how AI can shift testing from a reactive, manual process to a proactive, adaptive system that strengthens software quality in increasingly complex environments.
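The requirements-to-test translation step this framework describes can be made concrete with a toy rule-based version. This is a stand-in for the paper's NLP component, not its actual method; the requirement phrasing pattern and the function names are hypothetical.

```python
import re

def requirement_to_test(req):
    """Translate one narrow requirement pattern into an executable
    assertion. A real framework would use an NLP model; this only
    shows the shape of the transformation."""
    m = re.match(r"(\w+)\((.*)\) shall return (.+)", req)
    if not m:
        raise ValueError("unsupported requirement phrasing")
    fn, args, expected = m.groups()
    return f"assert {fn}({args}) == {expected}"
```

For example, `requirement_to_test("add(2, 3) shall return 5")` yields the test line `assert add(2, 3) == 5`, which a harness could then execute against the implementation.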
On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles
Zhao, Xingyu, Aghazadeh-Chakherlou, Robab, Cheng, Chih-Hong, Popov, Peter, Strigini, Lorenzo
Scenario-based testing has emerged as a common method for autonomous vehicles (AVs) safety assessment, offering a more efficient alternative to mile-based testing by focusing on high-risk scenarios. However, fundamental questions persist regarding its stopping rules, residual risk estimation, debug effectiveness, and the impact of simulation fidelity on safety claims. This paper argues that a rigorous statistical foundation is essential to address these challenges and enable rigorous safety assurance. By drawing parallels between AV testing and established software testing methods, we identify shared research gaps and reusable solutions. We propose proof-of-concept models to quantify the probability of failure per scenario (pfs) and evaluate testing effectiveness under varying conditions. Our analysis reveals that neither scenario-based nor mile-based testing universally outperforms the other. Furthermore, we give an example of formal reasoning about alignment of synthetic and real-world testing outcomes, a first step towards supporting statistically defensible simulation-based safety claims.
The Impact of Software Testing with Quantum Optimization Meets Machine Learning
Modern software systems' complexity challenges efficient testing, as traditional machine learning (ML) struggles with large test suites. This research presents a hybrid framework integrating Quantum Annealing with ML to optimize test case prioritization in CI/CD pipelines. Leveraging quantum optimization, it achieves a 25% increase in defect detection efficiency and a 30% reduction in test execution time versus classical ML, validated on the Defects4J dataset. A simulated CI/CD environment demonstrates robustness across evolving codebases. Visualizations, including defect heatmaps and performance graphs, enhance interpretability. Software testing is integral to ensuring software quality, accounting for 40-50% of development resources in large-scale systems [1]. The rise of microservices, cloud-native architectures, and continuous integration/continuous deployment (CI/CD) practices has intensified the demand for rapid, reliable testing methods [2].
Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges
Harman, Mark, O'Hearn, Peter, Sengupta, Shubho
Despite decades of research and practice in automated software testing, several fundamental concepts remain ill-defined and under-explored, yet offer enormous potential real-world impact. We show that these concepts raise exciting new challenges in the context of Large Language Models for software test generation. More specifically, we formally define and investigate the properties of hardening and catching tests. A hardening test is one that seeks to protect against future regressions, while a catching test is one that catches such a regression or a fault in new functionality introduced by a code change. Hardening tests can be generated at any time and may become catching tests when a future regression is caught. We also define and motivate the Catching 'Just-in-Time' (JiTTest) Challenge, in which tests are generated 'just-in-time' to catch new faults before they land into production. We show that any solution to Catching JiTTest generation can also be repurposed to catch latent faults in legacy code. We enumerate possible outcomes for hardening and catching tests and JiTTests, and discuss open research problems, deployment options, and initial results from our work on automated LLM-based hardening at Meta. This paper was written to accompany the keynote by the authors at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025. Author order is alphabetical. The corresponding author is Mark Harman.
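The hardening/catching distinction defined above can be made concrete with a toy example; the discount function and the regression below are hypothetical, not taken from the paper.

```python
def discount(price, rate):
    """Current production behaviour: clamp the rate to [0, 1]."""
    rate = min(max(rate, 0.0), 1.0)
    return price * (1 - rate)

def regressed_discount(price, rate):
    """A hypothetical future change that drops the clamping."""
    return price * (1 - rate)

def hardening_test(fn):
    """Generated today against current behaviour: pins the clamp,
    i.e. a 150% discount still yields a price of zero."""
    return fn(100.0, 1.5) == 0.0
```

Against today's code the hardening test passes; if the regression lands, the same unchanged test fails, at which point it has become a catching test. This is the promotion the paper describes: hardening tests can be generated at any time and only later earn their keep.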
The Potential of LLMs in Automating Software Testing: From Generation to Reporting
Sherifi, Betim, Slhoub, Khaled, Nembhard, Fitzroy
High-quality software is essential in software engineering, which requires robust validation and verification processes during testing activities. Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering, particularly in areas like requirements analysis, test automation, and debugging. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluations across multiple applications in Python and Java demonstrate the system's high test coverage and efficient operation. This research underscores the potential of LLM-powered agents to streamline software testing workflows while addressing challenges in scalability and accuracy.
Design choices made by LLM-based test generators prevent them from finding bugs
Mathews, Noble Saji, Nagappan, Meiyappan
There is an increasing amount of research and commercial tools for automated test case generation using Large Language Models (LLMs). This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code. Considering bugs are only exposed by failing test cases, we explore the question: can these tools truly achieve the intended objectives of software testing when their test oracles are designed to pass? Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests. These findings raise important questions about the validity of the design behind LLM-based test generation tools and their impact on software quality and test suite reliability.
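The oracle problem this paper highlights is easy to demonstrate: if a generator derives the expected value from the implementation's own output, the resulting test validates whatever the code does, bug included. The buggy function and the generator below are an illustrative sketch, not the tools the paper evaluates.

```python
def buggy_max(xs):
    """Hypothetical buggy implementation: skips the first element."""
    best = xs[1]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

def generate_passing_test(fn, inputs):
    """Mimics a generator whose oracle is designed to pass: the
    expected value is whatever the (possibly buggy) function returns."""
    expected = fn(inputs)
    return lambda: fn(inputs) == expected

# The generated test passes even though buggy_max([9, 1, 2])
# wrongly returns 2 instead of 9.
test = generate_passing_test(buggy_max, [9, 1, 2])
```

A bug-revealing oracle would instead come from an independent source (a specification, a reference implementation, or a human), which is precisely what the critiqued design choices omit.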
On the Effectiveness of LLMs for Manual Test Verifications
Peixoto, Myron David Lucena Campos, Baia, Davy de Medeiros, Nascimento, Nathalia, Alencar, Paulo, Fonseca, Baldoino, Ribeiro, Márcio
Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study involved using 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluate their similarity to original verifications. The second study involved recruiting software testing professionals to assess their perception and agreement with the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications. However, the agreement level among professional testers was slightly above 40%, indicating both promise and room for improvement. While some LLM-generated verifications were considered better than the originals, there were also concerns about AI hallucinations, where verifications significantly deviated from expectations. Conclusion: We contributed by generating a dataset of 37,040 test verifications using 8 different LLMs. Although the models show potential, the relatively modest 40% agreement level highlights the need for further refinement. Enhancing the accuracy, relevance, and clarity of the generated verifications is crucial to ensure greater reliability in real-world testing scenarios.