regression testing


A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions

Alam, Mejbah, Gottschlich, Justin, Tatbul, Nesime, Turek, Javier S., Mattson, Tim, Muzahid, Abdullah

Neural Information Processing Systems

The field of machine programming (MP), the automation of the development of software, is making notable research advances. This is, in part, due to the emergence of a wide range of novel techniques in machine learning. In this paper, we apply MP to the automation of software performance regression testing. A performance regression is a software performance degradation caused by a code change. We demonstrate AutoPerf's generality and efficacy against 3 types of performance regressions across 10 real performance bugs in 7 benchmark and open-source programs.
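The paper's AutoPerf system trains autoencoders on hardware performance counters from non-regressed runs only; as a rough, stand-alone sketch of that zero-positive idea (learn a profile from negative examples alone, flag large deviations), here is a toy per-metric z-score detector. The counter values, metric choices, and threshold below are illustrative, not from the paper:

```python
import statistics

def fit_normal_profile(runs):
    """Learn per-metric mean/stdev from performance counters of
    non-regressed ("negative") runs only -- the zero-positive setting."""
    metrics = list(zip(*runs))  # transpose: one tuple of samples per metric
    return [(statistics.mean(m), statistics.stdev(m)) for m in metrics]

def regression_score(profile, run):
    """Largest per-metric deviation of a new run from the learned profile."""
    return max(abs(x - mu) / (sigma or 1.0)
               for x, (mu, sigma) in zip(run, profile))

def is_regression(profile, run, threshold=4.0):
    return regression_score(profile, run) > threshold

# Train on runs of the unmodified program; metrics here are an invented
# (cache-miss rate, normalized cycles) pair.
baseline = [[0.11, 1.00], [0.10, 1.02], [0.12, 0.99], [0.11, 1.01]]
profile = fit_normal_profile(baseline)

print(is_regression(profile, [0.11, 1.01]))  # nominal run -> False
print(is_regression(profile, [0.45, 1.60]))  # large deviation -> True
```

A real detector would use a learned nonlinear model (the paper's autoencoder) rather than independent z-scores, but the training signal is the same: normal runs only.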


Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets

Morishige, Masumi, Koshihara, Ryo

arXiv.org Artificial Intelligence

Reproducibility and reliability remain pressing challenges for generative AI systems, whose behavior can drift with each model update or prompt revision. We introduce GPR-bench, a lightweight, extensible benchmark that operationalizes regression testing for general-purpose use cases. GPR-bench couples an open, bilingual (English and Japanese) dataset covering eight task categories (e.g., text generation, code generation, and information retrieval), with 10 scenarios in each category (80 test cases per language), with an automated evaluation pipeline that employs "LLM-as-a-Judge" scoring of correctness and conciseness. Experiments across three recent model versions - gpt-4o-mini, o3-mini, and o4-mini - and two prompt configurations (default versus a concise-writing instruction) reveal heterogeneous quality. Our results show that newer models generally improve correctness, but the differences are modest and not statistically significant, suggesting that GPR-bench may not be sufficiently challenging to differentiate between recent model versions. In contrast, the concise-writing instruction significantly enhances conciseness (+12.37 pp, Mann-Whitney U test: p < 0.001, effect size r = 0.2995) with minimal degradation in accuracy (-1.7 pp), demonstrating the effectiveness of prompt engineering. Released under the MIT License, GPR-bench lowers the barrier to initiating reproducibility monitoring and provides a foundation for community-driven extensions, while also raising important considerations about benchmark design for rapidly evolving language models.
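GPR-bench's actual pipeline runs an LLM judge over its bilingual dataset; the minimal sketch below shows only the shape of such a harness, with a stub model and stub judge standing in for the API calls. All names, prompts, and scoring rules here are invented for illustration:

```python
def run_regression_suite(model, judge, test_cases):
    """Score one model version on a fixed test set; the judge returns
    correctness/conciseness scores in [0, 1] per case."""
    results = [judge(case["prompt"], case["expected"], model(case["prompt"]))
               for case in test_cases]
    n = len(results)
    return {
        "correctness": sum(r["correctness"] for r in results) / n,
        "conciseness": sum(r["conciseness"] for r in results) / n,
    }

# Stubs so the harness runs without any API key; a real suite would call
# the model under test and an LLM-as-a-Judge here.
def stub_model(prompt):
    return {"2+2?": "4", "Capital of Japan?": "Tokyo"}.get(prompt, "")

def stub_judge(prompt, expected, answer):
    return {"correctness": 1.0 if answer == expected else 0.0,
            "conciseness": 1.0 if len(answer) <= 20 else 0.5}

cases = [{"prompt": "2+2?", "expected": "4"},
         {"prompt": "Capital of Japan?", "expected": "Tokyo"}]
scores = run_regression_suite(stub_model, stub_judge, cases)
print(scores)  # both cases pass -> correctness 1.0
```

Rerunning the same suite against each new model version and diffing the aggregate scores is the regression-testing loop the benchmark operationalizes.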


Fuzzy Inference System for Test Case Prioritization in Software Testing

Karatayev, Aron, Ogorodova, Anna, Shamoi, Pakizar

arXiv.org Artificial Intelligence

In the realm of software development, testing is crucial for ensuring software quality and adherence to requirements. However, it can be time-consuming and resource-intensive, especially when dealing with large and complex software systems. Test case prioritization (TCP) is a vital strategy to enhance testing efficiency by identifying the most critical test cases for early execution. This paper introduces a novel fuzzy logic-based approach to automate TCP, using fuzzy linguistic variables and expert-derived fuzzy rules to establish a link between test case characteristics and their prioritization. Our methodology utilizes two fuzzy variables - failure rate and execution time - alongside two crisp parameters: Prerequisite Test Case and Recently Updated Flag. Our findings demonstrate the proposed system's capacity to rank test cases effectively through experimental validation on a real-world software system. The results affirm the practical applicability of our approach in optimizing TCP and reducing the resource intensity of software testing.
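The paper derives its membership functions and rule base from experts; as a hedged illustration of the general mechanism (fuzzify failure rate and execution time, fire rules, defuzzify, then apply the crisp flags), here is a toy scorer whose ramps, rule weights, and flag boosts are all made up:

```python
def high_failure(rate):
    """Ramp membership for 'high failure rate', rate in [0, 1]."""
    return max(0.0, min(1.0, (rate - 0.2) / 0.6))

def short_exec(seconds, horizon=300.0):
    """Ramp membership for 'short execution time'."""
    return max(0.0, 1.0 - seconds / horizon)

def priority(rate, seconds, prerequisite=False, recently_updated=False):
    """Two Mamdani-style rules, defuzzified by a weighted average:
       R1: high failure rate AND short execution -> high priority (1.0)
       R2: high failure rate                     -> medium priority (0.6)
    The crisp flags add fixed boosts, echoing the paper's hybrid design."""
    r1 = min(high_failure(rate), short_exec(seconds))
    r2 = high_failure(rate)
    fuzzy = (r1 * 1.0 + r2 * 0.6) / (r1 + r2) if (r1 + r2) else 0.0
    return min(1.0, fuzzy + 0.2 * prerequisite + 0.1 * recently_updated)

# Rank a small suite: (name, failure rate, execution seconds).
tests = [("t1", 0.9, 30), ("t2", 0.1, 10), ("t3", 0.8, 250)]
ranked = sorted(tests, key=lambda t: priority(t[1], t[2]), reverse=True)
print([name for name, *_ in ranked])  # fast, failure-prone tests first
```

The failure-prone fast test outranks the failure-prone slow one, and the reliable test sinks to the bottom, which is the ordering behavior TCP aims for.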


(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs

Ma, Wanqin, Yang, Chenyang, Kästner, Christian

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.
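The paper argues that exact-match assertions are too brittle for non-deterministic LLM APIs; one common mitigation, sketched here with a stub in place of a real API call (`flaky`, the labels, and the tolerance are purely illustrative), is to vote over repeated samples and compare aggregate pass rates with a tolerance rather than diffing single outputs:

```python
from collections import Counter
import itertools

def majority_label(classify, text, samples=5):
    """Call a non-deterministic classifier several times and take the
    majority vote, rather than asserting on a single completion."""
    votes = Counter(classify(text) for _ in range(samples))
    return votes.most_common(1)[0][0]

def pass_rate(classify, labelled):
    hits = sum(majority_label(classify, t) == y for t, y in labelled)
    return hits / len(labelled)

def regressed(old_rate, new_rate, tolerance=0.02):
    """Flag a regression only when the new version's pass rate drops
    by more than the tolerance."""
    return old_rate - new_rate > tolerance

# Stub: a classifier that answers inconsistently, as a real toxicity
# endpoint might across sampled completions.
flaky = itertools.cycle(["toxic", "toxic", "clean", "toxic", "toxic"]).__next__
data = [("you are awful", "toxic")]
rate = pass_rate(lambda text: flaky(), data)
print(rate)  # majority of 5 flaky samples is "toxic" -> 1.0
```

Voting absorbs sampling noise within one API version; the tolerance then separates genuine cross-version regressions from residual run-to-run variance.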


BotSIM: An End-to-End Bot Simulation Framework for Commercial Task-Oriented Dialog Systems

Wang, Guangsen, Tan, Samson, Joty, Shafiq, Wu, Gang, Au, Jimmy, Hoi, Steven

arXiv.org Artificial Intelligence

We present BotSIM, a data-efficient end-to-end Bot SIMulation toolkit for commercial text-based task-oriented dialog (TOD) systems. BotSIM consists of three major components: 1) a Generator that can infer semantic-level dialog acts and entities from bot definitions and generate user queries via model-based paraphrasing; 2) an agenda-based dialog user Simulator (ABUS) to simulate conversations with the dialog agents; 3) a Remediator to analyze the simulated conversations, visualize the bot health reports and provide actionable remediation suggestions for bot troubleshooting and improvement. We demonstrate BotSIM's effectiveness in end-to-end evaluation, remediation and multi-intent dialog generation via case studies on two commercial bot platforms. BotSIM's "generation-simulation-remediation" paradigm accelerates the end-to-end bot evaluation and iteration process by: 1) reducing manual test case creation effort; 2) enabling a holistic gauge of the bot in terms of NLU and end-to-end performance via extensive dialog simulation; 3) improving the bot troubleshooting process with actionable suggestions. A demo of our system can be found at https://tinyurl.com/mryu74cd and a demo video at https://youtu.be/qLi5iSoly30. We have open-sourced the toolkit at https://github.com/salesforce/botsim
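BotSIM's Generator and Remediator do considerably more than this, but the agenda-based simulation loop at its core can be caricatured in a few lines; the toy keyword bot, the intents, and the report fields below are invented for illustration:

```python
def simulate(agenda, bot):
    """Agenda-based user simulation: the simulated user works through a
    stack of (intent, utterance) goals and logs whether the bot's NLU
    recognized each intent -- the raw material for a bot health report."""
    log = []
    while agenda:
        intent, utterance = agenda.pop()
        predicted = bot(utterance)
        log.append({"intent": intent, "predicted": predicted,
                    "matched": predicted == intent})
    return log

def health_report(log):
    matched = sum(turn["matched"] for turn in log)
    return {"turns": len(log), "nlu_accuracy": matched / len(log)}

# Toy keyword matcher standing in for a commercial TOD system's NLU.
def toy_bot(utterance):
    return "check_balance" if "balance" in utterance else "transfer_funds"

agenda = [("transfer_funds", "send money to Alice"),
          ("check_balance", "what is my balance")]
report = health_report(simulate(agenda, toy_bot))
print(report)  # {'turns': 2, 'nlu_accuracy': 1.0}
```

In BotSIM the agenda's utterances come from model-based paraphrasing of the bot's own intent definitions, so the same loop doubles as a data-efficient test generator.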


Software testing trends: From AI to DevTestOps, what's hot and why

#artificialintelligence

The software development space is extremely volatile and is constantly evolving. In software testing, what works for an organization in the present may not be as effective a few months down the line. As the workloads become more distributed and decentralized, it is harder to test them and ensure quality. Today, organizations require quality at speed. The time it takes for products to reach the market is getting shorter, and testing can sometimes seem more like a hindrance than a necessity.


Regression Testing in Era of Internet of Things and Machine Learning: A practical approach by Abhinandan H Patil Blurb Books

#artificialintelligence

Abhinandan H. Patil is the founder and CTO of a technology firm in Karnataka, India. Before this, he worked as a lead software engineer at a wireless network software organization for close to a decade. He spent 5 years in research, the output of which is available as a book and a thesis through IJSER, USA. He is an active researcher in machine learning, deep learning, data science, artificial intelligence, and regression testing applied to networks, communication, and the Internet of Things, and an active contributor to science, technology, engineering, and mathematics.