metamorphic testing
Metamorphic Testing of Multimodal Human Trajectory Prediction
Spieker, Helge, Lazaar, Nadjib, Gotlieb, Arnaud, Belmecheri, Nassim
Context: Predicting human trajectories is crucial for the safety and reliability of autonomous systems, such as automated vehicles and mobile robots. However, rigorously testing the underlying multimodal Human Trajectory Prediction (HTP) models, which typically use multiple input sources (e.g., trajectory history and environment maps) and produce stochastic outputs (multiple possible future paths), presents significant challenges. The primary difficulty lies in the absence of a definitive test oracle, as numerous future trajectories might be plausible for any given scenario. Objectives: This research presents the application of Metamorphic Testing (MT) as a systematic methodology for testing multimodal HTP systems. We address the oracle problem through metamorphic relations (MRs) adapted for the complexities and stochastic nature of HTP. Methods: We present five MRs, targeting transformations of both historical trajectory data and semantic segmentation maps used as an environmental context. These MRs encompass: 1) label-preserving geometric transformations (mirroring, rotation, rescaling) applied to both trajectory and map inputs, where outputs are expected to transform correspondingly. 2) Map-altering transformations (changing semantic class labels, introducing obstacles) with predictable changes in trajectory distributions. We propose probabilistic violation criteria based on distance metrics between probability distributions, such as the Wasserstein or Hellinger distance. Conclusion: This study introduces tool, a MT framework for the oracle-less testing of multimodal, stochastic HTP systems. It allows for assessment of model robustness against input transformations and contextual changes without reliance on ground-truth trajectories.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (7 more...)
- Information Technology (0.69)
- Transportation > Ground > Road (0.68)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
A Digital Twin Framework for Metamorphic Testing of Autonomous Driving Systems Using Generative Model
Zhang, Tony, Kantarci, Burak, Siddique, Umair
Ensuring the safety of self-driving cars remains a major challenge due to the complexity and unpredictability of real-world driving environments. Traditional testing methods face significant limitations, such as the oracle problem, which makes it difficult to determine whether a system's behavior is correct, and the inability to cover the full range of scenarios an autonomous vehicle may encounter. In this paper, we introduce a digital twin-driven metamorphic testing framework that addresses these challenges by creating a virtual replica of the self-driving system and its operating environment. By combining digital twin technology with AI-based image generative models such as Stable Diffusion, our approach enables the systematic generation of realistic and diverse driving scenes. This includes variations in weather, road topology, and environmental features, all while maintaining the core semantics of the original scenario. The digital twin provides a synchronized simulation environment where changes can be tested in a controlled and repeatable manner. Within this environment, we define three metamorphic relations inspired by real-world traffic rules and vehicle behavior. We validate our framework in the Udacity self-driving simulator and demonstrate that it significantly enhances test coverage and effectiveness. Our method achieves the highest true positive rate (0.719), F1 score (0.689), and precision (0.662) compared to baseline approaches. This paper highlights the value of integrating digital twins with AI-powered scenario generation to create a scalable, automated, and high-fidelity testing solution for autonomous vehicle safety.
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (1.00)
- Information Technology > Robotics & Automation (0.88)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models
de Oliveira, Matheus Vinicius da Silva, Silva, Jonathan de Andrade, Fontao, Awdren de Lima
Large Language Models (LLMs) are widely used across multiple domains but continue to raise concerns regarding security and fairness. Beyond known attack vectors such as data poisoning and prompt injection, LLMs are also vulnerable to fairness bugs. These refer to unintended behaviors influenced by sensitive demographic cues (e.g., race or sexual orientation) that should not affect outcomes. Another key issue is hallucination, where models generate plausible yet false information. Retrieval-Augmented Generation (RAG) has emerged as a strategy to mitigate hallucinations by combining external retrieval with text generation. However, its adoption raises new fairness concerns, as the retrieved content itself may surface or amplify bias. This study conducts fairness testing through metamorphic testing (MT), introducing controlled demographic perturbations in prompts to assess fairness in sentiment analysis performed by three Small Language Models (SLMs) hosted on HuggingFace (Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3.1-Nemotron-8B), each integrated into a RAG pipeline. Results show that minor demographic variations can break up to one third of metamorphic relations (MRs). A detailed analysis of these failures reveals a consistent bias hierarchy, with perturbations involving racial cues being the predominant cause of the violations. In addition to offering a comparative evaluation, this work reinforces that the retrieval component in RAG must be carefully curated to prevent bias amplification. The findings serve as a practical alert for developers, testers and small organizations aiming to adopt accessible SLMs without compromising fairness or reliability.
An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
Gogani-Khiabani, Sina, Trivedi, Ashutosh, Saha, Diptikalyan, Tizpaz-Niari, Saeid
Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9-15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.
- North America > United States > Colorado > Boulder County > Boulder (0.14)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (3 more...)
- Law > Taxation Law (1.00)
- Government > Tax (1.00)
- Government > Regional Government > North America Government > United States Government (0.48)
Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning
Racharak, Teeradaj, Ragkhitwetsagul, Chaiyong, Sontesadisai, Chommakorn, Sunetnanta, Thanwadee
In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce a software testing-inspired framework, called MMT4NL, for evaluating the trustworthiness of in-context learning by utilizing adversarial perturbations and software testing techniques. It includes diverse evaluation aspects of linguistic capabilities for testing the ICL capabilities of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just like testing the software. Finally, we demonstrate applications of MMT4NL on the sentiment analysis and question-answering tasks. Our experiments could reveal various linguistic bugs in state-of-the-art LLMs.
- North America > United States (1.00)
- Asia (1.00)
Metamorphic Testing of Deep Code Models: A Systematic Literature Review
Asgari, Ali, de Koning, Milan, Derakhshanfar, Pouria, Panichella, Annibale
Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep-code models as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate models' robustness by applying semantic-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field.
- Europe (1.00)
- North America > United States (0.28)
- Oceania > Australia > Victoria (0.28)
- Overview (1.00)
- Research Report > New Finding (0.67)
ASSURE: Metamorphic Testing for AI-powered Browser Extensions
Gao, Xuanqi, Zhai, Juan, Ma, Shiqing, Xie, Siyi, Shen, Chao
The integration of Large Language Models (LLMs) into browser extensions has revolutionized web browsing, enabling sophisticated functionalities like content summarization, intelligent translation, and context-aware writing assistance. However, these AI-powered extensions introduce unprecedented challenges in testing and reliability assurance. Traditional browser extension testing approaches fail to address the non-deterministic behavior, context-sensitivity, and complex web environment integration inherent to LLM-powered extensions. Similarly, existing LLM testing methodologies operate in isolation from browser-specific contexts, creating a critical gap in effective evaluation frameworks. To bridge this gap, we present ASSURE, a modular automated testing framework specifically designed for AI-powered browser extensions. ASSURE comprises three principal components: (1) a modular test case generation engine that supports plugin-based extension of testing scenarios, (2) an automated execution framework that orchestrates the complex interactions between web content, extension processing, and AI model behavior, and (3) a configurable validation pipeline that systematically evaluates behavioral consistency and security invariants rather than relying on exact output matching. Our evaluation across six widely-used AI browser extensions demonstrates ASSURE's effectiveness, identifying 531 distinct issues spanning security vulnerabilities, metamorphic relation violations, and content alignment problems. ASSURE achieves 6.4x improved testing throughput compared to manual approaches, detecting critical security vulnerabilities within 12.4 minutes on average. This efficiency makes ASSURE practical for integration into development pipelines, offering a comprehensive solution to the unique challenges of testing AI-powered browser extensions.
- North America > United States > District of Columbia > Washington (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
AI-Augmented Metamorphic Testing for Comprehensive Validation of Autonomous Vehicles
Zhang, Tony, Kantarci, Burak, Siddique, Umair
Self-driving cars have the potential to revolutionize transportation, but ensuring their safety remains a significant challenge. These systems must navigate a variety of unexpected scenarios on the road, and their complexity poses substantial difficulties for thorough testing. Conventional testing methodologies face critical limitations, including the oracle problem determining whether the systems behavior is correct and the inability to exhaustively recreate a range of situations a self-driving car may encounter. While Metamorphic Testing (MT) offers a partial solution to these challenges, its application is often limited by simplistic modifications to test scenarios. In this position paper, we propose enhancing MT by integrating AI-driven image generation tools, such as Stable Diffusion, to improve testing methodologies. These tools can generate nuanced variations of driving scenarios within the operational design domain (ODD)for example, altering weather conditions, modifying environmental elements, or adjusting lane markings while preserving the critical features necessary for system evaluation. This approach enables reproducible testing, efficient reuse of test criteria, and comprehensive evaluation of a self-driving systems performance across diverse scenarios, thereby addressing key gaps in current testing practices.
- North America > Canada > Ontario > National Capital Region > Ottawa (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > Austria > Vienna (0.04)
- Automobiles & Trucks (1.00)
- Transportation > Ground > Road (0.70)
- Information Technology > Robotics & Automation (0.70)
- Transportation > Passenger (0.54)
Metamorphic Testing for Pose Estimation Systems
Duran, Matias, Laurent, Thomas, Rushe, Ellen, Ventresque, Anthony
Pose estimation systems are used in a variety of fields, from sports analytics to livestock care. Given their potential impact, it is paramount to systematically test their behaviour and potential for failure. This is a complex task due to the oracle problem and the high cost of manual labelling necessary to build ground truth keypoints. This problem is exacerbated by the fact that different applications require systems to focus on different subjects (e.g., human versus animal) or landmarks (e.g., only extremities versus whole body and face), which makes labelled test data rarely reusable. To combat these problems we propose MET-POSE, a metamorphic testing framework for pose estimation systems that bypasses the need for manual annotation while assessing the performance of these systems under different circumstances. MET-POSE thus allows users of pose estimation systems to assess the systems in conditions that more closely relate to their application without having to label an ad-hoc test dataset or rely only on available datasets, which may not be adapted to their application domain. While we define MET-POSE in general terms, we also present a non-exhaustive list of metamorphic rules that represent common challenges in computer vision applications, as well as a specific way to evaluate these rules. We then experimentally show the effectiveness of MET-POSE by applying it to Mediapipe Holistic, a state of the art human pose estimation system, with the FLIC and PHOENIX datasets. With these experiments, we outline numerous ways in which the outputs of MET-POSE can uncover faults in pose estimation systems at a similar or higher rate than classic testing using hand labelled data, and show that users can tailor the rule set they use to the faults and level of accuracy relevant to their application.
Evaluating Human Trajectory Prediction with Metamorphic Testing
Spieker, Helge, Belmecheri, Nassim, Gotlieb, Arnaud, Lazaar, Nadjib
The prediction of human trajectories is important for planning in autonomous systems that act in the real world, e.g. automated driving or mobile robots. Human trajectory prediction is a noisy process, and no prediction does precisely match any future trajectory. It is therefore approached as a stochastic problem, where the goal is to minimise the error between the true and the predicted trajectory. In this work, we explore the application of metamorphic testing for human trajectory prediction. Metamorphic testing is designed to handle unclear or missing test oracles. It is well-designed for human trajectory prediction, where there is no clear criterion of correct or incorrect human behaviour. Metamorphic relations rely on transformations over source test cases and exploit invariants. A setting well-designed for human trajectory prediction where there are many symmetries of expected human behaviour under variations of the input, e.g. mirroring and rescaling of the input data. We discuss how metamorphic testing can be applied to stochastic human trajectory prediction and introduce the Wasserstein Violation Criterion to statistically assess whether a follow-up test case violates a label-preserving metamorphic relation.
- Europe > Austria > Vienna (0.16)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- (4 more...)
- Transportation > Ground > Road (0.35)
- Information Technology > Robotics & Automation (0.35)
- Automobiles & Trucks (0.35)