Coverage criteria


A Coverage-Guided Testing Framework for Quantum Neural Networks

Shao, Minqi, Zhao, Jianjun

arXiv.org Artificial Intelligence

Quantum Neural Networks (QNNs) combine quantum computing and neural networks, leveraging quantum properties such as superposition and entanglement to improve machine learning models. These quantum characteristics enable QNNs to potentially outperform classical neural networks in tasks such as quantum chemistry simulations, optimization problems, and quantum-enhanced machine learning. However, they also introduce significant challenges in verifying the correctness and reliability of QNNs. To address this, we propose QCov, a set of test coverage criteria specifically designed for QNNs to systematically evaluate QNN state exploration during testing, focusing on superposition and entanglement. These criteria help detect quantum-specific defects and anomalies.

Quantum Neural Networks (QNNs) Cong et al. (2018) represent a significant advancement in computational technology, combining the principles of quantum mechanics with neural network mechanisms. By leveraging quantum properties such as superposition and entanglement, QNNs have the potential to solve complex problems more efficiently than classical neural networks, particularly in areas like image classification Li et al. (2022b); Shi et al. (2023); Henderson et al. (2019); Alam et al. (2021) and sequential data learning Bausch (2020); Yu et al. (2024). Despite this early success, similar to Deep Neural Networks (DNNs) LeCun et al. (1998a); He et al. (2015); Howard et al. (2017), QNNs have been shown to be vulnerable to adversarial Lu et al. (2019) and backdoor attacks Chu et al. (2023a;b), raising concerns about their security and robustness. A recent work, QuanTest Shi et al. (2024), introduced the first adversarial testing framework for QNNs, using an entanglement-guided optimization algorithm to generate adversarial inputs and capture erroneous behaviors. However, QuanTest focuses primarily on individual inputs, lacking a comprehensive evaluation of overall test adequacy for QNNs. Additionally, due to the complexity of the Hilbert space, which grows exponentially with the number of qubits, it is impractical to manually test QNNs thoroughly. This highlights the urgent need for a comprehensive testing framework to assess the test adequacy of QNNs. To ensure system quality, numerous testing techniques have been developed for deep learning (DL) systems Zhang et al. (2020); Wang et al. (2024) and traditional quantum software Wang et al. (2021a;b); Fortunato et al. (2022a); Xia et al. (2024) from various perspectives.
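The concrete QCov criteria are defined in the paper itself; as a rough illustration of what state-exploration coverage over superposition and entanglement could look like, the sketch below computes two simple per-state signals (basis-state entropy and the Meyer-Wallach entanglement measure) and buckets them over a test suite of statevectors. The bucket scheme and thresholds are assumptions, not QCov's definitions.

```python
# Illustrative sketch only, not QCov's actual criteria: bucket-based coverage over
# two simple quantum-state signals computed from statevectors with NumPy.
import numpy as np

def superposition_entropy(state):
    """Shannon entropy (bits) of the computational-basis measurement probabilities."""
    p = np.abs(state) ** 2
    p = p[p > 1e-12]
    return float(-(p * np.log2(p)).sum())

def meyer_wallach(state, n_qubits):
    """Meyer-Wallach global entanglement: 2 * (1 - mean single-qubit purity)."""
    psi = state.reshape([2] * n_qubits)
    purities = []
    for k in range(n_qubits):
        mat = np.moveaxis(psi, k, 0).reshape(2, -1)
        rho = mat @ mat.conj().T              # reduced density matrix of qubit k
        purities.append(float(np.real(np.trace(rho @ rho))))
    return 2.0 * (1.0 - sum(purities) / n_qubits)

def bucket_coverage(values, lo, hi, n_buckets=10):
    """Fraction of equal-width buckets over [lo, hi] hit by at least one value."""
    idx = np.clip(((np.asarray(values) - lo) / (hi - lo) * n_buckets).astype(int),
                  0, n_buckets - 1)
    return len(set(idx.tolist())) / n_buckets

# Toy example: coverage reached by 50 random 3-qubit test states.
rng = np.random.default_rng(0)
suite = []
for _ in range(50):
    s = rng.normal(size=8) + 1j * rng.normal(size=8)
    suite.append(s / np.linalg.norm(s))
print("superposition coverage:", bucket_coverage([superposition_entropy(s) for s in suite], 0.0, 3.0))
print("entanglement coverage :", bucket_coverage([meyer_wallach(s, 3) for s in suite], 0.0, 1.0))
```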


Robust Black-box Testing of Deep Neural Networks using Co-Domain Coverage

Gupta, Aishwarya, Saha, Indranil, Rai, Piyush

arXiv.org Artificial Intelligence

Rigorous testing of machine learning models is necessary for trustworthy deployments. We present a novel black-box approach for generating test-suites for robust testing of deep neural networks (DNNs). Most existing methods create test inputs based on maximizing some "coverage" criterion/metric such as a fraction of neurons activated by the test inputs. Such approaches, however, can only analyze each neuron's behavior or each layer's output in isolation and are unable to capture their collective effect on the DNN's output, resulting in test suites that often do not capture the various failure modes of the DNN adequately. These approaches also require white-box access, i.e., access to the DNN's internals (node activations). We present a novel black-box coverage criterion called Co-Domain Coverage (CDC), which is defined as a function of the model's output and thus takes into account its end-to-end behavior. Subsequently, we develop a new fuzz testing procedure named CoDoFuzz, which uses CDC to guide the fuzzing process to generate a test suite for a DNN. We extensively compare the test suite generated by CoDoFuzz with those generated using several state-of-the-art coverage-based fuzz testing methods for the DNNs trained on six publicly available datasets. Experimental results establish the efficiency and efficacy of CoDoFuzz in generating the largest number of misclassified inputs and the inputs for which the model lacks confidence in its decision.
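CDC is defined precisely in the paper as a function of the model's output; the toy sketch below only illustrates the general idea of black-box coverage over the co-domain by partitioning (predicted class, confidence bin) cells and counting how many a test suite hits. The cell partitioning is an assumption for illustration, not the paper's formulation.

```python
# Hedged sketch of output-space ("co-domain") coverage for a black-box classifier.
import numpy as np

def codomain_coverage(softmax_outputs, n_conf_bins=10):
    """softmax_outputs: (N, C) array of model output distributions for the test suite."""
    probs = np.asarray(softmax_outputs)
    n_classes = probs.shape[1]
    pred = probs.argmax(axis=1)                              # predicted class per input
    conf = probs.max(axis=1)                                 # confidence per input
    bins = np.clip((conf * n_conf_bins).astype(int), 0, n_conf_bins - 1)
    covered = {(int(c), int(b)) for c, b in zip(pred, bins)} # hit (class, bin) cells
    return len(covered) / (n_classes * n_conf_bins)

# Usage with any classifier that returns softmax scores (stand-in outputs here):
outputs = np.random.default_rng(1).dirichlet(np.ones(10), size=200)
print("co-domain coverage (toy):", codomain_coverage(outputs))
```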


Effective Black Box Testing of Sentiment Analysis Classification Networks

Karbasizadeh, Parsa, Faghih, Fathiyeh, Golshanrad, Pouria

arXiv.org Artificial Intelligence

Transformer-based neural networks have demonstrated remarkable performance in natural language processing tasks such as sentiment analysis. Nevertheless, the issue of ensuring the dependability of these complicated architectures through comprehensive testing is still open. This paper presents a collection of coverage criteria specifically designed to assess test suites created for transformer-based sentiment analysis networks. Our approach utilizes input space partitioning, a black-box method, by considering emotionally relevant linguistic features such as verbs, adjectives, adverbs, and nouns. In order to effectively produce test cases that encompass a wide range of emotional elements, we utilize the k-projection coverage metric. This metric minimizes the complexity of the problem by examining subsets of k features at the same time, hence reducing dimensionality. Large language models are employed to generate sentences that display specific combinations of emotional features. The findings from experiments obtained from a sentiment analysis dataset illustrate that our criteria and generated tests have led to an average increase of 16% in test coverage. In addition, there is a corresponding average decrease of 6.5% in model accuracy, showing the ability to identify vulnerabilities. Our work provides a foundation for improving the dependability of transformer-based sentiment analysis systems through comprehensive test evaluation.
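As a rough sketch of k-projection coverage over discretized linguistic features: for every subset of k features, count which value combinations the test suite exercises. The feature names and value sets below are invented for illustration and are not the paper's partitioning.

```python
# Hedged sketch of k-projection coverage over assumed emotional/linguistic features.
from itertools import combinations, product

# Each test case is a dict mapping a feature to one of its discrete values.
feature_values = {
    "verb_polarity": ["positive", "negative", "neutral"],
    "adjective_polarity": ["positive", "negative", "neutral"],
    "adverb_intensity": ["weak", "strong"],
    "noun_domain": ["product", "service"],
}

def k_projection_coverage(test_cases, feature_values, k=2):
    scores = []
    for feats in combinations(feature_values, k):
        possible = set(product(*(feature_values[f] for f in feats)))
        covered = {tuple(tc[f] for f in feats) for tc in test_cases}
        scores.append(len(covered & possible) / len(possible))
    return sum(scores) / len(scores)      # average over all k-sized projections

tests = [
    {"verb_polarity": "positive", "adjective_polarity": "negative",
     "adverb_intensity": "strong", "noun_domain": "product"},
    {"verb_polarity": "neutral", "adjective_polarity": "neutral",
     "adverb_intensity": "weak", "noun_domain": "service"},
]
print("2-projection coverage:", k_projection_coverage(tests, feature_values, k=2))
```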


DeepKnowledge: Generalisation-Driven Deep Learning Testing

Missaoui, Sondess, Gerasimou, Simos, Matragkas, Nikolaos

arXiv.org Artificial Intelligence

Despite their unprecedented success, DNNs are notoriously fragile to small shifts in data distribution, demanding effective testing techniques that can assess their dependability. Despite recent advances in DNN testing, there is a lack of systematic testing approaches that assess the DNN's capability to generalise and operate comparably beyond data in their training distribution. We address this gap with DeepKnowledge, a systematic testing methodology for DNN-based systems founded on the theory of knowledge generalisation, which aims to enhance DNN robustness and reduce the residual risk of 'black box' models. Conforming to this theory, DeepKnowledge posits that core computational DNN units, termed Transfer Knowledge neurons, can generalise under domain shift. DeepKnowledge provides an objective confidence measurement on testing activities of DNN given data distribution shifts and uses this information to instrument a generalisation-informed test adequacy criterion to check the transfer knowledge capacity of a test set. Our empirical evaluation of several DNNs, across multiple datasets and state-of-the-art adversarial generation techniques demonstrates the usefulness and effectiveness of DeepKnowledge and its ability to support the engineering of more dependable DNNs. We report improvements of up to 10 percentage points over state-of-the-art coverage criteria for detecting adversarial attacks on several benchmarks, including MNIST, SVHN, and CIFAR.
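DeepKnowledge's actual procedure is grounded in knowledge-generalisation theory and is more involved than what fits here; the toy sketch below only conveys the flavour, flagging "transfer knowledge" candidates as neurons whose activation statistics change little under a domain shift and scoring a test set by how many of them it activates. The stability criterion and threshold are assumptions, not the paper's method.

```python
# Hedged, simplified sketch of a transfer-knowledge-style adequacy score.
import numpy as np

def transfer_knowledge_neurons(acts_in, acts_shift, tol=0.1):
    """acts_*: (N, D) activation matrices on in-distribution / shifted data."""
    drift = np.abs(acts_in.mean(axis=0) - acts_shift.mean(axis=0))
    return np.where(drift < tol)[0]                 # indices of shift-stable neurons

def tk_coverage(test_acts, tk_idx, threshold=0.0):
    """Fraction of transfer-knowledge neurons activated by at least one test input."""
    if len(tk_idx) == 0:
        return 0.0
    activated = (test_acts[:, tk_idx] > threshold).any(axis=0)
    return float(activated.mean())

rng = np.random.default_rng(2)
acts_in, acts_shift = rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
tk_idx = transfer_knowledge_neurons(acts_in, acts_shift)
print("TK coverage (toy):", tk_coverage(rng.normal(size=(100, 64)), tk_idx))
```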


DeepCover: Advancing RNN Test Coverage and Online Error Prediction using State Machine Extraction

Golshanrad, Pouria, Faghih, Fathiyeh

arXiv.org Artificial Intelligence

Recurrent neural networks (RNNs) have emerged as powerful tools for processing sequential data in various fields, including natural language processing and speech recognition. However, the lack of explainability in RNN models has limited their interpretability, posing challenges in understanding their internal workings. To address this issue, this paper proposes a methodology for extracting a state machine (SM) from an RNN-based model to provide insights into its internal function. The proposed SM extraction algorithm was assessed using four newly proposed metrics: Purity, Richness, Goodness, and Scale. The proposed methodology along with its assessment metrics contribute to increasing explainability in RNN models by providing a clear representation of their internal decision making process through the extracted SM. In addition to improving the explainability of RNNs, the extracted SM can be used to advance testing and monitoring of the primary RNN-based model. To enhance RNN testing, we introduce six model coverage criteria based on the extracted SM, serving as metrics for evaluating the effectiveness of test suites designed to analyze the primary model. We also propose a tree-based model to predict the error probability of the primary model for each input based on the extracted SM. We evaluated our proposed online error prediction approach using the MNIST dataset and Mini Speech Commands dataset, achieving an area under the curve (AUC) exceeding 80% for the receiver operating characteristic (ROC) chart.
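The paper defines six SM-based criteria; the sketch below illustrates only two generic ones (state coverage and transition coverage) over the state sequences that an RNN's test inputs induce in an extracted state machine. The data layout is an assumption for illustration.

```python
# Hedged sketch of two simple coverage criteria over an extracted state machine.
def state_coverage(state_sequences, all_states):
    """Fraction of SM states visited by at least one test trace."""
    visited = {s for seq in state_sequences for s in seq}
    return len(visited & set(all_states)) / len(all_states)

def transition_coverage(state_sequences, all_transitions):
    """Fraction of SM transitions taken by at least one test trace."""
    taken = {(a, b) for seq in state_sequences for a, b in zip(seq, seq[1:])}
    return len(taken & set(all_transitions)) / len(all_transitions)

# Example: a 3-state extracted SM and two test traces.
states = ["s0", "s1", "s2"]
transitions = [("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s1", "s1")]
traces = [["s0", "s1", "s2"], ["s0", "s1", "s1"]]
print(state_coverage(traces, states), transition_coverage(traces, transitions))
```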


Automatic Generation of Scenarios for System-level Simulation-based Verification of Autonomous Driving Systems

Goyal, Srajan, Griggio, Alberto, Kimblad, Jacob, Tonetta, Stefano

arXiv.org Artificial Intelligence

With the increasing complexity of Automated Driving Systems (ADS), ensuring their safety and reliability has become a critical challenge. The Verification and Validation (V&V) of these systems are particularly demanding when AI components are employed to implement perception and/or control functions. In the ESA-funded project VIVAS, we developed a generic framework for system-level simulation-based V&V of autonomous systems. The approach is based on a simulation model of the system, an abstract model that describes symbolically the system behavior, and formal methods to generate scenarios and verify the simulation executions. Various coverage criteria can be defined to guide the automated generation of the scenarios. In this paper, we describe the instantiation of the VIVAS framework for an ADS case study. This is based on the integration of CARLA, a widely-used driving simulator, and its ScenarioRunner tool, which enables the creation of diverse and complex driving scenarios. This is also used in the CARLA Autonomous Driving Challenge to validate different ADS agents for perception and control based on AI, shared by the CARLA community. We describe the development of an abstract ADS model and the formulation of a coverage criterion that focuses on the behaviors of vehicles relative to the vehicle with ADS under verification. Leveraging the VIVAS framework, we generate and execute various driving scenarios, thus testing the capabilities of the AI components. The results show the effectiveness of VIVAS in automatically generating scenarios for system-level simulation-based V&V of an automated driving system using CARLA and ScenarioRunner. Therefore, they highlight the potential of the approach as a powerful tool in the future of ADS V&V methodologies.
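The VIVAS coverage criterion is formulated over an abstract ADS model with formal methods; as a much simpler illustration of covering behaviors of other vehicles relative to the ego vehicle, the sketch below just counts which (relative position, manoeuvre) combinations the generated scenarios have exercised. The two parameter sets are assumptions, not the project's abstract model.

```python
# Hedged, combinatorial sketch of scenario coverage over assumed relative-behavior parameters.
from itertools import product

REL_POSITIONS = ["ahead", "behind", "left", "right"]
MANOEUVRES = ["keep_lane", "cut_in", "brake", "accelerate"]

def scenario_coverage(executed_scenarios):
    """executed_scenarios: iterable of (relative_position, manoeuvre) pairs."""
    covered = set(executed_scenarios) & set(product(REL_POSITIONS, MANOEUVRES))
    return len(covered) / (len(REL_POSITIONS) * len(MANOEUVRES))

runs = [("ahead", "brake"), ("left", "cut_in"), ("ahead", "brake")]
print("scenario coverage (toy):", scenario_coverage(runs))
```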


Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion

Yuan, Yuanyuan, Pang, Qi, Wang, Shuai

arXiv.org Artificial Intelligence

Various deep neural network (DNN) coverage criteria have been proposed to assess DNN test inputs and steer input mutations. The coverage is characterized via neurons having certain outputs, or the discrepancy between neuron outputs. Nevertheless, recent research indicates that neuron coverage criteria show little correlation with test suite quality. In general, DNNs approximate distributions, by incorporating hierarchical layers, to make predictions for inputs. Thus, we champion to deduce DNN behaviors based on its approximated distributions from a layer perspective. A test suite should be assessed using its induced layer output distributions. Accordingly, to fully examine DNN behaviors, input mutation should be directed toward diversifying the approximated distributions. This paper summarizes eight design requirements for DNN coverage criteria, taking into account distribution properties and practical concerns. We then propose a new criterion, NeuraL Coverage (NLC), that satisfies all design requirements. NLC treats a single DNN layer as the basic computational unit (rather than a single neuron) and captures four critical properties of neuron output distributions. Thus, NLC accurately describes how DNNs comprehend inputs via approximated distributions. We demonstrate that NLC is significantly correlated with the diversity of a test suite across a number of tasks (classification and generation) and data formats (image and text). Its capacity to discover DNN prediction errors is promising. Test input mutation guided by NLC results in a greater quality and diversity of exposed erroneous behaviors.
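NLC itself captures four properties of each layer's output distribution; the simplified sketch below keeps only the core idea of treating a whole layer (not a neuron) as the unit and using the spread of its output distribution, here the Frobenius norm of the layer-output covariance, as a diversity signal for a test suite. The scoring function is an assumption, not NLC's definition.

```python
# Hedged, layer-wise and distribution-aware sketch inspired by (but not equal to) NLC.
import numpy as np

def layer_covariance_score(layer_outputs):
    """layer_outputs: (N, D) matrix of one layer's outputs on the test suite."""
    centered = layer_outputs - layer_outputs.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(layer_outputs) - 1, 1)
    return float(np.linalg.norm(cov, ord="fro"))    # larger = more spread-out outputs

def suite_diversity(per_layer_outputs):
    """Average the per-layer scores; per_layer_outputs: list of (N, D_l) arrays."""
    return float(np.mean([layer_covariance_score(o) for o in per_layer_outputs]))

rng = np.random.default_rng(3)
narrow = [rng.normal(scale=0.1, size=(200, 32)) for _ in range(3)]   # redundant suite
broad = [rng.normal(scale=1.0, size=(200, 32)) for _ in range(3)]    # diverse suite
print(suite_diversity(narrow), "<", suite_diversity(broad))
```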


An Overview of Structural Coverage Metrics for Testing Neural Networks

Usman, Muhammad, Sun, Youcheng, Gopinath, Divya, Dange, Rishi, Manolache, Luca, Pasareanu, Corina S.

arXiv.org Artificial Intelligence

Deep neural network (DNN) models, including those used in safety-critical domains, need to be thoroughly tested to ensure that they can reliably perform well in different scenarios. In this article, we provide an overview of structural coverage metrics for testing DNN models, including neuron coverage (NC), k-multisection neuron coverage (kMNC), top-k neuron coverage (TKNC), neuron boundary coverage (NBC), strong neuron activation coverage (SNAC) and modified condition/decision coverage (MC/DC). We evaluate the metrics on realistic DNN models used for perception tasks (including LeNet-1, LeNet-4, LeNet-5, and ResNet20) as well as on networks used in autonomy (TaxiNet). We also provide a tool, DNNCov, which can measure the testing coverage for all these metrics. DNNCov outputs an informative coverage report to enable researchers and practitioners to assess the adequacy of DNN testing, compare different coverage measures, and to more conveniently inspect the model's internals during testing.
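Two of the listed metrics are easy to sketch once activations are extracted as an (N, D) matrix; the snippet below follows the usual DeepXplore/DeepGauge-style formulations of NC and NBC, with the activation extraction and thresholds assumed rather than taken from DNNCov.

```python
# Sketch of neuron coverage (NC) and neuron boundary coverage (NBC) over extracted activations.
import numpy as np

def neuron_coverage(acts, threshold=0.0):
    """NC: fraction of neurons activated above `threshold` by at least one test input."""
    return float((acts > threshold).any(axis=0).mean())

def neuron_boundary_coverage(acts, train_low, train_high):
    """NBC: fraction of corner regions (below training min or above training max) hit."""
    upper = (acts > train_high).any(axis=0)
    lower = (acts < train_low).any(axis=0)
    return float((upper.sum() + lower.sum()) / (2 * acts.shape[1]))

rng = np.random.default_rng(4)
train_acts = rng.normal(size=(1000, 128))            # stand-in training activations
test_acts = rng.normal(scale=1.5, size=(50, 128))    # stand-in test activations
print("NC :", neuron_coverage(test_acts))
print("NBC:", neuron_boundary_coverage(test_acts,
                                       train_acts.min(axis=0), train_acts.max(axis=0)))
```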


Black-Box Testing of Deep Neural Networks through Test Case Diversity

Aghababaeyan, Zohreh, Abdellatif, Manel, Briand, Lionel, S, Ramesh, Bagherzadeh, Mojtaba

arXiv.org Artificial Intelligence

Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.

Over the last decade, Deep Neural Networks (DNNs) have achieved great performance in many domains, such as image processing [1], [2], medical diagnostics [3], [4], [5], speech recognition [6] and autonomous driving [7], [8]. However, DNNs can exhibit erroneous behaviours that may lead to critical errors; therefore, like traditional software, DNNs need to be tested effectively to ensure their reliability and safety. In fact, in traditional software systems, testers rely on coverage metrics as they assume that inputs covering the same part of the source code are homogeneous. These assumptions break down in DNN testing: as opposed to code coverage, neuron coverage does not necessarily fully exercise the implicit logic of the model, and full coverage does not ensure fault detection, even though such assumptions have been used to validate the proposed criteria [12], [13], [14], [10], [11].
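As a rough sketch of a black-box, diversity-style adequacy score of the kind the paper studies, the snippet below computes a geometric-diversity-like value as the log-determinant of the cosine-similarity matrix of image feature embeddings. The exact metric definitions and the feature extractor used in the paper may differ; the embeddings here are stand-ins.

```python
# Hedged sketch of a geometric-diversity-style score over feature embeddings.
import numpy as np

def geometric_diversity(features, eps=1e-6):
    """features: (N, D) feature embeddings of the test inputs; higher = more diverse set."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    gram = f @ f.T                                   # cosine-similarity kernel
    sign, logdet = np.linalg.slogdet(gram + eps * np.eye(len(f)))
    return float(logdet)

rng = np.random.default_rng(5)
redundant = np.tile(rng.normal(size=(1, 64)), (20, 1)) + 0.01 * rng.normal(size=(20, 64))
diverse = rng.normal(size=(20, 64))
print(geometric_diversity(redundant), "<", geometric_diversity(diverse))
```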


Distribution Awareness for AI System Testing

Berend, David

arXiv.org Artificial Intelligence

As Deep Learning (DL) is continuously adopted in many safety-critical applications, its quality and reliability start to raise concerns. Similar to the traditional software development process, testing the DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. Although recent progress has been made in designing novel testing techniques for DL software, the distribution of generated test data is not taken into consideration. It is therefore hard to judge whether the identified errors are indeed meaningful errors to the DL application. Therefore, we propose a new OOD-guided testing technique which aims to generate new unseen test cases relevant to the underlying DL system task. Our results show that this technique is able to filter up to 55.44% of error test cases on CIFAR-10 and is 10.05% more effective in enhancing robustness.
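The thesis couples test generation with out-of-distribution detection so that only in-distribution errors are counted as meaningful; as a minimal sketch of that filtering step, the snippet below uses a simple maximum-softmax-probability score as a stand-in detector. The detector choice and threshold are assumptions, not necessarily what the work uses.

```python
# Hedged sketch of OOD-based filtering of generated test cases (MSP stand-in detector).
import numpy as np

def msp_score(softmax_outputs):
    """Maximum softmax probability per input; higher = more in-distribution under MSP."""
    return np.asarray(softmax_outputs).max(axis=1)

def filter_in_distribution(test_cases, softmax_outputs, threshold=0.5):
    """Keep only generated test cases the detector considers in-distribution."""
    keep = msp_score(softmax_outputs) >= threshold
    return [tc for tc, k in zip(test_cases, keep) if k]

rng = np.random.default_rng(6)
outputs = rng.dirichlet(np.ones(10), size=100)       # stand-in model outputs
cases = list(range(100))                              # stand-in generated test cases
print(len(filter_in_distribution(cases, outputs)), "of 100 kept as in-distribution")
```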