Coverage metric


Fuzzing the brain: Automated stress testing for the safety of ML-driven neurostimulation

Downing, Mara, Peng, Matthew, Granley, Jacob, Beyeler, Michael, Bultan, Tevfik

arXiv.org Artificial Intelligence

Objective: Machine learning (ML) models are increasingly used to generate electrical stimulation patterns in neuroprosthetic devices such as visual prostheses. While these models promise precise and personalized control, they also introduce new safety risks when model outputs are delivered directly to neural tissue. We propose a systematic, quantitative approach to detect and characterize unsafe stimulation patterns in ML-driven neurostimulation systems. Approach: We adapt an automated software testing technique known as coverage-guided fuzzing to the domain of neural stimulation. Here, fuzzing performs stress testing by perturbing model inputs and tracking whether resulting stimulation violates biophysical limits on charge density, instantaneous current, or electrode co-activation. The framework treats encoders as black boxes and steers exploration with coverage metrics that quantify how broadly test cases span the space of possible outputs and violation types. Main results: Applied to deep stimulus encoders for the retina and cortex, the method systematically reveals diverse stimulation regimes that exceed established safety limits. Two violation-output coverage metrics identify the highest number and diversity of unsafe outputs, enabling interpretable comparisons across architectures and training strategies. Significance: Violation-focused fuzzing reframes safety assessment as an empirical, reproducible process. By transforming safety from a training heuristic into a measurable property of the deployed model, it establishes a foundation for evidence-based benchmarking, regulatory readiness, and ethical assurance in next-generation neural interfaces.
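
The underlying loop is easy to sketch. Below is a minimal, illustrative Python version of violation-guided fuzzing, assuming a hypothetical black-box `encoder` callable and made-up safety thresholds; the paper's actual limits and coverage metrics are domain-specific and differ in detail.

```python
import numpy as np

# Hypothetical safety limits for illustration only; real limits derive from
# electrode geometry and established charge-density safety standards.
MAX_CURRENT_UA = 100.0      # instantaneous current per electrode (uA)
MAX_MEAN_CURRENT_UA = 35.0  # crude proxy for a charge-density limit
MAX_COACTIVE = 8            # electrodes allowed to fire simultaneously

def violations(stim):
    """Return the set of violation types triggered by a stimulation pattern."""
    v = set()
    if np.max(np.abs(stim)) > MAX_CURRENT_UA:
        v.add("current")
    if np.mean(np.abs(stim)) > MAX_MEAN_CURRENT_UA:
        v.add("charge")
    if np.count_nonzero(np.abs(stim) > 1e-6) > MAX_COACTIVE:
        v.add("coactivation")
    return v

def fuzz(encoder, seed_inputs, n_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    corpus = [np.asarray(s, dtype=float) for s in seed_inputs]
    seen = set()     # coverage: (violation-type set, coarse output bin) pairs
    unsafe = []
    for _ in range(n_iters):
        x = corpus[rng.integers(len(corpus))].copy()
        x += rng.normal(0.0, 0.05, size=x.shape)   # small input perturbation
        stim = encoder(x)                          # black-box model call
        v = violations(stim)
        key = (frozenset(v), int(np.max(np.abs(stim)) // 10))
        if v and key not in seen:                  # new violation/output region
            seen.add(key)
            unsafe.append((x, stim, v))
            corpus.append(x)                       # keep inputs that explore
    return unsafe
```

The coverage key pairs a violation-type set with a coarse output bin, which is the spirit of the violation-output coverage metrics that the abstract credits with finding the most numerous and diverse unsafe outputs.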


BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Andric, Sandro

arXiv.org Artificial Intelligence

Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
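
The global bound follows a standard error-propagation argument; as a sketch in notation of my own choosing (the Lean 4 statement in the paper is more careful about hypotheses), suppose each surrogate block $\hat f_i$ is $L_i$-Lipschitz and satisfies $\|\hat f_i(x) - f_i(x)\| \le \varepsilon_i$ on the covered inputs. Then the stitched model $\hat F = \hat f_n \circ \cdots \circ \hat f_1$ deviates from the original $F = f_n \circ \cdots \circ f_1$ by at most

$$\|\hat F(x) - F(x)\| \;\le\; \sum_{i=1}^{n} \Big(\prod_{j=i+1}^{n} L_j\Big)\,\varepsilon_i,$$

which comes from unrolling the one-step recurrence $E_k \le L_k E_{k-1} + \varepsilon_k$ for the accumulated error $E_k$ after $k$ blocks.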


TB or Not TB: Coverage-Driven Direct Preference Optimization for Verilog Stimulus Generation

Nadimi, Bardia, Filom, Khashayar, Chen, Deming, Zheng, Hao

arXiv.org Artificial Intelligence

With the rapid advancement of Large Language Models (LLMs), there is growing interest in applying them to hardware design and verification. Among the stages of hardware development, design verification remains the most time-consuming and resource-intensive phase, where generating effective stimuli for the design under test (DUT) is both critical and labor-intensive. We present TB or not TB, a framework for automated stimulus generation using LLMs fine-tuned through Coverage-Driven Direct Preference Optimization (CD-DPO). To enable preference-based training, we introduce PairaNet, a dataset derived from PyraNet that pairs high- and low-quality testbenches labeled using simulation-derived coverage metrics. The proposed CD-DPO method integrates quantitative coverage feedback directly into the optimization objective, guiding the model toward generating stimuli that maximize verification coverage. Experiments on the CVDP CID12 benchmark show that TB or not TB outperforms both open-source and commercial baselines, achieving up to 77.27% improvement in code coverage and demonstrating the effectiveness of coverage-driven preference optimization for LLM-based hardware verification.
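
The abstract does not spell out the objective, but one natural way to fold coverage feedback into DPO is to scale the preference margin by the coverage gap between the paired testbenches. A PyTorch sketch, where the margin term and all names are my assumptions rather than the paper's exact CD-DPO loss:

```python
import torch
import torch.nn.functional as F

def cd_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                cov_w, cov_l, beta=0.1, gamma=1.0):
    """Coverage-weighted DPO loss (illustrative only).

    logp_*     : policy log-probs of the preferred (w) / rejected (l) testbench
    ref_logp_* : reference-model log-probs of the same sequences
    cov_*      : simulation coverage achieved by each testbench, in [0, 1]
    """
    # Standard DPO implicit-reward difference between the pair.
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Coverage feedback: demand a larger margin when the coverage gap is large.
    margin = gamma * (cov_w - cov_l)
    return -F.logsigmoid(logits - margin).mean()

# Toy call with fabricated numbers.
loss = cd_dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                   torch.tensor([-13.0]), torch.tensor([-14.5]),
                   cov_w=torch.tensor([0.92]), cov_l=torch.tensor([0.55]))
```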


Enhanced Generative Model Evaluation with Clipped Density and Coverage

Salvy, Nicolas, Talbot, Hugues, Thirion, Bertrand

arXiv.org Machine Learning

Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by the difficulty of reliably evaluating sample quality. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics, Clipped Density and Clipped Coverage. By clipping individual sample contributions and, for fidelity, the radii of nearest-neighbor balls, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics exhibit linear score degradation as the proportion of poor samples increases. They can thus be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in terms of robustness, sensitivity, and interpretability for evaluating generative models.
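
For readers unfamiliar with density/coverage-style metrics, the following numpy sketch shows where the two clipping ideas plug in; the caps chosen here (per-sample contribution at k, ball radii at a quantile) are illustrative assumptions, not the paper's exact definitions or calibration:

```python
import numpy as np
from scipy.spatial.distance import cdist

def clipped_density_coverage(real, fake, k=5, radius_quantile=0.9):
    """Illustrative clipped variants of k-NN density/coverage metrics."""
    d_rr = cdist(real, real)                      # real-to-real distances
    np.fill_diagonal(d_rr, np.inf)
    radii = np.sort(d_rr, axis=1)[:, k - 1]       # k-NN radius per real sample
    # Clip radii so an outlier real point with a huge ball cannot dominate.
    radii = np.minimum(radii, np.quantile(radii, radius_quantile))

    d_fr = cdist(fake, real)                      # fake-to-real distances
    inside = d_fr <= radii[None, :]               # fake j inside ball of real i

    # Density: cap each fake sample's ball count at k so one sample sitting
    # in a dense cluster cannot inflate the aggregate score.
    density = (np.minimum(inside.sum(axis=1), k)).mean() / k

    # Coverage: fraction of real samples whose ball contains >= 1 fake sample.
    coverage = inside.any(axis=0).mean()
    return density, coverage
```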


Coverage Metrics for a Scenario Database for the Scenario-Based Assessment of Automated Driving Systems

de Gelder, Erwin, Buermann, Maren, Camp, Olaf Op den

arXiv.org Artificial Intelligence

Automated Driving Systems (ADSs) have the potential to make mobility services available and safe for all. A multi-pillar Safety Assessment Framework (SAF) has been proposed for the type-approval process of ADSs. The SAF requires that the test scenarios for the ADS adequately cover the Operational Design Domain (ODD) of the ADS. A common method for generating test scenarios involves basing them on scenarios identified and characterized from driving data. This work addresses two questions that arise when collecting scenarios from driving data. First, do the collected scenarios cover all relevant aspects of the ADS' ODD? Second, do the collected scenarios cover all relevant aspects present in the driving data, such that no potentially important situations are missed? This work proposes coverage metrics that provide a quantitative answer to these questions. The proposed coverage metrics are illustrated by means of an experiment in which over 200,000 scenarios from 10 different scenario categories are collected from the HighD data set. The experiment demonstrates that a coverage of 100% can be achieved under certain conditions, and it also identifies which data and scenarios could be added to enhance the coverage outcomes in case 100% coverage has not been achieved. While this work presents metrics for quantifying the coverage of driving data and the identified scenarios, it concludes with future research directions, including the quantification of the completeness of driving data and the identified scenarios.
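
As a toy illustration of the first question (ODD coverage), one can discretize the ODD parameter space into cells and count the fraction hit by the collected scenarios; the paper's metrics are considerably more refined, so treat this Python sketch as an assumption-laden simplification:

```python
import numpy as np

def odd_coverage(scenarios, bins_per_dim):
    """Grid-based toy coverage: fraction of ODD parameter cells hit by at
    least one collected scenario.

    scenarios    : (n, d) array of scenario parameters scaled to [0, 1]
    bins_per_dim : number of bins per parameter dimension
    """
    cells = np.floor(np.clip(scenarios, 0, 1 - 1e-9) * bins_per_dim).astype(int)
    covered = {tuple(c) for c in cells}          # distinct occupied cells
    total = bins_per_dim ** scenarios.shape[1]   # all cells in the ODD grid
    return len(covered) / total
```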


Beyond Random Inputs: A Novel ML-Based Hardware Fuzzing

Rostami, Mohamadreza, Chilese, Marco, Zeitouni, Shaza, Kande, Rahul, Rajendran, Jeyavijayan, Sadeghi, Ahmad-Reza

arXiv.org Artificial Intelligence

Modern computing systems heavily rely on hardware as the root of trust. However, their increasing complexity has given rise to security-critical vulnerabilities that cross-layer attacks can exploit. Traditional hardware vulnerability detection methods, such as random regression and formal verification, have limitations. Random regression, while scalable, is slow at exploring hardware, and formal verification techniques often require substantial manual effort and suffer from state explosion. Hardware fuzzing has emerged as an effective approach to exploring and detecting security vulnerabilities in large-scale designs like modern processors, outperforming traditional methods in coverage, scalability, and efficiency. However, state-of-the-art fuzzers struggle to achieve comprehensive coverage of intricate hardware designs within a practical timeframe, often falling short of a 70% coverage threshold. We propose a novel ML-based hardware fuzzer, ChatFuzz, to address this challenge. Our approach leverages LLMs like ChatGPT to understand processor language, focusing on machine code and generating assembly code sequences. RL is integrated to guide the input generation process by rewarding inputs using code coverage metrics. We use the open-source RISC-V-based RocketCore processor as our testbed. ChatFuzz achieves a condition coverage rate of 75% in just 52 minutes, whereas a state-of-the-art fuzzer requires a lengthy 30-hour window to reach similar condition coverage. Furthermore, our fuzzer can attain 80% coverage when provided with a limited pool of 10 simulation instances/licenses within a 130-hour window. During this time, it executed a total of 199K test cases, of which 6K produced discrepancies with the processor's golden model. Our analysis identified more than 10 unique mismatches, including two new bugs in the RocketCore and discrepancies from the RISC-V ISA Simulator.
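
Stripped of the LLM and the RTL simulator, a coverage-rewarded loop of this kind looks roughly like the skeleton below, where `propose` and `simulate` are hypothetical stand-ins for the ChatGPT-based generator and the RocketCore simulation harness:

```python
def coverage_reward(new_points, seen):
    """Reward a test case by how many previously-unseen coverage points it hits."""
    gained = new_points - seen
    seen |= gained
    return len(gained)

def rl_fuzz(propose, simulate, n_rounds=1000):
    """Skeleton of a coverage-rewarded fuzzing loop (not ChatFuzz itself).

    propose(history)   -> next test case, e.g. an assembly sequence from an LLM
    simulate(testcase) -> set of coverage points the DUT hits on that test case
    """
    seen, history = set(), []
    for _ in range(n_rounds):
        tc = propose(history)
        reward = coverage_reward(simulate(tc), seen)
        history.append((tc, reward))   # feedback that steers future proposals
    return seen, history
```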


Feature Map Testing for Deep Neural Networks

Huang, Dong, Bu, Qingwen, Qing, Yahao, Fu, Yichao, Cui, Heming

arXiv.org Artificial Intelligence

Due to the widespread application of deep neural networks (DNNs) in safety-critical tasks, deep learning testing has drawn increasing attention. During the testing process, test cases that have been fuzzed or selected using test metrics are fed into the model to find fault-inducing test units (e.g., neurons and feature maps, activating which will almost certainly result in a model error) and report them to the DNN developer, who subsequently repairs them (e.g., by retraining the model with the test cases). Current test metrics, however, are primarily concerned with neurons, which means that test cases discovered either by guided fuzzing or by selection with these metrics focus on detecting fault-inducing neurons while failing to detect fault-inducing feature maps. In this work, we propose DeepFeature, which tests DNNs at the feature map level. When testing is conducted, DeepFeature scrutinizes every internal feature map in the model and identifies vulnerable maps that can be repaired to increase the model's overall performance. Exhaustive experiments demonstrate that (1) DeepFeature is a strong tool for detecting the model's vulnerable feature maps; (2) DeepFeature's test case selection has a high fault detection rate and can detect more types of faults (compared to coverage-guided selection techniques, the fault detection rate is increased by 49.32%); and (3) DeepFeature's fuzzer also outperforms current fuzzing techniques and generates valuable test cases more efficiently.
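
The abstract leaves the feature-map analysis at a high level; one simple way to probe feature maps (not necessarily DeepFeature's algorithm) is to hook convolutional layers and compare per-channel activations on misclassified versus correctly classified inputs, as in this PyTorch sketch:

```python
import torch

def feature_map_error_gap(model, layers, loader, device="cpu"):
    """For each channel of the chosen conv layers, compute the gap between
    mean activation on misclassified and correctly classified inputs; large
    gaps flag suspicious feature maps. Illustrative analysis only.
    """
    acts = {}
    hooks = [l.register_forward_hook(
        lambda m, i, o, key=id(l): acts.__setitem__(key, o.detach()))
        for l in layers]
    stats = {id(l): {"err": [], "ok": []} for l in layers}
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            wrong = model(x).argmax(dim=1) != y
            for l in layers:
                # Per-channel mean over spatial dims, assuming 4D conv output.
                ch = acts[id(l)].mean(dim=(2, 3))     # (batch, channels)
                stats[id(l)]["err"].append(ch[wrong])
                stats[id(l)]["ok"].append(ch[~wrong])
    for h in hooks:
        h.remove()
    # NaN entries appear if the loader yields no errors (or no correct hits).
    return {k: (torch.cat(v["err"]).mean(0) - torch.cat(v["ok"]).mean(0))
            for k, v in stats.items()}
```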


An Overview of Structural Coverage Metrics for Testing Neural Networks

Usman, Muhammad, Sun, Youcheng, Gopinath, Divya, Dange, Rishi, Manolache, Luca, Pasareanu, Corina S.

arXiv.org Artificial Intelligence

Deep neural network (DNN) models, including those used in safety-critical domains, need to be thoroughly tested to ensure that they can reliably perform well in different scenarios. In this article, we provide an overview of structural coverage metrics for testing DNN models, including neuron coverage (NC), k-multisection neuron coverage (kMNC), top-k neuron coverage (TKNC), neuron boundary coverage (NBC), strong neuron activation coverage (SNAC) and modified condition/decision coverage (MC/DC). We evaluate the metrics on realistic DNN models used for perception tasks (including LeNet-1, LeNet-4, LeNet-5, and ResNet20) as well as on networks used in autonomy (TaxiNet). We also provide a tool, DNNCov, which can measure the testing coverage for all these metrics. DNNCov outputs an informative coverage report to enable researchers and practitioners to assess the adequacy of DNN testing, compare different coverage measures, and to more conveniently inspect the model's internals during testing.
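
To make the surveyed metrics concrete, here is a toy numpy implementation of three of them over a matrix of per-neuron activations; real tooling such as DNNCov handles layer structure, activation scaling, and corner cases that this sketch ignores:

```python
import numpy as np

def coverage_report(acts_train, acts_test, t=0.5, k=3):
    """Toy NC / kMNC / NBC over activation matrices (rows = inputs,
    columns = neurons), assuming activations already scaled to [0, 1].
    """
    lo, hi = acts_train.min(axis=0), acts_train.max(axis=0)
    n_neurons = acts_test.shape[1]

    # NC: fraction of neurons driven above threshold t by some test input.
    nc = (acts_test > t).any(axis=0).mean()

    # kMNC: fraction of the k sections of each neuron's training range hit.
    edges = np.linspace(lo, hi, k + 1)            # shape (k+1, n_neurons)
    hit = sum(((acts_test >= edges[i]) & (acts_test <= edges[i + 1]))
              .any(axis=0) for i in range(k))
    kmnc = hit.sum() / (k * n_neurons)

    # NBC: fraction of neuron boundaries (below min / above max) exercised.
    nbc = ((acts_test < lo).any(axis=0).sum()
           + (acts_test > hi).any(axis=0).sum()) / (2 * n_neurons)
    return {"NC": nc, "kMNC": kmnc, "NBC": nbc}
```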


Black-Box Testing of Deep Neural Networks through Test Case Diversity

Aghababaeyan, Zohreh, Abdellatif, Manel, Briand, Lionel, S, Ramesh, Bagherzadeh, Mojtaba

arXiv.org Artificial Intelligence

Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box, as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria for effectively guiding the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. The results also confirm suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.
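
The paper's strongest metric is a geometric notion of diversity over image feature embeddings; the sketch below implements one standard formulation (a log-determinant volume score) together with a greedy selector, as a simplified stand-in rather than the paper's exact definition:

```python
import numpy as np

def geometric_diversity(features):
    """Geometric diversity of a test set's feature embeddings: the log-volume
    spanned by their cosine-similarity kernel (a determinantal-point-process
    style score).

    features : (n, d) array, e.g. deep embeddings of the test images
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    s = f @ f.T                           # cosine-similarity kernel
    s += 1e-6 * np.eye(len(f))            # small ridge for numerical stability
    return np.linalg.slogdet(s)[1]        # log det = diversity score

def select_diverse(features, budget):
    """Greedy test selection: repeatedly add the input that most increases
    the diversity of the selected subset."""
    chosen = [0]
    while len(chosen) < budget:
        best = max((i for i in range(len(features)) if i not in chosen),
                   key=lambda i: geometric_diversity(features[chosen + [i]]))
        chosen.append(best)
    return chosen
```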


Learning to Visually Navigate in Photorealistic Environments Without any Supervision

Mezghani, Lina, Sukhbaatar, Sainbayar, Szlam, Arthur, Joulin, Armand, Bojanowski, Piotr

arXiv.org Artificial Intelligence

Learning to navigate in a realistic setting where an agent must rely solely on visual inputs is a challenging task, in part because the lack of position information makes it difficult to provide supervision during training. In this paper, we introduce a novel approach for learning to navigate from image inputs without external supervision or reward. Our approach consists of three stages: learning a good representation of first-person views, then learning to explore using memory, and finally learning to navigate by setting its own goals. The model is trained with intrinsic rewards only so that it can be applied to any environment with image observations. We show the benefits of our approach by training an agent to navigate challenging photorealistic environments from the Gibson dataset with RGB inputs only.