Goto

Collaborating Authors

 fault model


Testing the Fault-Tolerance of Multi-Sensor Fusion Perception in Autonomous Driving Systems

arXiv.org Artificial Intelligence

High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world autonomous driving scenarios, cameras and LiDAR are subject to various faults, which can probably significantly impact the decision-making and behaviors of ADSs. Existing MSF testing approaches only discovered corner cases that the MSF-based perception cannot accurately detected by MSF-based perception, while lacking research on how sensor faults affect the system-level behaviors of ADSs. To address this gap, we conduct the first exploration of the fault tolerance of MSF perception-based ADS for sensor faults. In this paper, we systematically and comprehensively build fault models for cameras and LiDAR in AVs and inject them into the MSF perception-based ADS to test its behaviors in test scenarios. To effectively and efficiently explore the parameter spaces of sensor fault models, we design a feedback-guided differential fuzzer to discover the safety violations of MSF perception-based ADS caused by the injected sensor faults. We evaluate FADE on the representative and practical industrial ADS, Baidu Apollo. Our evaluation results demonstrate the effectiveness and efficiency of FADE, and we conclude some useful findings from the experimental results. To validate the findings in the physical world, we use a real Baidu Apollo 6.0 EDU autonomous vehicle to conduct the physical experiments, and the results show the practical significance of our findings.


SpikeFI: A Fault Injection Framework for Spiking Neural Networks

arXiv.org Artificial Intelligence

Neuromorphic computing and spiking neural networks (SNNs) are gaining traction across various artificial intelligence (AI) tasks thanks to their potential for efficient energy usage and faster computation speed. This comparative advantage comes from mimicking the structure, function, and efficiency of the biological brain, which arguably is the most brilliant and green computing machine. As SNNs are eventually deployed on a hardware processor, the reliability of the application in light of hardware-level faults becomes a concern, especially for safety- and mission-critical applications. In this work, we propose SpikeFI, a fault injection framework for SNNs that can be used for automating the reliability analysis and test generation. SpikeFI is built upon the SLAYER PyTorch framework with fault injection experiments accelerated on a single or multiple GPUs. It has a comprehensive integrated neuron and synapse fault model library, in accordance to the literature in the domain, which is extendable by the user if needed. It supports: single and multiple faults; permanent and transient faults; specified, random layer-wise, and random network-wise fault locations; and pre-, during, and post-training fault injection. It also offers several optimization speedups and built-in functions for results visualization. SpikeFI is open-source and available for download via GitHub at https://github.com/SpikeFI.


PVF (Parameter Vulnerability Factor): A Scalable Metric for Understanding AI Vulnerability Against SDCs in Model Parameters

arXiv.org Artificial Intelligence

Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDC), that can potentially corrupt model parameters. When this occurs during AI inference/servicing, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of the escalating threat, it is crucial to address key questions: How vulnerable are AI models to parameter corruptions, and how do different components (such as modules, layers) of the models exhibit varying vulnerabilities to parameter corruptions? To systematically address this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT), while presenting an in-depth vulnerability analysis on DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency such as mapping vulnerable AI parameter components to well-protected hardware modules. PVF metric is applicable to any AI model and has a potential to help unify and standardize AI vulnerability/resilience evaluation practice.


Exponential Auto-Tuning Fault-Tolerant Control of N Degrees-of-Freedom Manipulators Subject to Torque Constraints

arXiv.org Artificial Intelligence

This paper presents a novel auto-tuning subsystem-based fault-tolerant control (SBFC) system designed for robotic manipulator systems with n degrees of freedom (DoF). It initially proposes a novel model for joint torques, incorporating an actuator fault correction model to account for potential faults and a mathematical saturation function to mitigate issues related to unforeseen excessive torque. This model is designed to prevent the generation of excessive torques even by faulty actuators. Subsequently, a robust subsystem-based adaptive control strategy is proposed to force system states closely along desired trajectories, while tolerating various actuator faults, excessive torques, and unknown modeling errors. Furthermore, optimal SBFC gains are determined by tailoring the JAYA algorithm (JA), a high-performance swarm intelligence technique, standing out for its capacity to optimize without the need for meticulous tuning of algorithm-specific parameters, relying instead on its intrinsic principles. Notably, this control framework ensures uniform exponential stability (UES). The enhancement of accuracy and tracking time for reference trajectories, along with the validation of theoretical assertions, is demonstrated through the presentation of simulation outcomes.


Evaluation of Parameter-based Attacks against Embedded Neural Networks with Laser Injection

arXiv.org Artificial Intelligence

Upcoming certification actions related to the security of machine learning (ML) based systems raise major evaluation challenges that are amplified by the large-scale deployment of models in many hardware platforms. Until recently, most of research works focused on API-based attacks that consider a ML model as a pure algorithmic abstraction. However, new implementation-based threats have been revealed, emphasizing the urgency to propose both practical and simulation-based methods to properly evaluate the robustness of models. A major concern is parameter-based attacks (such as the Bit-Flip Attack - BFA) that highlight the lack of robustness of typical deep neural network models when confronted by accurate and optimal alterations of their internal parameters stored in memory. Setting in a security testing purpose, this work practically reports, for the first time, a successful variant of the BFA on a 32-bit Cortex-M microcontroller using laser fault injection. It is a standard fault injection means for security evaluation, that enables to inject spatially and temporally accurate faults. To avoid unrealistic brute-force strategies, we show how simulations help selecting the most sensitive set of bits from the parameters taking into account the laser fault model.


ISimDL: Importance Sampling-Driven Acceleration of Fault Injection Simulations for Evaluating the Robustness of Deep Learning

arXiv.org Artificial Intelligence

Deep Learning (DL) systems have proliferated in many applications, requiring specialized hardware accelerators and chips. In the nano-era, devices have become increasingly more susceptible to permanent and transient faults. Therefore, we need an efficient methodology for analyzing the resilience of advanced DL systems against such faults, and understand how the faults in neural accelerator chips manifest as errors at the DL application level, where faults can lead to undetectable and unrecoverable errors. Using fault injection, we can perform resilience investigations of the DL system by modifying neuron weights and outputs at the software-level, as if the hardware had been affected by a transient fault. Existing fault models reduce the search space, allowing faster analysis, but requiring a-priori knowledge on the model, and not allowing further analysis of the filtered-out search space. Therefore, we propose ISimDL, a novel methodology that employs neuron sensitivity to generate importance sampling-based fault-scenarios. Without any a-priori knowledge of the model-under-test, ISimDL provides an equivalent reduction of the search space as existing works, while allowing long simulations to cover all the possible faults, improving on existing model requirements. Our experiments show that the importance sampling provides up to 15x higher precision in selecting critical faults than the random uniform sampling, reaching such precision in less than 100 faults. Additionally, we showcase another practical use-case for importance sampling for reliable DNN design, namely Fault Aware Training (FAT). By using ISimDL to select the faults leading to errors, we can insert the faults during the DNN training process to harden the DNN against such faults. Using importance sampling in FAT reduces the overhead required for finding faults that lead to a predetermined drop in accuracy by more than 12x.


Decentralized Source Localization without Sensor Parameters in Wireless Sensor Networks

arXiv.org Machine Learning

This paper studies the source (event) localization problem in decentralized wireless sensor networks (WSNs) under the fault model without knowing the sensor parameters. Event localizations have many applications such as localizing intruders, Wifi hotspots and users, and faults in power systems. Previous studies assume the true knowledge (or good estimates) of sensor parameters (e.g., fault model probability or Region of Influence (ROI) of the source) for source localization. However, we propose two methods to estimate the source location in this paper under the fault model: hitting set approach and feature selection method, which only utilize the noisy data set at the fusion center for estimation of the source location without knowing the sensor parameters. The proposed methods have been shown to localize the source effectively. We also study the lower bound on the sample complexity requirement for hitting set method. These methods have also been extended for multiple sources localizations. In addition, we modify the proposed feature selection approach to use maximum likelihood. Finally, extensive simulations are carried out for different settings (i.e., the number of sensor nodes and sample complexity) to validate our proposed methods in comparison to centroid, maximum likelihood, FTML, SNAP estimators.


FTT-NAS: Discovering Fault-Tolerant Neural Architecture

arXiv.org Machine Learning

With the fast evolvement of embedded deep-learning computing systems, applications powered by deep learning are moving from the cloud to the edge. When deploying neural networks (NNs) onto the devices under complex environments, there are various types of possible faults: soft errors caused by cosmic radiation and radioactive impurities, voltage instability, aging, temperature variations, and malicious attackers. Thus the safety risk of deploying NNs is now drawing much attention. In this paper, after the analysis of the possible faults in various types of NN accelerators, we formalize and implement various fault models from the algorithmic perspective. We propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover convolutional neural network (CNN) architectures that are reliable to various faults in nowadays devices. Then we incorporate fault-tolerant training (FTT) in the search process to achieve better results, which is referred to as FTT-NAS. Experiments on CIFAR-10 show that the discovered architectures outperform other manually designed baseline architectures significantly, with comparable or fewer floating-point operations (FLOPs) and parameters. Specifically, with the same fault settings, F-FTT-Net discovered under the feature fault model achieves an accuracy of 86.2% (VS. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6% (VS. 60.8% achieved by ResNet-20). By inspecting the discovered architectures, we find that the operation primitives, the weight quantization range, the capacity of the model, and the connection pattern have influences on the fault resilience capability of NN models.


ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection

arXiv.org Machine Learning

The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault injection engine, which can mine situations and faults that maximally impact AV safety, as demonstrated on two industry-grade AV technology stacks (from NVIDIA and Baidu). For example, DriveFI found 561 safety-critical faults in less than 4 hours. In comparison, random injection experiments executed over several weeks could not find any safety-critical faults


Exploiting Shared Resource Dependencies in Spectrum Based Plan Diagnosis

AAAI Conferences

In case of a plan failure, plan-repair is a more promising solution than replanning from scratch. The effectiveness of plan-repair depends on knowledge of which plan action failed and why. Therefore, in this paper, we propose an Extended Spectrum Based Diagnosis approach that efficiently pinpoints failed actions. Unlike Model Based Diagnosis (MBD), it does not require the fault models and behavioral descriptions of actions. Our approach first computes the likelihood of an action being faulty and subsequently proposes optimal probe locations to refine the diagnosis. We also exploit knowledge of plan steps that are instances of the same plan operator to optimize the selection of the most informative diagnostic probes. In this paper, we only focus on diagnostic aspect of plan-repair process.