Goto

Collaborating Authors

 silent failure


Taming Silent Failures: A Framework for Verifiable AI Reliability

arXiv.org Artificial Intelligence

Abstract--The integration of Artificial Intelligence (AI) into safety-critical systems introduces a new reliability paradigm: silent failures, where AI produces confident but incorrect outputs that can be dangerous. This paper introduces the Formal Assurance and Monitoring Environment (FAME), a novel framework that confronts this challenge. FAME synergizes the mathematical rigor of offline formal synthesis with the vigilance of online runtime monitoring to create a verifiable safety net around opaque AI components. We demonstrate its efficacy in an autonomous vehicle perception system, where FAME successfully detected 93.5% of critical safety violations that were otherwise silent. By contextualizing our framework within the ISO 26262 and ISO/P AS 8800 standards, we provide reliability engineers with a practical, certifiable pathway for deploying trustworthy AI. FAME represents a crucial shift from accepting probabilistic performance to enforcing provable safety in next-generation systems. From driver assistance to computer-aided diagnosis (CAD), data-driven components promise superhuman perception and decision support. Y et they also introduce a reliability problem that differs from classical, code-centric software engineering: silent failure, confident outputs that are wrong, with no explicit crash, exception, or error code exposed to the rest of the stack [1], [2]. Safety-critical traditional software is developed under rigorous processes (requirements traceability, design assurance, redundancy, and diagnostics) and can exhibit multiple failure modes (e.g., fail-silent, latent, Byzantine), which are analyzed and mitigated through established standards and verification activities. In contrast, the correctness of learning-enabled components depends on data distributions as much as on code, and can degrade under distribution shift, sensor faults, or occlusions without tripping conventional diagnostics [1]. Standard testing is insufficient, as the input space of production DNNs is hyper-dimensional and cannot be exhaustively exercised [3].


Understanding Silent Failures in Medical Image Classification

arXiv.org Artificial Intelligence

To ensure the reliable use of classification systems in medical applications, it is crucial to prevent silent failures. This can be achieved by either designing classifiers that are robust enough to avoid failures in the first place, or by detecting remaining failures using confidence scoring functions (CSFs). A predominant source of failures in image classification is distribution shifts between training data and deployment data. To understand the current state of silent failure prevention in medical imaging, we conduct the first comprehensive analysis comparing various CSFs in four biomedical tasks and a diverse range of distribution shifts. Based on the result that none of the benchmarked CSFs can reliably prevent silent failures, we conclude that a deeper understanding of the root causes of failures in the data is required. To facilitate this, we introduce SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures. On the basis of various examples, we demonstrate how this tool can help researchers gain insight into the requirements for safe application of classification systems in the medical domain.