Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks

Bianka Kowalska, Halina Kwaśnicka

Artificial intelligence (AI) is increasingly assisting us in a wide range of tasks, from everyday applications like recommendation systems to high-risk domains such as biometric recognition, autonomous vehicles, and medical diagnosis [1]. In particular, the rise of transformer-based models, such as those used in natural language processing (NLP), has significantly accelerated AI's adoption and visibility in society, enabling breakthroughs in fields like text generation, translation, and image understanding [2]. Meanwhile, the size, complexity, and opacity of deep learning models continue to grow rapidly, outpacing researchers' ability to understand these black boxes. As deep neural networks are deployed in ever more consequential real-world applications, their growing influence, coupled with the opaque, black-box nature of most AI systems, has led to heightened demand for models that are both faithful and explainable. Validating AI's decisions is especially critical in high-risk areas such as law or medicine [3, 4]. As a result, Explainable AI (XAI) emerged as a direct response to companies' and researchers' demands to interpret, explain, and validate neural networks and thereby make AI systems trustworthy. XAI encompasses all methods, approaches, and efforts to uncover the reasoning and behavior of artificial intelligence systems [1]. It is therefore important to establish an understanding of common terms used in the XAI literature, despite the lack of universally accepted definitions.