ale
Analyzing Error Sources in Global Feature Effect Estimation
Heiß, Timo, Bögel, Coco, Bischl, Bernd, Casalicchio, Giuseppe
Global feature effects such as partial dependence (PD) and accumulated local effects (ALE) plots are widely used to interpret black-box models. However, they are only estimates of true underlying effects, and their reliability depends on multiple sources of error. Despite the popularity of global feature effects, these error sources are largely unexplored. In particular, the practically relevant question of whether to use training or holdout data to estimate feature effects remains unanswered. We address this gap by providing a systematic, estimator-level analysis that disentangles sources of bias and variance for PD and ALE. To this end, we derive a mean-squared-error decomposition that separates model bias, estimation bias, model variance, and estimation variance, and analyze their dependence on model characteristics, data selection, and sample size. We validate our theoretical findings through an extensive simulation study across multiple data-generating processes, learners, estimation strategies (training data, validation data, and cross-validation), and sample sizes. Our results reveal that, while using holdout data is theoretically the cleanest, potential biases arising from the training data are empirically negligible and dominated by the impact of the usually higher sample size. The estimation variance depends on both the presence of interactions and the sample size, with ALE being particularly sensitive to the latter. Cross-validation-based estimation is a promising approach that reduces the model variance component, particularly for overfitting models. Our analysis provides a principled explanation of the sources of error in feature effect estimates and offers concrete guidance on choosing estimation strategies when interpreting machine learning models.
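The training-versus-holdout question can be illustrated with a minimal sketch of the Monte Carlo PD estimator. Everything here is an assumption made for illustration, not the authors' setup: `predict` is a hypothetical stand-in for an already fitted model (it is not retrained, so only the estimation step differs between the two data subsets), and the data are synthetic uniform draws.

```python
import random
from statistics import fmean


def partial_dependence(predict, data, feature, grid):
    """Monte Carlo PD estimate: for each grid value, overwrite `feature`
    in every row and average the resulting predictions."""
    curve = []
    for value in grid:
        preds = []
        for row in data:
            modified = list(row)
            modified[feature] = value
            preds.append(predict(modified))
        curve.append(fmean(preds))
    return curve


def predict(x):
    # Hypothetical fitted model with an interaction between x[0] and x[1].
    return 2.0 * x[0] + x[0] * x[1]


rng = random.Random(0)
sample = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(500)]
train, holdout = sample[:400], sample[400:]  # assumed 80/20 split

grid = [-1.0, 0.0, 1.0]
pd_train = partial_dependence(predict, train, 0, grid)
pd_holdout = partial_dependence(predict, holdout, 0, grid)
```

Because the stand-in model contains an interaction, the two curves differ slightly at the nonzero grid points: each estimate averages over a different finite sample of `x[1]`, and the smaller holdout set averages over fewer rows.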
Formal Abductive Latent Explanations for Prototype-Based Networks
Soria, Jules, Chihani, Zakaria, Girard-Satabin, Julien, Grastien, Alban, Xu-Darme, Romain, Cancila, Daniela
Case-based reasoning networks are machine-learning models that make predictions based on similarity between the input and prototypical parts of training samples, called prototypes. Such models are able to explain each decision by pointing to the prototypes that contributed the most to the final outcome. As the explanation is a core part of the prediction, they are often qualified as "interpretable by design". While promising, we show that such explanations are sometimes misleading, which hampers their usefulness in safety-critical contexts. In particular, several instances may lead to different predictions and yet have the same explanation. Drawing inspiration from the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism to express sufficient conditions on the intermediate (latent) representation of the instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models and the guarantees provided by formal XAI. We propose a solver-free and scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and present the feasibility of our approach on diverse datasets for both standard and fine-grained image classification. The associated code can be found at https://github.com/julsoria/ale
The Formalism-Implementation Gap in Reinforcement Learning Research
The last decade has seen an upswing in interest and adoption of reinforcement learning (RL) techniques, in large part due to their demonstrated capabilities at performing certain tasks at "super-human levels". This has incentivized the community to prioritize research that demonstrates RL agent performance, often at the expense of research aimed at understanding their learning dynamics. Performance-focused research runs the risk of overfitting on academic benchmarks -- thereby rendering them less useful -- which can make it difficult to transfer proposed techniques to novel problems. Further, it implicitly diminishes work that does not push the performance frontier, but aims at improving our understanding of these techniques. This paper argues two points: (i) RL research should stop focusing solely on demonstrating agent capabilities, and focus more on advancing the science and understanding of reinforcement learning; and (ii) we need to be more precise on how our benchmarks map to the underlying mathematical formalisms. We use the popular Arcade Learning Environment (ALE; Bellemare et al., 2013) as an example of a benchmark that, despite being increasingly considered "saturated", can be effectively used for developing this understanding, and facilitating the deployment of RL techniques in impactful real-world problems.
Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models
Chang, Chih-Yu, Chang, Ming-Chung
Recent advances in supervised learning have driven growing interest in explaining black-box models, particularly by estimating the effects of input variables on model predictions. However, existing approaches often face key limitations, including poor scalability, sensitivity to out-of-distribution sampling, and instability under correlated features. To address these issues, we propose A2D2E, an Estimator based on Accelerated Aggregated D-Optimal Designs. Our method leverages principled experimental design to improve efficiency and robustness in main effect estimation. We establish theoretical guarantees, including convergence and variance reduction, and validate A2D2E through extensive simulations. We further demonstrate the potential of the proposed method with a case study on real data and applications in language models. The code to reproduce the results can be found at https://github.com/cchihyu/A2D2E.
A new classification system of beer categories and styles based on large-scale data mining and self-organizing maps of beer recipes
A data-driven quantitative approach was used to develop a novel classification system for beer categories and styles. A total of 62,121 beer recipes were mined and analyzed, considering ingredient profiles, fermentation parameters, and recipe vital statistics. Statistical analyses combined with self-organizing maps (SOMs) identified four major superclusters that showed distinctive malt and hop usage patterns, style characteristics, and historical brewing traditions. Cold fermented styles showed a conservative grain and hop composition, whereas hot fermented beers exhibited high heterogeneity, reflecting regional preferences and innovation. This new taxonomy offers a reproducible and objective framework beyond traditional sensory-based classifications, providing brewers, researchers, and educators with a scalable tool for recipe analysis and beer development. The findings in this work provide an understanding of beer diversity and open avenues for linking ingredient usage with fermentation profiles and flavor outcomes.
A Tale of Two Learning Algorithms: Multiple Stream Random Walk and Asynchronous Gossip
Gholami, Peyman, Seferoglu, Hulya
Although gossip and random walk-based learning algorithms are widely known for decentralized learning, there has been limited theoretical and experimental analysis to understand their relative performance for different graph topologies and data heterogeneity. We first design and analyze a random walk-based learning algorithm with multiple streams (walks), which we name asynchronous "Multi-Walk (MW)". We provide a convergence analysis for MW w.r.t. iteration (computation), wall-clock time, and communication. We also present a convergence analysis for "Asynchronous Gossip", noting the lack of a comprehensive analysis of its convergence, along with the computation and communication overhead, in the literature. Our results show that MW has better convergence in terms of iterations as compared to Asynchronous Gossip in graphs with large diameters (e.g., cycles), while its relative performance, as compared to Asynchronous Gossip, depends on the number of walks and the data heterogeneity in graphs with small diameters (e.g., complete graphs). In wall-clock time analysis, we observe a linear speed-up with the number of walks and nodes in MW and Asynchronous Gossip, respectively. Finally, we show that MW outperforms Asynchronous Gossip in communication overhead, except in small-diameter topologies with extreme data heterogeneity. These results highlight the effectiveness of each algorithm in different graph topologies and data heterogeneity. Our code is available for reproducibility.
Statistical inference using machine learning and classical techniques based on accumulated local effects (ALE)
Accumulated Local Effects (ALE) is a model-agnostic approach for global explanations of the results of black-box machine learning (ML) algorithms. There are at least three challenges with conducting statistical inference based on ALE: ensuring the reliability of ALE analyses, especially in the context of small datasets; intuitively characterizing a variable's overall effect in ML; and making robust inferences from ML data analysis. In response, we introduce innovative tools and techniques for statistical inference using ALE, establishing bootstrapped confidence intervals tailored to dataset size and introducing ALE effect size measures that intuitively indicate effects on both the outcome variable scale and a normalized scale. Furthermore, we demonstrate how to use these tools to draw reliable statistical inferences, reflecting the flexible patterns ALE adeptly highlights, with implementations available in the 'ale' package in R. This work propels the discourse on ALE and its applicability in ML and statistical analysis forward, offering practical solutions to prevailing challenges in the field.
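As a rough illustration of bootstrapped intervals around an ALE curve, the following Python sketch computes a first-order ALE estimate on fixed bin edges and a percentile bootstrap band. It is not the API or algorithm of the 'ale' R package: the function names, the hypothetical linear `predict` model, the bin edges, and the choice to resample only the data (keeping the model fixed) are all assumptions made for illustration.

```python
import random
from bisect import bisect_left
from statistics import fmean


def ale_curve(predict, data, feature, edges):
    """First-order ALE at the bin edges, centered by the count-weighted
    mean of the bin-midpoint values."""
    n_bins = len(edges) - 1
    diffs = [[] for _ in range(n_bins)]
    for row in data:
        # Bin k covers (edges[k], edges[k+1]]; clamp values outside the grid.
        k = min(max(bisect_left(edges, row[feature]) - 1, 0), n_bins - 1)
        lo, hi = list(row), list(row)
        lo[feature], hi[feature] = edges[k], edges[k + 1]
        diffs[k].append(predict(hi) - predict(lo))
    # Average local effect per bin (0.0 for an empty bin), then accumulate.
    acc = [0.0]
    for d in diffs:
        acc.append(acc[-1] + (fmean(d) if d else 0.0))
    total = sum(len(d) for d in diffs)
    center = sum(len(d) * (acc[k] + acc[k + 1]) / 2
                 for k, d in enumerate(diffs)) / total
    return [a - center for a in acc]


def bootstrap_ale(predict, data, feature, edges, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap band for the ALE curve of a fixed model."""
    rng = random.Random(seed)
    curves = [ale_curve(predict, rng.choices(data, k=len(data)), feature, edges)
              for _ in range(n_boot)]
    lower, upper = [], []
    for j in range(len(edges)):
        values = sorted(c[j] for c in curves)
        lower.append(values[int((alpha / 2) * (n_boot - 1))])
        upper.append(values[int((1 - alpha / 2) * (n_boot - 1))])
    return lower, upper


def predict(x):
    # Hypothetical fitted model: linear, so the ALE of x[0] has slope 3.
    return 3.0 * x[0] + x[1]


rng = random.Random(1)
data = [[rng.random(), rng.random()] for _ in range(300)]
edges = [0.0, 0.5, 1.0]  # assumed fixed grid
point = ale_curve(predict, data, 0, edges)
lower, upper = bootstrap_ale(predict, data, 0, edges)
```

Since the model is held fixed, the band reflects only the variability of the ALE estimate under resampling of the data; refitting the model on each bootstrap sample would additionally capture variability from model training.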
DataTales: Investigating the use of Large Language Models for Authoring Data-Driven Articles
Sultanum, Nicole, Srinivasan, Arjun
Authoring data-driven articles is a complex process requiring authors to not only analyze data for insights but also craft a cohesive narrative that effectively communicates the insights. Text generation capabilities of contemporary large language models (LLMs) present an opportunity to assist the authoring of data-driven articles and expedite the writing process. In this work, we investigate the feasibility and perceived value of leveraging LLMs to support authors of data-driven articles. We designed a prototype system, DataTales, that leverages an LLM to generate textual narratives accompanying a given chart. Using DataTales as a design probe, we conducted a qualitative study with 11 professionals to evaluate the concept, from which we distilled affordances and opportunities to further integrate LLMs as valuable data-driven article authoring assistants.