Cabañas, Rafael
How do Machine Learning Models Change?
Castaño, Joel, Cabañas, Rafael, Salmerón, Antonio, Lo, David, Martínez-Fernández, Silverio
The proliferation of Machine Learning (ML) models and their open-source implementations has transformed Artificial Intelligence research and applications. Platforms like Hugging Face (HF) enable the development, sharing, and deployment of these models, fostering an evolving ecosystem. While previous studies have examined aspects of models hosted on platforms like HF, a comprehensive longitudinal study of how these models change is still lacking. This study addresses this gap by utilizing both repository mining and longitudinal analysis methods to examine over 200,000 commits and 1,200 releases from over 50,000 models on HF. We replicate and extend an ML change taxonomy for classifying commits and utilize Bayesian networks to uncover patterns in commit and release activities over time. Our findings indicate that commit activities align with established data science methodologies, such as CRISP-DM, emphasizing iterative refinement and continuous improvement. Additionally, release patterns tend to consolidate significant updates, particularly in documentation, distinguishing between granular changes and milestone-based releases. Furthermore, projects with higher popularity prioritize infrastructure enhancements early in their lifecycle, and those with intensive collaboration practices exhibit improved documentation standards. These and other insights enhance the understanding of model changes on community platforms and provide valuable guidance for best practices in model maintenance.
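As a rough illustration of the repository-mining step, the sketch below walks the commit histories of a few Hugging Face model repositories with the huggingface_hub client; the client calls (list_models, list_repo_commits) and the tiny sample size are assumptions made for illustration, not the study's actual pipeline.

    # Illustrative sketch, not the study's pipeline: walk the commit histories of a
    # few Hugging Face model repositories with the huggingface_hub client
    # (list_models / list_repo_commits are assumed to behave as in its public docs).
    from huggingface_hub import HfApi

    api = HfApi()
    for model in api.list_models(limit=5):          # the study itself mined >50,000 models
        for commit in api.list_repo_commits(model.id):
            # Commit metadata (date, title, authors) is what a change taxonomy would classify.
            print(model.id, commit.created_at, commit.title)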
Counterfactual Reasoning with Probabilistic Graphical Models for Analyzing Socioecological Systems
Cabañas, Rafael, Maldonado, Ana D., Morales, María, Aguilera, Pedro A., Salmerón, Antonio
Causal and counterfactual reasoning are emerging directions in data science that allow us to reason about hypothetical scenarios. This is particularly useful in domains where experimental data are usually not available. In the context of environmental and ecological sciences, causality enables us, for example, to predict how an ecosystem would respond to hypothetical interventions. A structural causal model is a class of probabilistic graphical models for causality, which, due to its intuitive nature, can be easily understood by experts in multiple fields. However, certain queries, called unidentifiable, cannot be calculated in an exact and precise manner. This paper proposes applying a recently developed technique for bounding unidentifiable queries within the domain of socioecological systems. Our findings indicate that traditional statistical analysis, including probabilistic graphical models, can identify influences between variables. However, such methods do not offer insights into the nature of the relationship, specifically whether it involves necessity or sufficiency. This is where counterfactual reasoning becomes valuable.
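For reference, the necessity/sufficiency distinction mentioned at the end corresponds to two standard counterfactual quantities, stated here in Pearl's notation (this is background, not a result of the paper):

\[
\mathrm{PN} = P\big(Y_{x'} = y' \mid X = x,\, Y = y\big),
\qquad
\mathrm{PS} = P\big(Y_{x} = y \mid X = x',\, Y = y'\big).
\]

PN (probability of necessity) asks whether the observed outcome would have been avoided had the cause been absent, while PS (probability of sufficiency) asks whether the cause would have produced the outcome when both were absent; in general both are only partially identifiable and can merely be bounded from observational data.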
Efficient Computation of Counterfactual Bounds
Zaffalon, Marco, Antonucci, Alessandro, Cabañas, Rafael, Huber, David, Azzimonti, Dario
We assume to be given structural equations over discrete variables inducing a directed acyclic graph, namely, a structural causal model, together with data about its internal nodes. The question we want to answer is how we can compute bounds for partially identifiable counterfactual queries from such an input. We start by giving a map from structural causal models to credal networks. This allows us to compute exact counterfactual bounds via algorithms for credal nets on a subclass of structural causal models. Exact computation is going to be inefficient in general given that, as we show, causal inference is NP-hard even on polytrees. We then target approximate bounds via a causal EM scheme. We evaluate their accuracy by providing credible intervals on the quality of the approximation; we show through a synthetic benchmark that the EM scheme delivers accurate results in a fair number of runs. In the course of the discussion, we also point out what seems to be a neglected limitation of the trending idea that counterfactual bounds can be computed without knowledge of the structural equations. We also present a real case study on palliative care to show how our algorithms can readily be used for practical purposes.
Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources
Zaffalon, Marco, Antonucci, Alessandro, Cabañas, Rafael, Huber, David
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies, to eventually compute counterfactuals in structural causal models. We start from the case of a single observational dataset affected by a selection bias. We show that the likelihood of the available data has no local maxima. This enables us to use the causal expectation-maximisation scheme to approximate the bounds for partially identifiable counterfactual queries, which are the focus of this paper. We then show how the same approach can address the general case of multiple datasets, no matter whether interventional or observational, biased or unbiased, by remapping it into the former one via graphical transformations. Systematic numerical experiments and a case study on palliative care show the effectiveness of our approach, while hinting at the benefits of fusing heterogeneous data sources to get informative outcomes in case of partial identifiability.
CREPO: An Open Repository to Benchmark Credal Network Algorithms
Cabañas, Rafael, Antonucci, Alessandro
Credal networks are a popular class of imprecise probabilistic graphical models, obtained as a generalization of Bayesian networks based on so-called credal sets of probability mass functions. A Java library called CREMA has been recently released to model, process and query credal networks. Despite the NP-hardness of the (exact) task, a number of algorithms are available to approximate credal network inferences. In this paper we present CREPO, an open repository of synthetic credal networks, provided together with the exact results of inference tasks on these models. A Python tool is also delivered to load these data and interact with CREMA, thus making it extremely easy to evaluate and compare existing and novel inference algorithms. To demonstrate such a benchmarking scheme, we propose an approximate heuristic to be used inside variable elimination schemes to keep a bound on the maximum number of vertices generated during the combination step. A CREPO-based validation against approximate procedures based on linearization and exact techniques performed in CREMA is finally discussed.
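A benchmarking loop of the kind CREPO is meant to support could look roughly like the sketch below; the Task structure and the approximate-inference callable are hypothetical placeholders for illustration, not the actual CREPO/CREMA Python API.

    # Hypothetical sketch of a CREPO-style benchmark loop; the Task structure and the
    # approximate-inference callable are placeholders, not the real CREPO/CREMA API.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class Task:
        name: str
        exact_bounds: Tuple[float, float]   # exact lower/upper posterior probability

    def mean_bound_error(tasks: List[Task],
                         approx: Callable[[Task], Tuple[float, float]]) -> float:
        """Average absolute error of approximate credal bounds w.r.t. the exact ones."""
        errors = []
        for task in tasks:
            lo, hi = approx(task)              # e.g., an approximate algorithm run via CREMA
            exact_lo, exact_hi = task.exact_bounds
            errors.append(max(abs(lo - exact_lo), abs(hi - exact_hi)))
        return sum(errors) / len(errors)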
EM Based Bounding of Unidentifiable Queries in Structural Causal Models
Zaffalon, Marco, Antonucci, Alessandro, Cabañas, Rafael
A structural causal model is made of endogenous (manifest) and exogenous (latent) variables. In a recent paper, it has been shown that endogenous observations induce linear constraints on the probabilities of the exogenous variables. This makes it possible to exactly map a causal model into a credal network. Causal inferences, such as interventions and counterfactuals, can consequently be obtained by standard credal network algorithms. These natively return sharp values in the identifiable case, while intervals corresponding to the exact bounds are produced for unidentifiable queries. In this paper we present an approximate characterization of the constraints on the exogenous probabilities. This is based on a specialization of the EM algorithm to the treatment of the missing values in the exogenous observations. Multiple EM runs can consequently be used to describe the causal model as a set of Bayesian networks and, hence, as a credal network to be queried for the bounding of unidentifiable queries. Preliminary empirical tests show how this approach might provide good inner bounds with relatively few runs. This is a promising direction for causal analysis in models whose topology prevents a straightforward specification of the credal mapping.
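A minimal, self-contained sketch of the multiple-runs idea on a toy model (binary treatment and outcome, a four-state exogenous variable selecting the mechanism) is given below; it illustrates the general scheme, not the paper's code or experimental setup.

    # Illustrative sketch: inner-bounding an unidentifiable counterfactual query by
    # running EM from several random starts. Toy SCM: binary X -> binary Y, with
    # exogenous U in {0,1,2,3} selecting the mechanism f(x, u) that maps X to Y.
    import numpy as np

    rng = np.random.default_rng(0)

    # The four deterministic mechanisms x -> y, indexed by u.
    F = np.array([[0, 0],    # u=0: y = 0
                  [0, 1],    # u=1: y = x
                  [1, 0],    # u=2: y = 1 - x
                  [1, 1]])   # u=3: y = 1

    # Observational data from a hidden ground-truth exogenous distribution.
    true_pu = np.array([0.2, 0.4, 0.1, 0.3])
    n = 5000
    x = rng.binomial(1, 0.6, size=n)
    u = rng.choice(4, size=n, p=true_pu)
    y = F[u, x]

    def em_run(x, y, iters=200):
        """EM over P(U) with the mechanisms known and U fully latent."""
        pu = rng.dirichlet(np.ones(4))                    # random initialisation
        for _ in range(iters):
            # E-step: posterior of U per sample, proportional to P(u) * 1[f(x,u) = y].
            lik = (F[:, x] == y).astype(float)            # shape (4, n)
            post = pu[:, None] * lik
            post /= post.sum(axis=0, keepdims=True)
            # M-step: new P(U) is the average posterior.
            pu = post.mean(axis=1)
        return pu

    def prob_necessity(pu):
        # PN = P(Y_{X=0}=0 | X=1, Y=1): given X=1, Y=1, U is in {1, 3};
        # the outcome would have been 0 under X=0 only for u=1.
        return pu[1] / (pu[1] + pu[3])

    estimates = [prob_necessity(em_run(x, y)) for _ in range(20)]
    print("inner bounds on PN:", min(estimates), max(estimates))

Each EM run converges to some exogenous distribution compatible with the observed data, and the spread of the query values across runs gives an inner approximation of the exact bounds.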
Structural Causal Models Are (Solvable by) Credal Networks
Zaffalon, Marco, Antonucci, Alessandro, Cabañas, Rafael
A structural causal model is made of endogenous (manifest) and exogenous (latent) variables. We show that endogenous observations induce linear constraints on the probabilities of the exogenous variables. This makes it possible to exactly map a causal model into a credal network. Causal inferences, such as interventions and counterfactuals, can consequently be obtained by standard algorithms for the updating of credal nets. These natively return sharp values in the identifiable case, while intervals corresponding to the exact bounds are produced for unidentifiable queries. A characterization of the causal models that allow the map above to be compactly derived is given, along with a discussion about the scalability for general models. This contribution should be regarded as a systematic approach to represent structural causal models by credal networks and hence to systematically compute causal inferences. A number of demonstrative examples are presented to clarify our methodology. Extensive experiments show that approximate algorithms for credal networks can immediately be used to do causal inference in real-size problems.
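One way to see where the linear constraints come from is the canonical Markovian specification in which the exogenous parent U of an endogenous variable X indexes its deterministic mechanism; every conditional probability estimated from endogenous observations then satisfies

\[
P\big(x \mid \mathrm{pa}(X)\big) \;=\; \sum_{u} P(u)\,\mathbb{1}\big[f_X(\mathrm{pa}(X), u) = x\big],
\]

which is linear in the exogenous marginals P(u). The feasible distributions P(U) therefore form a polytope, i.e., a credal set, and attaching one such set to each exogenous node is what yields the credal network mentioned above.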
InferPy: Probabilistic Modeling with Deep Neural Networks Made Easy
Cózar, Javier, Cabañas, Rafael, Salmerón, Antonio, Masegosa, Andrés R.
Probabilistic models with deep neural networks can generate data samples using probabilistic constructs that include NNs. This has had a strong impact within the deep learning community, as it allows many unsupervised learning problems to be addressed; see [2] for a recent review of these models. Along these lines, a new set of software tools has appeared, built on top of standard deep learning frameworks, in order to accommodate probabilistic models containing NNs [5, 3, 4]. These tools usually fall under the umbrella term of probabilistic programming languages (PPLs) [7] and provide support for reasoning about complex probabilistic models; examples include Edward2/TFP [8, 3] and Pyro [4]. The main features of InferPy are: (i) its simple API allows easy prototyping of probabilistic models including NNs; (ii) unlike Edward2/TFP, it does not require a strong background in the available inference methods (variational inference [1, 2] and Monte Carlo methods [9]), as many details are hidden from the user; (iii) parallelization details are also hidden from the user, and InferPy runs seamlessly on CPUs and GPUs. InferPy can be seen as an upper layer for working with Edward2/TFP; thus, models that can be defined in InferPy are those that can be defined using Edward2/TFP. InferPy is distributed as open-source software under the Apache-2.0 license.
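To give a flavour of the API, a probabilistic PCA model written in the style of the InferPy documentation looks roughly as follows (a sketch; exact argument names may vary across InferPy versions):

    # Rough sketch in the style of InferPy's documented API; details may vary by version.
    import inferpy as inf
    import tensorflow as tf

    @inf.probmodel
    def pca(k, d):
        # Global latent variable: the k x d matrix of principal directions.
        w = inf.Normal(loc=tf.zeros([k, d]), scale=1.0, name="w")
        # Plate over data points: local latent code z and observed x.
        with inf.datamodel():
            z = inf.Normal(tf.zeros([k]), 1.0, name="z")
            x = inf.Normal(z @ w, 1.0, name="x")

Fitting such a model then amounts to supplying a matching variational q-model and the observed data; the optimization and device-placement details are handled internally, which is what points (ii) and (iii) above refer to.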
Probabilistic Models with Deep Neural Networks
Masegosa, Andrés R., Cabañas, Rafael, Langseth, Helge, Nielsen, Thomas D., Salmerón, Antonio
Recent advances in statistical inference have significantly expanded the toolbox of probabilistic modeling. Historically, probabilistic modeling has been constrained to (i) very restricted model classes where exact or approximate probabilistic inference was feasible, and (ii) small or medium-sized data sets which fit within the main memory of the computer. However, developments in variational inference, a general form of approximate probabilistic inference that originated in statistical physics, are allowing probabilistic modeling to overcome these restrictions: (i) approximate probabilistic inference is now possible over a broad class of probabilistic models containing a large number of parameters, and (ii) scalable inference methods based on stochastic gradient descent and distributed computation engines make it possible to apply probabilistic modeling to massive data sets. One important practical consequence of these advances is the possibility of including deep neural networks within a probabilistic model to capture complex non-linear stochastic relationships between random variables. These advances, in conjunction with the release of novel probabilistic modeling toolboxes, have greatly expanded the scope of application of probabilistic models, and allow these models to take advantage of the recent strides made by the deep learning community. In this paper we review the main concepts, methods and tools needed to use deep neural networks within a probabilistic modeling framework.
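As a concrete reference point for the variational machinery discussed above, the quantity optimized by (stochastic) variational inference is the standard evidence lower bound of a latent variable model p(x, z), where a neural network may parameterize p(x | z):

\[
\log p(x) \;\geq\; \mathcal{L}(q) \;=\; \mathbb{E}_{q(z)}\big[\log p(x \mid z)\big] \;-\; \mathrm{KL}\big(q(z)\,\|\,p(z)\big).
\]

Maximizing \(\mathcal{L}\) by stochastic gradient ascent over mini-batches is what makes the approach scalable to data sets that do not fit in main memory.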
AMIDST: a Java Toolbox for Scalable Probabilistic Machine Learning
Masegosa, Andrés R., Martínez, Ana M., Ramos-López, Darío, Cabañas, Rafael, Salmerón, Antonio, Nielsen, Thomas D., Langseth, Helge, Madsen, Anders L.
The AMIDST Toolbox is a software package for scalable probabilistic machine learning with a special focus on (massive) streaming data. The toolbox supports a flexible modeling language based on probabilistic graphical models with latent variables and temporal dependencies. The specified models can be learnt from large data sets using parallel or distributed implementations of Bayesian learning algorithms for either streaming or batch data. These algorithms are based on a flexible variational message passing scheme, which supports discrete and continuous variables from a wide range of probability distributions. AMIDST also leverages existing functionality and algorithms by interfacing to software tools such as Flink, Spark, MOA, Weka, R and HUGIN. AMIDST is an open source toolbox written in Java and available at http://www.amidsttoolbox.com under the Apache Software License version 2.0.