Flach, Peter
Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration
Kull, Meelis, Perello-Nieto, Miquel, Kängsepp, Markus, Filho, Telmo Silva, Song, Hao, Flach, Peter
Class probabilities predicted by most multiclass classifiers are uncalibrated, often tending towards over-confidence. With neural networks, calibration can be improved by temperature scaling, a method to learn a single corrective multiplicative factor for inputs to the last softmax layer. On non-neural models the existing methods apply binary calibration in a pairwise or one-vs-rest fashion. We propose a natively multiclass calibration method applicable to classifiers from any model class, derived from Dirichlet distributions and generalising the beta calibration method from binary classification. It is easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax. Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers. Parameters of the learned Dirichlet calibration map provide insights into the biases in the uncalibrated model.
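The map described in the abstract (log-transform, one linear layer, softmax) can be sketched in a few lines of plain Python. This is a minimal illustration only: the function names, the clipping constant `eps`, and the hand-picked parameters `W` and `b` are assumptions made for the example, and fitting `W` and `b` on held-out data (for instance by minimising log-loss) is left out.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_calibrate(probs, W, b, eps=1e-12):
    """Apply softmax(W . log(probs) + b) to one probability vector.

    W is a k-by-k list of rows and b a length-k list; both would normally
    be learned on a validation set. eps is an assumed clipping constant
    that avoids log(0).
    """
    logs = [math.log(max(p, eps)) for p in probs]
    z = [sum(w * l for w, l in zip(row, logs)) + bias
         for row, bias in zip(W, b)]
    return softmax(z)

def temperature_scale(probs, T):
    """Temperature scaling written as the special case W = I / T, b = 0."""
    k = len(probs)
    W = [[(1.0 / T if i == j else 0.0) for j in range(k)] for i in range(k)]
    return dirichlet_calibrate(probs, W, [0.0] * k)

probs = [0.7, 0.2, 0.1]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unchanged = dirichlet_calibrate(probs, identity, [0.0, 0.0, 0.0])
softened = temperature_scale(probs, T=2.0)
```

With the identity matrix and a zero bias the map returns its input, because softmax of log-probabilities recovers the probabilities. A temperature above 1 flattens the vector, which is the usual remedy for over-confidence.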
FACE: Feasible and Actionable Counterfactual Explanations
Poyiadzi, Rafael, Sokol, Kacper, Santos-Rodriguez, Raul, De Bie, Tijl, Flach, Peter
Work in Counterfactual Explanations tends to focus on the principle of "the closest possible world" that identifies small changes leading to the desired outcome. In this paper we argue that while this approach might initially seem intuitively appealing it exhibits shortcomings not addressed in the current literature. First, a counterfactual example generated by the state-of-the-art systems is not necessarily representative of the underlying data distribution, and may therefore prescribe unachievable goals (e.g., an unsuccessful life insurance applicant with severe disability may be advised to do more sports). Secondly, the counterfactuals may not be based on a "feasible path" between the current state of the subject and the suggested one, making actionable recourse infeasible (e.g., low-skilled unsuccessful mortgage applicants may be told to double their salary, which may be hard without first increasing their skill level). These two shortcomings may render counterfactual explanations impractical and sometimes outright offensive. To address these two major flaws, first of all, we propose a new line of Counterfactual Explanations research aimed at providing actionable and feasible paths to transform a selected instance into one that meets a certain goal. Secondly, we propose FACE: an algorithmically sound way of uncovering these "feasible paths" based on the shortest path distances defined via density-weighted metrics. Our approach generates counterfactuals that are coherent with the underlying data distribution and supported by the "feasible paths" of change, which are achievable and can be tailored to the problem at hand.
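The idea of shortest paths under a density-weighted metric can be illustrated with a small graph search. The sketch below is not the authors' implementation: the edge weight (distance divided by density at the midpoint), the `density` callable, the neighbourhood radius `eps`, and the toy data are all assumptions chosen to show how a longer route through dense regions can beat a shorter route through sparse ones.

```python
import heapq
import math

def build_graph(points, density, eps):
    """Connect points at most eps apart; weight = distance / midpoint density.

    The weighting is an illustrative choice: any positive function that
    decreases with density would penalise sparse regions similarly.
    """
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        for j in range(i + 1, len(points)):
            q = points[j]
            d = math.dist(p, q)
            if d <= eps:
                mid = tuple((a + b) / 2 for a, b in zip(p, q))
                w = d / density(mid)
                graph[i].append((j, w))
                graph[j].append((i, w))
    return graph

def shortest_path(graph, source, targets):
    """Dijkstra from source to the nearest node in targets."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u in targets:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return math.inf, []

# Toy data: the bottom row (y = 0) is sparse, the top row (y = 1) is dense.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
toy_density = lambda x: 1.0 if x[1] >= 0.5 else 0.1  # hypothetical density estimate
graph = build_graph(points, toy_density, eps=1.0)
cost, path = shortest_path(graph, source=0, targets={2})
```

Here node 0 plays the role of the instance to be explained and node 2 a point with the desired prediction. The direct route along the bottom row is geometrically shorter but crosses a sparse region, so the search returns the detour through the dense top row.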
FAT Forensics: A Python Toolbox for Algorithmic Fairness, Accountability and Transparency
Sokol, Kacper, Santos-Rodriguez, Raul, Flach, Peter
Machine learning algorithms can take important decisions, sometimes legally binding, about our everyday life. In most cases, however, these systems and decisions are neither regulated nor certified. Given the potential harm that these algorithms can cause, qualities such as fairness, accountability and transparency of predictive systems are of paramount importance. Recent literature suggested voluntary self-reporting on these aspects of predictive systems -- e.g., data sheets for data sets -- but their scope is often limited to a single component of a machine learning pipeline, and producing them requires manual labour. To resolve this impasse and ensure high-quality, fair, transparent and reliable machine learning systems, we developed an open source toolbox that can inspect selected fairness, accountability and transparency aspects of these systems to automatically and objectively report them back to their engineers and users. We describe design, scope and usage examples of this Python toolbox in this paper. The toolbox provides functionality for inspecting fairness, accountability and transparency of all aspects of the machine learning process: data (and their features), models and predictions. It is available to the public under the BSD 3-Clause open source licence.
HyperStream: a Workflow Engine for Streaming Data
Diethe, Tom, Kull, Meelis, Twomey, Niall, Sokol, Kacper, Song, Hao, Perello-Nieto, Miquel, Tonkin, Emma, Flach, Peter
Journal of Machine Learning Research 1 (2019) 1-48. Submitted 8/19; Published 10/00. Tom Diethe (tdiethe@amazon.com), Intelligent Systems Laboratory, University of Bristol, BS8 1UB, UK. Editor: A. N. Other. Abstract: This paper describes HyperStream, a large-scale, flexible and robust software package, written in the Python language, for processing streaming data with workflow creation capabilities. HyperStream overcomes the limitations of other computational engines and provides high-level interfaces to execute complex nesting, fusion, and prediction both in online and offline forms in streaming environments. HyperStream is a general purpose tool that is well-suited for the design, development, and deployment of Machine Learning algorithms and predictive models in a wide space of sequential predictive problems. Introduction: Scientific workflow systems are designed to compose and execute a series of computational or data manipulation operations (a workflow) (Deelman et al., 2009).
Distribution Calibration for Regression
Song, Hao, Diethe, Tom, Kull, Meelis, Flach, Peter
We are concerned with obtaining well-calibrated output distributions from regression models. Such distributions allow us to quantify the uncertainty that the model has regarding the predicted target value. We introduce the novel concept of distribution calibration, and demonstrate its advantages over the existing definition of quantile calibration. We further propose a post-hoc approach to improving the predictions from previously trained regression models, using multi-output Gaussian Processes with a novel Beta link function. The proposed method is experimentally verified on a set of common regression models and shows improvements for both distribution-level and quantile-level calibration.
$\beta^3$-IRT: A New Item Response Model and its Applications
Chen, Yu, Filho, Telmo Silva, Prudêncio, Ricardo B. C., Diethe, Tom, Flach, Peter
Item Response Theory (IRT) aims to assess latent abilities of respondents based on the correctness of their answers in aptitude test items with different difficulty levels. In this paper, we propose the $\beta^3$-IRT model, which models continuous responses and can generate a much enriched family of Item Characteristic Curves (ICCs). In experiments we applied the proposed model to data from an online exam platform, and show that our model outperforms a more standard 2PL-ND model on all datasets. Furthermore, we show how to apply $\beta^3$-IRT to assess the ability of machine learning classifiers. This novel application results in a new metric for evaluating the quality of the classifier's probability estimates, based on the inferred difficulty and discrimination of data instances.
Non-Parametric Calibration of Probabilistic Regression
Song, Hao, Kull, Meelis, Flach, Peter
The task of calibration is to retrospectively adjust the outputs from a machine learning model to provide better probability estimates on the target variable. While calibration has been investigated thoroughly in classification, it has not yet been well-established for regression tasks. This paper considers the problem of calibrating a probabilistic regression model to improve the estimated probability densities over the real-valued targets. We propose to calibrate a regression model through the cumulative probability density, which can be derived from calibrating a multi-class classifier. We provide three non-parametric approaches to solve the problem, two of which provide empirical estimates and the third of which provides smooth density estimates. The proposed approaches are experimentally evaluated to show their ability to improve the performance of regression models on the predictive likelihood.
Probabilistic Sensor Fusion for Ambient Assisted Living
Diethe, Tom, Twomey, Niall, Kull, Meelis, Flach, Peter, Craddock, Ian
There is a widely-accepted need to revise current forms of health-care provision, with particular interest in sensing systems in the home. Given a multiple-modality sensor platform with heterogeneous network connectivity, as is under development in the Sensor Platform for HEalthcare in Residential Environment (SPHERE) Interdisciplinary Research Collaboration (IRC), we face specific challenges relating to the fusion of the heterogeneous sensor modalities. We introduce Bayesian models for sensor fusion, which aim to address the challenges of fusion of heterogeneous sensor modalities. Using this approach we are able to identify the modalities that have most utility for each particular activity, and simultaneously identify which features within that modality are most relevant for a given activity. We further show how the two separate tasks of location prediction and activity recognition can be fused into a single model, which allows for simultaneous learning and prediction for both tasks. We analyse the performance of this model on data collected in the SPHERE house, and show its utility. We also compare against some benchmark models which do not have the full structure, and show how the proposed model compares favourably to these methods.
Precision-Recall-Gain Curves: PR Analysis Done Right
Flach, Peter, Kull, Meelis
Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier's performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to reporting Precision-Recall (PR) curves and associated areas as performance metrics. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions -- e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the $F_{\beta}$ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected $F_1$ score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier's scores so as to determine, for each operating point on the convex hull, the interval of $\beta$ values for which the point optimises $F_{\beta}$. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected $F_1$ score than others, and so the use of Precision-Recall-Gain curves will result in better model selection.
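The change of coordinate system can be made concrete with a short sketch. The abstract does not spell out the transform; the form used below, gain(x) = (x - pi) / ((1 - pi) x) with pi the proportion of positives, is assumed here as the harmonic rescaling usually associated with Precision-Recall-Gain curves. The function names are chosen for the example only.

```python
def gain(x, pi):
    """Harmonic rescaling of a precision, recall or F-score value x.

    pi is the proportion of positives (the baseline). The result is 0 at
    x = pi and 1 at x = 1; the transform is linear in 1/x.
    """
    return (x - pi) / ((1.0 - pi) * x)

def f1_score(prec, rec):
    """F1 as the harmonic mean of precision and recall."""
    return 2.0 * prec * rec / (prec + rec)

pi = 0.25
prec, rec = 0.8, 0.5

prec_gain = gain(prec, pi)
rec_gain = gain(rec, pi)
f1_gain = gain(f1_score(prec, rec), pi)

# In gain space the harmonic mean becomes an arithmetic mean.
mean_of_gains = (prec_gain + rec_gain) / 2.0
```

Because the transform is linear in 1/x, the gain of the F1 score equals the plain average of precision gain and recall gain. This is one way to see why arithmetic averaging, and hence area under the curve, is coherent in the transformed coordinates.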
Threshold Choice Methods: the Missing Link
Hernández-Orallo, José, Flach, Peter, Ferri, Cèsar
Many performance metrics have been introduced for the evaluation of classification performance, with different origins and niches of application: accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the absolute error, and the Brier score (with its decomposition into refinement and calibration). One way of understanding the relation among some of these metrics is the use of variable operating conditions (either in the form of misclassification costs or class proportions). Thus, a metric may correspond to some expected loss over a range of operating conditions. One dimension for the analysis has been precisely the distribution we take for this range of operating conditions, leading to some important connections in the area of proper scoring rules. However, we show that there is another dimension which has not received attention in the analysis of performance metrics. This new dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the loss of these methods for a uniform range of operating conditions we get the 0-1 loss, the absolute error, the Brier score (mean squared error), the AUC and the refinement loss respectively. This provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation, namely: take a model, apply several threshold choice methods consistent with the information which is (and will be) available about the operating condition, and compare their expected losses. In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibration in choosing the threshold choice method.
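Two of the correspondences listed in the abstract can be checked numerically. The sketch below assumes a particular cost convention that is not given in the abstract: scores estimate the probability of class 1, a false positive costs 2c, a false negative costs 2(1 - c), and the cost proportion c is uniform on (0, 1). The function names and the toy data are hypothetical.

```python
def expected_loss(scores, labels, threshold_of_c, grid=1000):
    """Average cost-weighted loss over a uniform grid of cost proportions c.

    threshold_of_c maps a cost proportion to a decision threshold; an
    instance is predicted as class 1 when its score exceeds the threshold.
    """
    n = len(scores)
    total = 0.0
    for k in range(grid):
        c = (k + 0.5) / grid          # midpoint of the k-th grid cell
        t = threshold_of_c(c)
        loss = 0.0
        for s, y in zip(scores, labels):
            pred = 1 if s > t else 0
            if pred == 1 and y == 0:
                loss += 2.0 * c           # assumed false positive cost
            elif pred == 0 and y == 1:
                loss += 2.0 * (1.0 - c)   # assumed false negative cost
        total += loss / n
    return total / grid

def brier_score(scores, labels):
    """Mean squared error between scores and 0/1 labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def error_rate(scores, labels, t=0.5):
    """0-1 loss at a fixed threshold t."""
    wrong = sum((1 if s > t else 0) != y for s, y in zip(scores, labels))
    return wrong / len(scores)

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

fixed = expected_loss(scores, labels, lambda c: 0.5)      # fixed threshold
score_driven = expected_loss(scores, labels, lambda c: c)  # threshold = c
```

Under these assumptions the fixed threshold reproduces the 0-1 loss, and the score-driven threshold approximates the Brier score up to the discretisation error of the grid.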