Decision Tree Learning
LdSM: Logarithm-depth Streaming Multi-label Decision Trees
Majzoubi, Maryam, Choromanska, Anna
We consider multi-label classification where the goal is to annotate each data point with the most relevant $\textit{subset}$ of labels from an extremely large label set. Efficient annotation can be achieved with balanced tree predictors, i.e. trees with logarithmic-depth in the label complexity, whose leaves correspond to labels. Designing prediction mechanism with such trees for real data applications is non-trivial as it needs to accommodate sending examples to multiple leaves while at the same time sustain high prediction accuracy. In this paper we develop the LdSM algorithm for the construction and training of multi-label decision trees, where in every node of the tree we optimize a novel objective function that favors balanced splits, maintains high class purity of children nodes, and allows sending examples to multiple directions but with a penalty that prevents tree over-growth. Each node of the tree is trained once the previous one is completed leading to a streaming approach for training. We analyze the proposed method theoretically and show that minimizing the objective leads to pure and balanced data splits. Furthermore, we prove that optimizing it results in the monotonic decrease of the error with every split. Experimental results on benchmark data sets demonstrate that our approach achieves high prediction accuracy with logarithmic-depth trees and position LdSM as a competitive tool among existing state-of-the-art tree-based approaches in terms of the statistical performance and prediction time.
Nonparametric Functional Approximation with Delaunay Triangulation
We propose a differentiable nonparametric algorithm, the Delaunay triangulation learner (DTL), to solve the functional approximation problem on the basis of a $p$-dimensional feature space. By conducting the Delaunay triangulation algorithm on the data points, the DTL partitions the feature space into a series of $p$-dimensional simplices in a geometrically optimal way, and fits a linear model within each simplex. We study its theoretical properties by exploring the geometric properties of the Delaunay triangulation, and compare its performance with other statistical learners in numerical studies.
Heterogeneous causal effects with imperfect compliance: a novel Bayesian machine learning approach
Bargagli-Stoffi, Falco J., De-Witte, Kristof, Gnecco, Giorgio
This paper introduces an innovative Bayesian machine learning algorithm to draw inference on heterogeneous causal effects in the presence of imperfect compliance (e.g., under an irregular assignment mechanism). We show, through Monte Carlo simulations, that the proposed Bayesian Causal Forest with Instrumental Variable (BCF-IV) algorithm outperforms other machine learning techniques tailored for causal inference (namely, Generalized Random Forest and Causal Trees with Instrumental Variable) in estimating the causal effects. Moreover, we show that it converges to an optimal asymptotic performance in discovering the drivers of heterogeneity in a simulated scenario. BCF-IV sheds a light on the heterogeneity of causal effects in instrumental variable scenarios and, in turn, provides the policy-makers with a relevant tool for targeted policies. Its empirical application evaluates the effects of additional funding on students' performances. The results indicate that BCF-IV could be used to enhance the effectiveness of school funding on students' performance by 3.2 to 3.5 times.
Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting
Mihaita, Adriana-Simona, Liu, Zheyuan, Cai, Chen, Rizoiu, Marian-Andrei
Abstract: Predicting traffic incident duration is a major challenge for many traffic centres around the world. Most research studies focus on predicting the incident duration on motorways rather than arterial roads, due to a high network complexity and lack of data. In this paper we propose a bi-level framework for predicting the accident duration on arterial road networks in Sydney, based on operational requirements of incident clearance target which is less than 45 minutes. Using incident baseline information, we first deploy a classification method using various ensemble tree models in order to predict whether a new incident will be cleared in less than 45min or not. If the incident was classified as short-term, then various regression models are developed for predicting the actual incident duration in minutes by incorporating various traffic flow features. After outlier removal and intensive model hyper-parameter tuning through randomized search and cross-validation, we show that the extreme gradient boost approach outperformed all models, including the gradient-boosted decision-trees by almost 53%. Finally, we perform a feature importance evaluation for incident duration prediction and show that the best prediction results are obtained when leveraging the real-time traffic flow in vicinity road sections to the reported accident location. Initial methods used to predict the incident duration were 1. Introduction Bayesian classifiers [5], discrete choice models (DCM) [6], probabilistic distribution analyses [7], and the hazard-based Traffic congestion is a major concern for many cities duration models (HBDM) [8].
Fairness and Missing Values
Martínez-Plumed, Fernando, Ferri, Cèsar, Nieves, David, Hernández-Orallo, José
The causes underlying unfair decision making are complex, being internalised in different ways by decision makers, other actors dealing with data and models, and ultimately by the individuals being affected by these decisions. One frequent manifestation of all these latent causes arises in the form of missing values: protected groups are more reluctant to give information that could be used against them, delicate information for some groups can be erased by human operators, or data acquisition may simply be less complete and systematic for minority groups. As a result, missing values and bias in data are two phenomena that are tightly coupled. However, most recent techniques, libraries and experimental results dealing with fairness in machine learning have simply ignored missing data. In this paper, we claim that fairness research should not miss the opportunity to deal properly with missing data. To support this claim, (1) we analyse the sources of missing data and bias, and we map the common causes, (2) we find that rows containing missing values are usually fairer than the rest, which should not be treated as the uncomfortable ugly data that different techniques and libraries get rid of at the first occasion, and (3) we study the trade-off between performance and fairness when the rows with missing values are used (either because the technique deals with them directly or by imputation methods). We end the paper with a series of recommended procedures about what to do with missing data when aiming for fair decision making.
Flexible Mining of Prefix Sequences from Time-Series Traces
da Costa, Antonio Anastasio Bruto, Frehse, Goran, Dasgupta, Pallab
Mining temporal assertions from time-series data using information theory to filter real properties from incidental ones is a practically significant challenge. The problem is complex for continuous or hybrid systems because the degrees of influence on a consequent from a timed-sequence of predicates (called its prefix sequence), varies continuously over dense time intervals. We propose a parameterized method that uses interval arithmetic for flexibly learning prefix sequences having influence on a defined consequent over various time scales and predicates over system variables.
Model-Agnostic Counterfactual Explanations for Consequential Decisions
Karimi, Amir-Hossein, Barthe, Gilles, Balle, Borja, Valera, Isabel
Predictive models are being increasingly used to support consequential decision making at the individual level in contexts such as pretrial bail and loan approval. As a result, there is increasing social and legal pressure to provide explanations that help the affected individuals not only to understand why a prediction was output, but also how to act to obtain a desired outcome. To this end, several works have proposed methods to generate counterfactual explanations. However, they are often restricted to a particular subset of models (e.g., decision trees or linear models), and cannot directly handle the mixed (numerical and nominal) nature of the features describing each individual. In this paper, we propose a model-agnostic algorithm to generate counterfactual explanations that builds on the standard theory and tools from formal verification. Specifically, our algorithm solves a sequence of satisfiability problems, where a wide variety of predictive models and distances in mixed feature spaces, as well as natural notions of plausibility and diversity, are represented as logic formulas. Our experiments on real-world data demonstrate that our approach can flexibly handle widely deployed predictive models, while providing meaningfully closer counterfactuals than existing approaches.
Evaluating time series forecasting models: An empirical study on performance estimation methods
Cerqueira, Vitor, Torgo, Luis, Mozetic, Igor
Performance estimation aims at estimating the loss that a predictive model will incur on unseen data. These procedures are part of the pipeline in every machine learning project and are used for assessing the overall generalisation ability of predictive models. In this paper we address the application of these methods to time series forecasting tasks. For independent and identically distributed data the most common approach is cross-validation. However, the dependency among observations in time series raises some caveats about the most appropriate way to estimate performance in this type of data and currently there is no settled way to do so. We compare different variants of cross-validation and of out-of-sample approaches using two case studies: One with 62 real-world time series and another with three synthetic time series. Results show noticeable differences in the performance estimation methods in the two scenarios. In particular, empirical experiments suggest that cross-validation approaches can be applied to stationary time series. However, in real-world scenarios, when different sources of non-stationary variation are at play, the most accurate estimates are produced by out-of-sample methods that preserve the temporal order of observations.
On the Art and Science of Machine Learning Explanations
This text discusses several popular explanatory methods that go beyond the error measurements and plots traditionally used to assess machine learning models. Some of the explanatory methods are accepted tools of the trade while others are rigorously derived and backed by long-standing theory. The methods, decision tree surrogate models, individual conditional expectation (ICE) plots, local interpretable model-agnostic explanations (LIME), partial dependence plots, and Shapley explanations, vary in terms of scope, fidelity, and suitable application domain. Along with descriptions of these methods, this text presents real-world usage recommendations supported by a use case and public, in-depth software examples for reproducibility.
Best-scored Random Forest Classification
Hang, Hanyuan, Liu, Xiaoyu, Steinwart, Ingo
We propose an algorithm named best-scored random forest for binary classification problems. The terminology "best-scored" means to select the one with the best empirical performance out of a certain number of purely random tree candidates as each single tree in the forest. In this way, the resulting forest can be more accurate than the original purely random forest. From the theoretical perspective, within the framework of regularized empirical risk minimization penalized on the number of splits, we establish almost optimal convergence rates for the proposed best-scored random trees under certain conditions which can be extended to the best-scored random forest. In addition, we present a counterexample to illustrate that in order to ensure the consistency of the forest, every dimension must have the chance to be split. In the numerical experiments, for the sake of efficiency, we employ an adaptive random splitting criterion. Comparative experiments with other state-of-art classification methods demonstrate the accuracy of our best-scored random forest.