Decision Tree Learning
Watermarking Decision Tree Ensembles
Calzavara, Stefano, Cazzaro, Lorenzo, Gera, Donald, Orlando, Salvatore
Protecting the intellectual property of machine learning models is a hot topic and many watermarking schemes for deep neural networks have been proposed in the literature. Unfortunately, prior work largely neglected the investigation of watermarking techniques for other types of models, including decision tree ensembles, which are a state-of-the-art model for classification tasks on non-perceptual data. In this paper, we present the first watermarking scheme designed for decision tree ensembles, focusing in particular on random forest models. We discuss watermark creation and verification, presenting a thorough security analysis with respect to possible attacks. We finally perform an experimental evaluation of the proposed scheme, showing excellent results in terms of accuracy and security against the most relevant threats.
Application of AI in Credit Risk Scoring for Small Business Loans: A case study on how AI-based random forest model improves a Delphi model outcome in the case of Azerbaijani SMEs
The research investigates how the application of a machine-learning random forest model improves the accuracy and precision of a Delphi model. The context of the research is Azerbaijani SMEs and the data for the study has been obtained from a financial institution which had gathered it from the enterprises (as there is no public data on local SMEs, it was not practical to verify the data independently). The research used accuracy, precision, recall and F-1 scores for both models to compare them and run the algorithms in Python. The findings showed that accuracy, precision, recall and F- 1 all improve considerably (from 0.69 to 0.83, from 0.65 to 0.81, from 0.56 to 0.77 and from 0.58 to 0.79, respectively). The implications are that by applying AI models in credit risk modeling, financial institutions can improve the accuracy of identifying potential defaulters which would reduce their credit risk. In addition, an unfair rejection of credit access for SMEs would also go down having a significant contribution to an economic growth in the economy. Finally, such ethical issues as transparency of algorithms and biases in historical data should be taken on board while making decisions based on AI algorithms in order to reduce mechanical dependence on algorithms that cannot be justified in practice.
Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
Kaลบmierczak, Stanisลaw, Maลdziuk, Jacek
Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set (N). Research literature indicates that drawing fewer than N observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than N observations (BR > 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is 1 or > 1 for a given dataset, achieving between 81.88% and 88.81% accuracy, depending on the experiment configuration. Random forest (RF) algorithm, introduced by Breiman (2001), is an ensemble of decision trees (DTs) that collectively make decisions using either majority or soft voting. RF reduces variance, sometimes at the cost of slightly increasing bias, by introducing two sources of randomness.
Comparative study of regression vs pairwise models for surrogate-based heuristic optimisation
Naharro, Pablo S., Toharia, Pablo, LaTorre, Antonio, Peรฑa, Josรฉ-Marรญa
Heuristic optimisation algorithms explore the search space by sampling solutions, evaluating their fitness, and biasing the search in the direction of promising solutions. However, in many cases, this fitness function involves executing expensive computational calculations, drastically reducing the reasonable number of evaluations. In this context, surrogate models have emerged as an excellent alternative to alleviate these computational problems. This paper addresses the formulation of surrogate problems as both regression models that approximate fitness (surface surrogate models) and a novel way to connect classification models (pairwise surrogate models). The pairwise approach can be directly exploited by some algorithms, such as Differential Evolution, in which the fitness value is not actually needed to drive the search, and it is sufficient to know whether a solution is better than another one or not. Based on these modelling approaches, we have conducted a multidimensional analysis of surrogate models under different configurations: different machine learning algorithms (regularised regression, neural networks, decision trees, boosting methods, and random forests), different surrogate strategies (encouraging diversity or relaxing prediction thresholds), and compare them for both surface and pairwise surrogate models. The experimental part of the article includes the benchmark problems already proposed for the SOCO2011 competition in continuous optimisation and a simulation problem included in the recent GECCO2021 Industrial Challenge. This paper shows that the performance of the overall search, when using online machine learning-based surrogate models, depends not only on the accuracy of the predictive model but also on both the kind of bias towards positive or negative cases and how the optimisation uses those predictions to decide whether to execute the actual fitness function.
Label Distribution Learning Forests
Wei Shen, KAI ZHAO, Yilu Guo, Alan L. Yuille
Label distribution learning (LDL) is a general learning framework, which assigns to an instance a distribution over a set of labels rather than a single label or multiple labels. Current LDL methods have either restricted assumptions on the expression form of the label distribution or limitations in representation learning, e.g., to learn deep features in an end-to-end manner. This paper presents label distribution learning forests (LDLFs) - a novel label distribution learning algorithm based on differentiable decision trees, which have several advantages: 1) Decision trees have the potential to model any general form of label distributions by a mixture of leaf node predictions.
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much).
Cost efficient gradient boosting
Sven Peter, Ferran Diego, Fred A. Hamprecht, Boaz Nadler
Many applications require learning classifiers or regressors that are both accurate and cheap to evaluate. Prediction cost can be drastically reduced if the learned predictor is constructed such that on the majority of the inputs, it uses cheap features and fast evaluations. The main challenge is to do so with little loss in accuracy. In this work we propose a budget-aware strategy based on deep boosted regression trees. In contrast to previous approaches to learning with cost penalties, our method can grow very deep trees that on average are nonetheless cheap to compute. We evaluate our method on a number of datasets and find that it outperforms the current state of the art by a large margin. Our algorithm is easy to implement and its learning time is comparable to that of the original gradient boosting.
Ranking Perspective for Tree-based Methods with Applications to Symbolic Feature Selection
Tree-based methods are powerful nonparametric techniques in statistics and machine learning. However, their effectiveness, particularly in finite-sample settings, is not fully understood. Recent applications have revealed their surprising ability to distinguish transformations (which we call symbolic feature selection) that remain obscure under current theoretical understanding. This work provides a finite-sample analysis of tree-based methods from a ranking perspective. We link oracle partitions in tree methods to response rankings at local splits, offering new insights into their finite-sample behavior in regression and feature selection tasks. Building on this local ranking perspective, we extend our analysis in two ways: (i) We examine the global ranking performance of individual trees and ensembles, including Classification and Regression Trees (CART) and Bayesian Additive Regression Trees (BART), providing finite-sample oracle bounds, ranking consistency, and posterior contraction results. (ii) Inspired by the ranking perspective, we propose concordant divergence statistics $\mathcal{T}_0$ to evaluate symbolic feature mappings and establish their properties. Numerical experiments demonstrate the competitive performance of these statistics in symbolic feature selection tasks compared to existing methods.
Maximum Margin Interval Trees
Alexandre Drouin, Toby Hocking, Francois Laviolette
Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.