Ensemble Learning
Application of AI in Credit Risk Scoring for Small Business Loans: A case study on how AI-based random forest model improves a Delphi model outcome in the case of Azerbaijani SMEs
The research investigates how applying a machine-learning random forest model improves on the accuracy and precision of a Delphi model. The context of the research is Azerbaijani SMEs, and the data were obtained from a financial institution that had gathered them from the enterprises (as there is no public data on local SMEs, it was not practical to verify the data independently). Both models were run in Python and compared on accuracy, precision, recall and F1 score. The findings show that accuracy, precision, recall and F1 all improve considerably (from 0.69 to 0.83, from 0.65 to 0.81, from 0.56 to 0.77 and from 0.58 to 0.79, respectively). The implication is that by applying AI models in credit risk modeling, financial institutions can identify potential defaulters more accurately, which would reduce their credit risk. In addition, unfair rejections of credit applications from SMEs would decline, contributing significantly to economic growth. Finally, ethical issues such as the transparency of algorithms and biases in historical data should be taken into account when making decisions based on AI algorithms, in order to reduce mechanical dependence on algorithms that cannot be justified in practice.
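The four metrics used for the comparison can be computed from the confusion counts of a binary classifier. The following is a minimal sketch with made-up toy labels (1 = default, 0 = no default); it does not use the study's data, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Precision: share of predicted defaulters that really defaulted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of real defaulters that the model caught.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a credit setting, low recall means missed defaulters, while low precision means creditworthy applicants are rejected, which is why the abstract reports both alongside accuracy.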
Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
Kaźmierczak, Stanisław, Mańdziuk, Jacek
Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set (N). Research literature indicates that drawing fewer than N observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than N observations (BR > 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR ≤ 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is ≤ 1 or > 1 for a given dataset, achieving between 81.88% and 88.81% accuracy, depending on the experiment configuration.
The random forest (RF) algorithm, introduced by Breiman (2001), is an ensemble of decision trees (DTs) that collectively make decisions using either majority or soft voting. RF reduces variance, sometimes at the cost of slightly increasing bias, by introducing two sources of randomness.
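To make the bootstrap rate concrete, the sketch below draws round(BR × N) indices with replacement and measures what fraction of the N original instances appears at least once. The expected fraction is 1 − (1 − 1/N)^(BR·N), roughly 1 − e^(−BR), which is a standard property of sampling with replacement rather than a result from the paper. The function names are hypothetical and only the Python standard library is used.

```python
import math
import random


def bootstrap_indices(n, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n) indices with replacement from range(n)."""
    size = max(1, round(bootstrap_rate * n))
    return [rng.randrange(n) for _ in range(size)]


def expected_unique_fraction(n, bootstrap_rate):
    """Expected share of the n instances that occur at least once in the sample."""
    size = max(1, round(bootstrap_rate * n))
    return 1.0 - (1.0 - 1.0 / n) ** size


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
observed = {}
for br in (0.5, 1.0, 1.2, 2.0, 5.0):
    sample = bootstrap_indices(n, br, rng)
    observed[br] = len(set(sample)) / n
    print(f"BR={br}: sample size={len(sample)}, "
          f"unique fraction={observed[br]:.3f}, "
          f"expected={expected_unique_fraction(n, br):.3f}")
```

With BR = 1 each tree sees roughly 63% of the distinct instances; raising BR raises that share and therefore shrinks the out-of-bag set, which is one way to read what the BR > 1 setting changes for each component tree.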
NRGBoost: Energy-Based Generative Boosted Trees
Despite the rise to dominance of deep learning in unstructured data domains, tree-based methods such as Random Forests (RF) and Gradient Boosted Decision Trees (GBDT) are still the workhorses for handling discriminative tasks on tabular data. We explore generative extensions of these popular algorithms with a focus on explicitly modeling the data density (up to a normalization constant), thus enabling other applications besides sampling. As our main contribution we propose an energy-based generative boosting algorithm that is analogous to the second order boosting implemented in popular packages like XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm can achieve similar discriminative performance to GBDT on a number of real world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural network based models for sampling.
Generative models have achieved tremendous success in computer vision and natural language processing, where the ability to generate synthetic data guided by user prompts opens up many exciting possibilities. While generating synthetic table records does not necessarily enjoy the same wide appeal, this problem has still received considerable attention as a potential avenue for bypassing privacy concerns when sharing data. Estimating the data density, p(x), is another typical application of generative models which enables a host of different use cases that can be particularly interesting for tabular data. Unlike discriminative models which are trained to perform inference over a single target variable, density models can be used more flexibly for inference over different variables or for out of distribution detection. They can also handle inference with missing data in a principled way by marginalizing over unobserved variables.
The development of generative models for tabular data has mirrored its progression in computer vision with many of its Deep Learning (DL) approaches being adapted to the tabular domain (Jordon et al., 2018; Xu et al., 2019; Fan et al., 2020; Engelmann & Lessmann, 2021; Zhao et al., 2021; Kotelnikov et al., 2023). Unfortunately, these methods are only useful for sampling as they either don't model the density explicitly or can't evaluate it due to intractable marginalization over high dimensional latent variable spaces.
Comparative study of regression vs pairwise models for surrogate-based heuristic optimisation
Naharro, Pablo S., Toharia, Pablo, LaTorre, Antonio, Peña, José-María
Heuristic optimisation algorithms explore the search space by sampling solutions, evaluating their fitness, and biasing the search in the direction of promising solutions. However, in many cases, this fitness function involves executing expensive computational calculations, drastically reducing the reasonable number of evaluations. In this context, surrogate models have emerged as an excellent alternative to alleviate these computational problems. This paper addresses the formulation of surrogate problems as both regression models that approximate fitness (surface surrogate models) and a novel way to connect classification models (pairwise surrogate models). The pairwise approach can be directly exploited by some algorithms, such as Differential Evolution, in which the fitness value is not actually needed to drive the search, and it is sufficient to know whether a solution is better than another one or not. Based on these modelling approaches, we have conducted a multidimensional analysis of surrogate models under different configurations: different machine learning algorithms (regularised regression, neural networks, decision trees, boosting methods, and random forests) and different surrogate strategies (encouraging diversity or relaxing prediction thresholds), compared for both surface and pairwise surrogate models. The experimental part of the article includes the benchmark problems already proposed for the SOCO2011 competition in continuous optimisation and a simulation problem included in the recent GECCO2021 Industrial Challenge. This paper shows that the performance of the overall search, when using online machine learning-based surrogate models, depends not only on the accuracy of the predictive model but also on both the kind of bias towards positive or negative cases and how the optimisation uses those predictions to decide whether to execute the actual fitness function.
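One way to picture the pairwise formulation is as a change of training set: instead of (solution, fitness) rows for a regressor, the surrogate is trained on ordered pairs of already-evaluated solutions, labelled by which one is better. The sketch below builds such a dataset for a minimisation problem. The concatenated feature encoding, the sphere test function and all names are assumptions made for illustration; the abstract does not specify the paper's encoding.

```python
from itertools import permutations


def sphere(x):
    """Toy fitness to minimise (a stand-in for an expensive evaluation)."""
    return sum(v * v for v in x)


def build_pairwise_dataset(solutions, fitnesses):
    """Turn evaluated solutions into ordered pairs for a binary classifier.

    Each row concatenates two solutions (a, b); the label is 1 when a has
    strictly lower fitness than b, i.e. a is better under minimisation.
    """
    features, labels = [], []
    for i, j in permutations(range(len(solutions)), 2):
        features.append(tuple(solutions[i]) + tuple(solutions[j]))
        labels.append(1 if fitnesses[i] < fitnesses[j] else 0)
    return features, labels


solutions = [(0.0, 0.0), (1.0, -1.0), (2.0, 0.5)]
fitnesses = [sphere(s) for s in solutions]
X, y = build_pairwise_dataset(solutions, fitnesses)
for row, label in zip(X, y):
    print(row, "->", label)
```

A classifier trained on such rows can answer the only question that Differential Evolution's selection step asks, namely whether the trial vector beats its parent, so the true fitness need only be computed when the optimiser decides the prediction is not enough.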
Minimax Adaptive Boosting for Online Nonparametric Regression
Liautaud, Paul, Gaillard, Pierre, Wintenberger, Olivier
We study boosting for adversarial online nonparametric regression with general convex losses. We first introduce a parameter-free online gradient boosting (OGB) algorithm and show that its application to chaining trees achieves minimax optimal regret when competing against Lipschitz functions. While competing with nonparametric function classes can be challenging, the latter often exhibit local patterns, such as local Lipschitzness, that online algorithms can exploit to improve performance. By applying OGB over a core tree based on chaining trees, our proposed method effectively competes against all prunings that align with different Lipschitz profiles and demonstrates optimal dependence on the local regularities. As a result, we obtain the first computationally efficient algorithm with locally adaptive optimal rates for online regression in an adversarial setting.
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much).
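The GOSS idea can be sketched in a few lines: keep every instance whose gradient magnitude is among the largest, draw a random subset of the remaining small-gradient instances, and up-weight that subset so the gradient statistics stay approximately unbiased. The abstract above does not spell out the compensation factor; the (1 − a) / b weight used here follows the algorithm as described in the full LightGBM paper, where a is the kept fraction and b the sampled fraction. The function and parameter names are hypothetical, and this is not LightGBM's implementation.

```python
import random


def goss_sample(gradients, top_rate, other_rate, rng):
    """Return {instance index: weight} for one GOSS-style subsample.

    top_rate   (a): fraction of instances kept for having the largest |gradient|
    other_rate (b): fraction of instances sampled at random from the rest
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)

    top = order[:n_top]    # large gradients: always kept
    rest = order[n_top:]   # small gradients: only sampled
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Sampled small-gradient instances are amplified to compensate
    # for the instances that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Made-up gradients for ten instances.
gradients = [0.9, -0.05, 0.02, -1.2, 0.1, 0.03, -0.01, 0.4, 0.07, -0.06]
weights = goss_sample(gradients, top_rate=0.2, other_rate=0.3,
                      rng=random.Random(0))
print(weights)
```

Only the instances in the returned dictionary would then be scanned when estimating the information gain of candidate splits, with each gradient multiplied by its weight.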
Cost efficient gradient boosting
Sven Peter, Ferran Diego, Fred A. Hamprecht, Boaz Nadler
Many applications require learning classifiers or regressors that are both accurate and cheap to evaluate. Prediction cost can be drastically reduced if the learned predictor is constructed such that on the majority of the inputs, it uses cheap features and fast evaluations. The main challenge is to do so with little loss in accuracy. In this work we propose a budget-aware strategy based on deep boosted regression trees. In contrast to previous approaches to learning with cost penalties, our method can grow very deep trees that on average are nonetheless cheap to compute. We evaluate our method on a number of datasets and find that it outperforms the current state of the art by a large margin. Our algorithm is easy to implement and its learning time is comparable to that of the original gradient boosting.
The Effect of Acute Stress on the Interpretability and Generalization of Schizophrenia Predictive Machine Learning Models
Vos, Gideon, Ebrahimpour, Maryam, van Eijk, Liza, Sarnyai, Zoltan, Azghadi, Mostafa Rahimi
Introduction: Schizophrenia is a severe mental disorder, and early diagnosis is key to improving outcomes. Its complexity makes predicting onset and progression challenging. EEG has emerged as a valuable tool for studying schizophrenia, with machine learning increasingly applied for diagnosis. This paper assesses the accuracy of ML models for predicting schizophrenia and examines the impact of stress during EEG recording on model performance. We integrate acute stress prediction into the analysis, showing that overlapping conditions like stress during recording can negatively affect model accuracy. Methods: Four XGBoost models were built: one for stress prediction, two to classify schizophrenia (at rest and task), and a model to predict schizophrenia for both conditions. XAI techniques were applied to analyze results. Experiments tested the generalization of schizophrenia models using their datasets' healthy controls and independent health-screened controls. The stress model identified high-stress subjects, who were excluded from further analysis. A novel method was used to adjust EEG frequency band power to remove stress artifacts, improving predictive model performance. Results: Our results show that acute stress levels vary across EEG sessions, affecting model performance and accuracy. Generalization improved once these varying stress levels were considered and compensated for during model training. Our findings highlight the importance of thorough health screening and management of the patient's condition during the process. Stress induced during or by the EEG recording can adversely affect model generalization. This may require further preprocessing of data by treating stress as an additional physiological artifact. Our proposed approach to compensate for stress artifacts in EEG data used for training models showed a significant improvement in predictive performance.
Fast nonparametric feature selection with error control using integrated path stability selection
Melikechi, Omar, Dunson, David B., Miller, Jeffrey W.
Feature selection can greatly improve performance and interpretability in machine learning problems. However, existing nonparametric feature selection methods either lack theoretical error control or fail to accurately control errors in practice. Many methods are also slow, especially in high dimensions. In this paper, we introduce a general feature selection method that applies integrated path stability selection to thresholding to control false positives and the false discovery rate. The method also estimates q-values, which are better suited to high-dimensional data than p-values. We focus on two special cases of the general method based on gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive simulations with RNA sequencing data show that IPSSGB and IPSSRF have better error control, detect more true positives, and are faster than existing methods. We also use both methods to detect microRNAs and genes related to ovarian cancer, finding that they make better predictions with fewer features than other methods.
shapiq: Shapley Interactions for Machine Learning
Muschalik, Maximilian, Baniecki, Hubert, Fumagalli, Fabian, Kolpaczki, Patrick, Hammer, Barbara, Hüllermeier, Eyke
Originally rooted in game theory, the Shapley Value (SV) has recently become an important tool in machine learning research. Perhaps most notably, it is used for feature attribution and data valuation in explainable artificial intelligence. Shapley Interactions (SIs) naturally extend the SV and address its limitations by assigning joint contributions to groups of entities, which enhance understanding of black box machine learning models. Due to the exponential complexity of computing SVs and SIs, various methods have been proposed that exploit structural assumptions or yield probabilistic estimates given limited resources. In this work, we introduce shapiq, an open-source Python package that unifies state-of-the-art algorithms to efficiently compute SVs and any-order SIs in an application-agnostic framework. Moreover, it includes a benchmarking suite containing 11 machine learning applications of SIs with pre-computed games and ground-truth values to systematically assess computational performance across domains. For practitioners, shapiq is able to explain and visualize any-order feature interactions in predictions of models, including vision transformers, language models, as well as XGBoost and LightGBM with TreeSHAP-IQ. With shapiq, we extend shap beyond feature attributions and consolidate the application of SVs and SIs in machine learning that facilitates future research. The source code and documentation are available at https://github.com/mmschlk/shapiq.
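The exponential cost mentioned above is easy to see in a brute-force computation. The sketch below evaluates exact Shapley values by enumerating every coalition of a small cooperative game; it uses only the standard library and does not call shapiq. The toy game and all names are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_players, value):
    """Exact Shapley values by enumerating all coalitions (exponential cost).

    `value` maps a frozenset of player indices to the worth of that coalition.
    """
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability that player i joins right after a coalition of this size.
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_game(coalition):
    """Additive effects 1, 2, 3 plus a bonus of 4 when players 0 and 1 cooperate."""
    total = 0.0
    if 0 in coalition:
        total += 1.0
    if 1 in coalition:
        total += 2.0
    if 2 in coalition:
        total += 3.0
    if 0 in coalition and 1 in coalition:
        total += 4.0
    return total


phi = shapley_values(3, toy_game)
print(phi, sum(phi))
```

In this toy game the joint bonus of players 0 and 1 is split equally between their individual values, so the attribution alone cannot reveal that it arose from an interaction; that gap is what Shapley Interactions are meant to address.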