Ensemble Learning
Machine Learning-Enabled IoT Security: Open Issues and Challenges Under Advanced Persistent Threats
Chen, Zhiyan, Liu, Jinxin, Shen, Yu, Simsek, Murat, Kantarci, Burak, Mouftah, Hussein T., Djukic, Petar
Despite its technological benefits, Internet of Things (IoT) has cyber weaknesses due to the vulnerabilities in the wireless medium. Machine learning (ML)-based methods are widely used against cyber threats in IoT networks with promising performance. Advanced persistent threat (APT) is prominent for cybercriminals to compromise networks, and it is crucial to long-term and harmful characteristics. However, it is difficult to apply ML-based approaches to identify APT attacks to obtain a promising detection performance due to an extremely small percentage among normal traffic. There are limited surveys to fully investigate APT attacks in IoT networks due to the lack of public datasets with all types of APT attacks. It is worth to bridge the state-of-the-art in network attack detection with APT attack detection in a comprehensive review article. This survey article reviews the security challenges in IoT networks and presents the well-known attacks, APT attacks, and threat models in IoT systems. Meanwhile, signature-based, anomaly-based, and hybrid intrusion detection systems are summarized for IoT networks. The article highlights statistical insights regarding frequently applied ML-based methods against network intrusion alongside the number of attacks types detected. Finally, open issues and challenges for common network intrusion and APT attacks are presented for future research.
Distributional Gradient Boosting Machines
März, Alexander, Kneib, Thomas
We present a unified probabilistic gradient boosting framework for regression tasks that models and predicts the entire conditional distribution of a univariate response variable as a function of covariates. Our likelihood-based approach allows us to either model all conditional moments of a parametric distribution, or to approximate the conditional cumulative distribution function via Normalizing Flows. As underlying computational backbones, our framework is based on XGBoost and LightGBM. Modelling and predicting the entire conditional distribution greatly enhances existing tree-based gradient boosting implementations, as it allows to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. Empirical results show that our framework achieves state-of-the-art forecast accuracy.
Ensembles in Machine Learning
Ensemble methods are well established as an algorithmic cornerstone in machine learning (ML). Just as in real life, in ML a committee of experts will often perform better than an individual provided appropriate care is taken in constituting the committee. Since the earliest days of ML research, a variety of ensemble strategies have been developed with random forests and gradient boosting emerging as leading-edge methods in classification today. It has been recognised since the early days of ML research that ensembles of classifiers can be more accurate than individual models. In ML, ensembles are effectively committees that aggregate the predictions of individual classifiers. They are effective for very much the same reasons a committee of experts works in human decision making, they can bring different expertise to bear and the averaging effect can reduce errors. This article presents a tutorial on the main ensemble methods in use in ML with links to Python notebooks and datasets illustrating these methods in action. The objective is to help practitioners get started with ML ensembles and to provide an insight into when and why ensembles are effective. There have been a lot of developments since then and the ensemble idea is still to the forefront in ML applications. For example, random forests [2] and gradient boosting [7] would be considered among the most powerful methods available to ML practitioners today. The generic ensemble idea is presented in Figure 1. All ensembles are made up of a collection of base classifiers, also known as members or estimators.
GAM(L)A: An econometric model for interpretable Machine Learning
Flachaire, Emmanuel, Hacheme, Gilles, Hué, Sullivan, Laurent, Sébastien
Despite their high predictive performance, random forest and gradient boosting are often considered as black boxes or uninterpretable models which has raised concerns from practitioners and regulators. As an alternative, we propose in this paper to use partial linear models that are inherently interpretable. Specifically, this article introduces GAM-lasso (GAMLA) and GAM-autometrics (GAMA), denoted as GAM(L)A in short. GAM(L)A combines parametric and non-parametric functions to accurately capture linearities and non-linearities prevailing between dependent and explanatory variables, and a variable selection procedure to control for overfitting issues. Estimation relies on a two-step procedure building upon the double residual method. We illustrate the predictive performance and interpretability of GAM(L)A on a regression and a classification problem. The results show that GAM(L)A outperforms parametric models augmented by quadratic, cubic and interaction effects. Moreover, the results also suggest that the performance of GAM(L)A is not significantly different from that of random forest and gradient boosting.
Evaluating Local Model-Agnostic Explanations of Learning to Rank Models with Decision Paths
Rahnama, Amir Hossein Akhavan, Butepage, Judith
Local explanations of learning-to-rank (LTR) models are thought to extract the most important features that contribute to the ranking predicted by the LTR model for a single data point. Evaluating the accuracy of such explanations is challenging since the ground truth feature importance scores are not available for most modern LTR models. In this work, we propose a systematic evaluation technique for explanations of LTR models. Instead of using black-box models, such as neural networks, we propose to focus on tree-based LTR models, from which we can extract the ground truth feature importance scores using decision paths. Once extracted, we can directly compare the ground truth feature importance scores to the feature importance scores generated with explanation techniques. We compare two recently proposed explanation techniques for LTR models when using decision trees and gradient boosting models on the MQ2008 dataset. We show that the explanation accuracy in these techniques can largely vary depending on the explained model and even which data point is explained.
The Yield Curve as a Recession Leading Indicator. An Application for Gradient Boosting and Random Forest
Delgado, Pedro Cadahia, Congregado, Emilio, Golpe, Antonio A., Vides, José Carlos
Most representative decision tree ensemble methods have been used to examine the variable importance of Treasury term spreads to predict US economic recessions with a balance of generating rules for US economic recession detection. A strategy is proposed for training the classifiers with Treasury term spreads data and the results are compared in order to select the best model for interpretability. We also discuss the use of SHapley Additive exPlanations (SHAP) framework to understand US recession forecasts by analyzing feature importance. Consistently with the existing literature we find the most relevant Treasury term spreads for predicting US economic recession and a methodology for detecting relevant rules for economic recession detection. In this case, the most relevant term spread found is 3 month to 6 month, which is proposed to be monitored by economic authorities. Finally, the methodology detected rules with high lift on predicting economic recession that can be used by these entities for this propose. This latter result stands in contrast to a growing body of literature demonstrating that machine learning methods are useful for interpretation comparing many alternative algorithms and we discuss the interpretation for our result and propose further research lines aligned with this work.
Interpretable machine-learning model with a collaborative game approach to predict yields and higher heating value of torrefied biomass
Machine learning models developed to predict energy properties of torrefied biomass. Collaborative game theory adopted to aid interpretability of key variables in torrefaction. Gradient boosting offered the highest prediction accuracy with 22-feature input. Novel framework to explain local and global effects of each feature on torrefaction. Torrefaction is a treatment process for converting biomass to high-quality solid fuels.
Explainable Machine Learning for Predicting Homicide Clearance in the United States
Purpose: To explore the potential of Explainable Machine Learning in the prediction and detection of drivers of cleared homicides at the national- and state-levels in the United States. Methods: First, nine algorithmic approaches are compared to assess the best performance in predicting cleared homicides country-wise, using data from the Murder Accountability Project. The most accurate algorithm among all (XGBoost) is then used for predicting clearance outcomes state-wise. Second, SHAP, a framework for Explainable Artificial Intelligence, is employed to capture the most important features in explaining clearance patterns both at the national and state levels. Results: At the national level, XGBoost demonstrates to achieve the best performance overall. Substantial predictive variability is detected state-wise. In terms of explainability, SHAP highlights the relevance of several features in consistently predicting investigation outcomes. These include homicide circumstances, weapons, victims' sex and race, as well as number of involved offenders and victims. Conclusions: Explainable Machine Learning demonstrates to be a helpful framework for predicting homicide clearance. SHAP outcomes suggest a more organic integration of the two theoretical perspectives emerged in the literature. Furthermore, jurisdictional heterogeneity highlights the importance of developing ad hoc state-level strategies to improve police performance in clearing homicides.