Ensemble Learning
XGBOOST -- IN A NUTSHELL
XGBoost stands for "Extreme Gradient Boosting". It is a decision tree-based algorithm which is used in Machine Learning. XGBoost makes use of the gradient boosting framework. Decision Trees are structures that consists of a set of leaf nodes, branches as well as internal nodes. Each leaf node represents a Class Label. The internal node represents the attributes while the branches connect the leaves to these internal nodes.
An overview of active learning methods for insurance with fairness appreciation
Elie, Romuald, Hillairet, Caroline, Hu, François, Juillard, Marc
This paper addresses and solves some challenges in the adoption of machine learning in insurance with the democratization of model deployment. The first challenge is reducing the labelling effort (hence focusing on the data quality) with the help of active learning, a feedback loop between the model inference and an oracle: as in insurance the unlabeled data is usually abundant, active learning can become a significant asset in reducing the labelling cost. For that purpose, this paper sketches out various classical active learning methodologies before studying their empirical impact on both synthetic and real datasets. Another key challenge in insurance is the fairness issue in model inferences. We will introduce and integrate a post-processing fairness for multi-class tasks in this active learning framework to solve these two issues. Finally numerical experiments on unfair datasets highlight that the proposed setup presents a good compromise between model precision and fairness.
Identification of Twitter Bots Based on an Explainable Machine Learning Framework: The US 2020 Elections Case Study
Shevtsov, Alexander, Tzagkarakis, Christos, Antonakaki, Despoina, Ioannidis, Sotiris
Twitter is one of the most popular social networks attracting millions of users, while a considerable proportion of online discourse is captured. It provides a simple usage framework with short messages and an efficient application programming interface (API) enabling the research community to study and analyze several aspects of this social network. However, the Twitter usage simplicity can lead to malicious handling by various bots. The malicious handling phenomenon expands in online discourse, especially during the electoral periods, where except the legitimate bots used for dissemination and communication purposes, the goal is to manipulate the public opinion and the electorate towards a certain direction, specific ideology, or political party. This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data. To this end, a supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm, where the hyper-parameters are tuned via cross-validation. Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions by calculating feature importance, using the game theoretic-based Shapley values. Experimental evaluation on distinct Twitter datasets demonstrate the superiority of our approach, in terms of bot detection accuracy, when compared against a recent state-of-the-art Twitter bot detection method.
Machine Learning-based Prediction of Porosity for Concrete Containing Supplementary Cementitious Materials
Porosity has been identified as the key indicator of the durability properties of concrete exposed to aggressive environments. This paper applies ensemble learning to predict porosity of high-performance concrete containing supplementary cementitious materials. The concrete samples utilized in this study are characterized by eight composition features including w/b ratio, binder content, fly ash, GGBS, superplasticizer, coarse/fine aggregate ratio, curing condition and curing days. The assembled database consists of 240 data records, featuring 74 unique concrete mixture designs. The proposed machine learning algorithms are trained on 180 observations (75%) chosen randomly from the data set and then tested on the remaining 60 observations (25%). The numerical experiments suggest that the regression tree ensembles can accurately predict the porosity of concrete from its mixture compositions. Gradient boosting trees generally outperforms random forests in terms of prediction accuracy. For random forests, the out-of-bag error based hyperparameter tuning strategy is found to be much more efficient than k-Fold Cross-Validation.
Artificial Intelligence and Design of Experiments for Assessing Security of Electricity Supply: A Review and Strategic Outlook
Priesmann, Jan, Münch, Justin, Ridha, Elias, Spiegel, Thomas, Reich, Marius, Adam, Mario, Nolting, Lars, Praktiknjo, Aaron
Assessing the effects of the energy transition and liberalization of energy markets on resource adequacy is an increasingly important and demanding task. The rising complexity in energy systems requires adequate methods for energy system modeling leading to increased computational requirements. Furthermore, with complexity, uncertainty increases likewise calling for probabilistic assessments and scenario analyses. To adequately and efficiently address these various requirements, new methods from the field of data science are needed to accelerate current methods. With our systematic literature review, we want to close the gap between the three disciplines (1) assessment of security of electricity supply, (2) artificial intelligence, and (3) design of experiments. For this, we conduct a large-scale quantitative review on selected fields of application and methods and make a synthesis that relates the different disciplines to each other. Among other findings, we identify metamodeling of complex security of electricity supply models using AI methods and applications of AI-based methods for forecasts of storage dispatch and (non-)availabilities as promising fields of application that have not sufficiently been covered, yet. We end with deriving a new methodological pipeline for adequately and efficiently addressing the present and upcoming challenges in the assessment of security of electricity supply.
Confidence intervals for the random forest generalization error
How confident can we be in the generalization capacity of a predictive model? Of the many devices discussed in the statistical learning literature [1, 2, 3], a simple random split of the original data into training and test sets, and methods of folded cross-validation, stand out as the most common tools used to tackle the generalization issue. Availability of point estimates for the generalization error given by these procedures naturally raises the question of how to quantify the uncertainty involved in these estimates spending a manageable computational cost. Random forests [4] elegantly provide an alternative low cost (almost free) point estimate of the generalization error without requiring splittings of the data, and avoiding the computational burden of retraining the predictive model several times. The bagging mechanism [5] used to construct the ensemble of trees implies that each training data point is not used (stays "out-of-bag") when growing approximately 36.8% of the trees in the forest. This property gives us the so called out-of-bag estimate of the random forest generalization error: for each observation, using a suitable loss function, we compute the predictive error made by the random subforest whose trees didn't include the observation under consideration in its training process; the out-of-bag estimate is the average of these prediction errors over the whole training sample. 1
Efficient Batch Homomorphic Encryption for Vertically Federated XGBoost
More and more orgainizations and institutions make efforts on using external data to improve the performance of AI services. To address the data privacy and security concerns, federated learning has attracted increasing attention from both academia and industry to securely construct AI models across multiple isolated data providers. In this paper, we studied the efficiency problem of adapting widely used XGBoost model in real-world applications to vertical federated learning setting. State-of-the-art vertical federated XGBoost frameworks requires large number of encryption operations and ciphertext transmissions, which makes the model training much less efficient than training XGBoost models locally. To bridge this gap, we proposed a novel batch homomorphic encryption method to cut the cost of encryption-related computation and transmission in nearly half. This is achieved by encoding the first-order derivative and the second-order derivative into a single number for encryption, ciphertext transmission, and homomorphic addition operations. The sum of multiple first-order derivatives and second-order derivatives can be simultaneously decoded from the sum of encoded values. We are motivated by the batch idea in the work of BatchCrypt for horizontal federated learning, and design a novel batch method to address the limitations of allowing quite few number of negative numbers. The encode procedure of the proposed batch method consists of four steps, including shifting, truncating, quantizing and batching, while the decoding procedure consists of de-quantization and shifting back. The advantages of our method are demonstrated through theoretical analysis and extensive numerical experiments.
A Comparative Analysis of Machine Learning Techniques for IoT Intrusion Detection
Vitorino, João, Andrade, Rui, Praça, Isabel, Sousa, Orlando, Maia, Eva
The digital transformation faces tremendous security challenges. In particular, the growing number of cyber-attacks targeting Internet of Things (IoT) systems restates the need for a reliable detection of malicious network activity. This paper presents a comparative analysis of supervised, unsupervised and reinforcement learning techniques on nine malware captures of the IoT-23 dataset, considering both binary and multi-class classification scenarios. The developed models consisted of Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Isolation Forest (iForest), Local Outlier Factor (LOF) and a Deep Reinforcement Learning (DRL) model based on a Double Deep Q-Network (DDQN), adapted to the intrusion detection context. The most reliable performance was achieved by LightGBM. Nonetheless, iForest displayed good anomaly detection results and the DRL model demonstrated the possible benefits of employing this methodology to continuously improve the detection. Overall, the obtained results indicate that the analyzed techniques are well suited for IoT intrusion detection.
Efficient Batch Homomorphic Encryption for Vertically Federated XGBoost
Xu, Wuxing, Fan, Hao, Li, Kaixin, Yang, Kai
More and more orgainizations and institutions make efforts on using external data to improve the performance of AI services. To address the data privacy and security concerns, federated learning has attracted increasing attention from both academia and industry to securely construct AI models across multiple isolated data providers. In this paper, we studied the efficiency problem of adapting widely used XGBoost model in real-world applications to vertical federated learning setting. State-of-the-art vertical federated XGBoost frameworks requires large number of encryption operations and ciphertext transmissions, which makes the model training much less efficient than training XGBoost models locally. To bridge this gap, we proposed a novel batch homomorphic encryption method to cut the cost of encryption-related computation and transmission in nearly half. This is achieved by encoding the first-order derivative and the second-order derivative into a single number for encryption, ciphertext transmission, and homomorphic addition operations. The sum of multiple first-order derivatives and second-order derivatives can be simultaneously decoded from the sum of encoded values. We are motivated by the batch idea in the work of BatchCrypt for horizontal federated learning, and design a novel batch method to address the limitations of allowing quite few number of negative numbers. The encode procedure of the proposed batch method consists of four steps, including shifting, truncating, quantizing and batching, while the decoding procedure consists of de-quantization and shifting back. The advantages of our method are demonstrated through theoretical analysis and extensive numerical experiments.
Regression in Python using Sklearn, XGBoost and PySpark
In the above story, we have used a Fitbit dataset. Based on the EDA, it was found that steps taken and calories are somewhat linearly correlated and together they may be indicative of a lower risk for all-cause mortality. More interestingly, among our data there is one dataset which has not been used yet which is a weight and BMI log. These data have a distinct nature since they are not necessarily machine generated, thereafter they serve the purpose of being'labels'. In simple words, users are collecting data regarding their activity using their Fitbit, and once in a while, they log some body information such as weight, fat and BMI. This creates an optimal scenario for a supervised learning problem, where, for example, we could use the Fitbit activity data to predict the BMI of a user.