Ensemble Learning
Condensed Gradient Boosting
Emami, Seyedsaman, Martínez-Muñoz, Gonzalo
This paper presents a computationally efficient variant of gradient boosting for multi-class classification and multi-output regression tasks. Standard gradient boosting uses a 1-vs-all strategy for classification tasks with more than two classes, which means that one tree per class has to be trained at every iteration. In this work, we propose the use of multi-output regressors as base models to handle the multi-class problem as a single task. In addition, the proposed modification allows the model to learn multi-output regression problems. An extensive comparison with other multi-output based gradient boosting methods is carried out in terms of generalization and computational efficiency. The proposed method showed the best trade-off between generalization ability and training and prediction speeds.
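The idea of training one multi-output base model per iteration, instead of one tree per class, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a softmax cross-entropy loss, depth-1 trees, and a fixed learning rate, and all function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(rows, n_outputs):
    if not rows:
        return [0.0] * n_outputs
    return [sum(r[k] for r in rows) / len(rows) for k in range(n_outputs)]

def fit_multi_output_stump(X, R):
    """Fit one depth-1 tree whose leaves hold a vector (one value per class)."""
    n_outputs = len(R[0])
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [r for x, r in zip(X, R) if x[j] <= t]
            right = [r for x, r in zip(X, R) if x[j] > t]
            lv, rv = mean_vector(left, n_outputs), mean_vector(right, n_outputs)
            # Squared error summed over all outputs at once.
            err = sum((r[k] - lv[k]) ** 2 for r in left for k in range(n_outputs))
            err += sum((r[k] - rv[k]) ** 2 for r in right for k in range(n_outputs))
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    _, j, t, lv, rv = best
    return lambda x: lv if x[j] <= t else rv

def fit_boosting(X, y, n_classes, n_rounds=20, lr=0.5):
    F = [[0.0] * n_classes for _ in X]
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the cross-entropy loss: one-hot label minus softmax.
        R = []
        for f, label in zip(F, y):
            p = softmax(f)
            R.append([(1.0 if k == label else 0.0) - p[k] for k in range(n_classes)])
        stump = fit_multi_output_stump(X, R)  # a single tree covers every class
        stumps.append(stump)
        for i, x in enumerate(X):
            out = stump(x)
            F[i] = [F[i][k] + lr * out[k] for k in range(n_classes)]
    return stumps

def predict(stumps, x, n_classes, lr=0.5):
    scores = [0.0] * n_classes
    for stump in stumps:
        out = stump(x)
        scores = [scores[k] + lr * out[k] for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: scores[k])

# Toy one-feature data with three classes (made up for illustration).
X = [[0.0], [0.2], [0.4], [1.0], [1.2], [1.4], [2.0], [2.2], [2.4]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
model = fit_boosting(X, y, n_classes=3)
predictions = [predict(model, x, n_classes=3) for x in X]
```

With three classes and 20 rounds, a 1-vs-all scheme would train 60 trees, whereas this sketch trains 20.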
OOG- Optuna Optimized GAN Sampling Technique for Tabular Imbalanced Malware Data
Tonmoy, S. M Towhidul Islam, Zaman, S. M Mehedi
Cyberspace occupies a large portion of people's lives in the age of modern technology, and while there are those who utilize it for good, there are also those who do not. Malware is an application that was not built with a benign goal; it can harm, steal, or even alter personal information and secure applications and software. There are numerous techniques to defend against malware, one of which is to generate malware samples so that the system can be kept up to date with the growing volume of malware and recognize it when it attempts to enter. The Generative Adversarial Network (GAN) sampling technique has been used in this study to generate new malware samples. GANs have multiple variants, and in order to determine which variant is optimal for a given dataset sample, their parameters must be tuned. This study employs Optuna, an automatic hyperparameter tuning framework, to determine the optimal settings for the dataset under consideration. In this study, the architecture of the Optuna Optimized GAN (OOG) method is shown, along with scores of 98.06%, 99.00%, 97.23%, and 98.04% for accuracy, precision, recall, and F1 score, respectively. To obtain this result, the methodology additionally tunes the hyperparameters of five supervised boosting algorithms (XGBoost, LightGBM, CatBoost, Extra Trees Classifier, and Gradient Boosting Classifier) and combines them with a weighted ensemble technique. In addition to comparing existing efforts in this domain, the study demonstrates how promising GAN is in comparison to other sampling techniques such as SMOTE.
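A weighted ensemble can be sketched as a weighted average of the class probabilities produced by each model. The abstract does not say how the weights are chosen or whether probabilities or hard votes are combined, so the soft-voting form, the weights, and the probabilities below are all illustrative assumptions.

```python
def weighted_soft_vote(prob_lists, weights):
    """Combine per-model class probabilities using a weighted average."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, prob_lists)) / total
        for k in range(n_classes)
    ]

# Hypothetical outputs [P(benign), P(malware)] from three tuned models.
model_probs = [[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
# Hypothetical weights, e.g. proportional to each model's validation score.
weights = [2.0, 1.0, 3.0]

combined = weighted_soft_vote(model_probs, weights)
label = max(range(len(combined)), key=lambda k: combined[k])
```

Models with larger weights pull the combined probability toward their own prediction, so a strong model can outvote two weaker ones.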
SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems
Iosipoi, Leonid, Vakhrushev, Anton
Gradient Boosted Decision Tree (GBDT) is a widely-used machine learning algorithm that has been shown to achieve state-of-the-art results on many standard data science problems. We are interested in its application to multioutput problems when the output is highly multidimensional. Although there are highly effective GBDT implementations, their scalability to such problems is still unsatisfactory. In this paper, we propose novel methods aiming to accelerate the training process of GBDT in the multioutput scenario. The idea behind these methods lies in the approximate computation of a scoring function used to find the best split of decision trees. These methods are implemented in SketchBoost, which itself is integrated into our easily customizable Python-based GPU implementation of GBDT called Py-Boost. Our numerical study demonstrates that SketchBoost speeds up the training process of GBDT by up to over 40 times while achieving comparable or even better performance.
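To make the idea of approximating the split scoring function concrete, the sketch below scores splits on a reduced gradient matrix that keeps only the output columns with the largest norm. The abstract does not describe the approximation itself, so this column-selection rule, the unit-Hessian scoring function, and all names are assumptions for illustration rather than the SketchBoost implementation.

```python
def column_norms(G):
    n_outputs = len(G[0])
    return [sum(row[k] ** 2 for row in G) for k in range(n_outputs)]

def top_outputs_sketch(G, k):
    """Keep the k gradient columns with the largest squared norm."""
    norms = column_norms(G)
    keep = sorted(range(len(norms)), key=lambda j: norms[j], reverse=True)[:k]
    keep.sort()
    return [[row[j] for j in keep] for row in G]

def split_score(G, left_mask, reg=1.0):
    """Sum over outputs and leaves of (gradient sum)^2 / (leaf size + reg)."""
    score = 0.0
    for side in (True, False):
        rows = [g for g, m in zip(G, left_mask) if m == side]
        for k in range(len(G[0])):
            s = sum(r[k] for r in rows)
            score += s * s / (len(rows) + reg)
    return score

def best_split(G, reg=1.0):
    """Rows are assumed sorted by feature value; return the best cut position."""
    n = len(G)
    return max(
        range(1, n),
        key=lambda i: split_score(G, [r < i for r in range(n)], reg),
    )

# Toy gradient matrix: 4 samples, 4 outputs, two of which carry most of the signal.
G = [
    [2.0, 0.1, -1.5, 0.0],
    [1.8, -0.1, -1.7, 0.1],
    [-2.1, 0.0, 1.6, -0.1],
    [-1.9, 0.1, 1.4, 0.0],
]
sketch = top_outputs_sketch(G, 2)
full_choice = best_split(G)
sketch_choice = best_split(sketch)
```

On this toy matrix the reduced scoring function picks the same split as the full one while summing over half as many outputs, which is where the savings would come from when the output dimension is large.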
High-Order Optimization of Gradient Boosted Decision Trees
Pachebat, Jean, Ivanov, Sergei
Gradient Boosted Decision Trees (GBDTs) are dominant machine learning algorithms for modeling discrete or tabular data. Unlike neural networks with millions of trainable parameters, GBDTs optimize the loss function in an additive manner and have a single trainable parameter per leaf, which makes it easy to apply high-order optimization of the loss function. In this paper, we introduce high-order optimization for GBDTs based on numerical optimization theory, which allows us to construct trees based on high-order derivatives of a given loss function. In the experiments, we show that high-order optimization has faster per-iteration convergence, which leads to reduced running time. Our solution can be easily parallelized and run on GPUs with little overhead on the code. Finally, we discuss potential future improvements such as automatic differentiation of arbitrary loss functions and the combination of GBDTs with neural networks.
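Because each leaf has a single parameter, a leaf update reduces to a one-dimensional root-finding problem on the summed derivatives. The sketch below compares the usual second-order (Newton) step with a third-order step from Halley's method on a logistic loss. Halley's method is used here only as a standard example of a higher-order update; the abstract does not state which update rule the authors use, and the leaf data are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_derivative_sums(scores, labels):
    """Sum first, second and third derivatives of the log loss over a leaf."""
    G = H = T = 0.0
    for f, y in zip(scores, labels):
        p = sigmoid(f)
        G += p - y                        # first derivative
        H += p * (1 - p)                  # second derivative
        T += p * (1 - p) * (1 - 2 * p)    # third derivative
    return G, H, T

def newton_leaf(G, H):
    """Second-order leaf value: one Newton step on the summed gradient."""
    return -G / H

def halley_leaf(G, H, T):
    """Third-order leaf value: one Halley step on the summed gradient."""
    return -2 * G * H / (2 * H * H - G * T)

# Toy leaf: four samples with current score 0.5, three positives and one negative.
scores = [0.5, 0.5, 0.5, 0.5]
labels = [1, 1, 1, 0]
G, H, T = leaf_derivative_sums(scores, labels)
w_newton = newton_leaf(G, H)
w_halley = halley_leaf(G, H, T)
# The exact minimizer puts the leaf probability at 3/4, i.e. score log(3).
w_exact = math.log(3) - 0.5
```

In this example the third-order step lands closer to the exact leaf value than the Newton step, which is the kind of faster per-iteration convergence the abstract refers to; a single step is not guaranteed to be better when the starting score is far from the optimum.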
Explaining Random Forests using Bipolar Argumentation and Markov Networks (Technical Report)
Potyka, Nico, Yin, Xiang, Toni, Francesca
Random forests (RFs) [Bre01] are machine learning models with various applications in areas like e-commerce, finance and medicine. They consist of multiple decision trees that use different subsets of the available features. Given an input, every tree makes an individual decision and the output of the random forest is obtained by a majority vote. They have a low risk of overfitting, support both classification and regression tasks, and come equipped with some feature importance measures [Bre01]. However, feature importance measures can be too simplistic as they can represent neither joint effects of features (e.g., multi-drug interactions) nor non-monotonicity (e.g., increasing the weight may be healthy for an underweight person, but not for an overweight person). In recent years, a variety of other explanation methods have been proposed. Model-agnostic feature importance measures like LIME [RSG16], SHAP [LL17] and MAPLE [PMT18] have limitations similar to those of the feature importance measures defined for random forests.
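The majority-vote mechanism described above can be sketched in a few lines. The three hand-written "trees", their thresholds, and the feature names are hypothetical stand-ins for learned decision trees.

```python
from collections import Counter

# Each hypothetical tree looks at a different subset of the features.
def tree_a(x):
    return "approve" if x["income"] > 50 else "reject"

def tree_b(x):
    return "approve" if x["age"] > 25 and x["debt"] < 20 else "reject"

def tree_c(x):
    return "approve" if x["income"] > 40 and x["debt"] < 30 else "reject"

def forest_predict(trees, x):
    """Return the majority label together with the individual vote counts."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0], dict(votes)

applicant = {"income": 45, "age": 30, "debt": 10}
label, votes = forest_predict([tree_a, tree_b, tree_c], applicant)
```

Here one tree rejects and two approve, so the forest approves. A single importance number per feature cannot express that `income` matters differently depending on `debt`, which is the kind of joint effect the report points to.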
Machine Learning Methods for Anomaly Detection in Nuclear Power Plant Power Transformers
Katser, Iurii, Raspopov, Dmitriy, Kozitsin, Vyacheslav, Mezhov, Maxim
Power transformers are an important component of a nuclear power plant (NPP). Currently, NPPs operate many power transformers whose extended service life exceeds the designated 25 years. Due to the extension of the service life, the task of monitoring the technical condition of power transformers becomes urgent. An important method for monitoring power transformers is Chromatographic Analysis of Dissolved Gas. It is based on the principle of controlling the concentration of gases dissolved in transformer oil. The appearance of almost any type of defect in equipment is accompanied by the formation of gases that dissolve in oil, and specific types of defects generate their gases in different quantities. At present, at NPPs, the monitoring systems for transformer equipment use predefined control limits for the concentration of dissolved gases in the oil. This study describes the stages of developing an algorithm to detect defects and faults in transformers automatically using machine learning and data analysis methods. Among machine learning models, we trained Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, and Neural Networks. The best of them were then combined into an ensemble (StackingClassifier) showing an F1-score of 0.974 on a test sample. To develop mathematical models, we used data on the state of transformers, containing time series with values of gas concentrations (H2, CO, C2H4, C2H2). The datasets were labeled and contained four operating modes: normal mode, partial discharge, low energy discharge, low-temperature overheating.
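For a four-mode problem like this one, an F1-score is computed per class and then averaged. The sketch below assumes macro averaging, which the abstract does not specify, and uses made-up labels purely to show the arithmetic.

```python
def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

modes = ["normal", "partial_discharge", "low_energy_discharge", "low_temp_overheating"]

# Made-up ground truth and predictions, two samples per operating mode.
y_true = [
    "normal", "normal",
    "partial_discharge", "partial_discharge",
    "low_energy_discharge", "low_energy_discharge",
    "low_temp_overheating", "low_temp_overheating",
]
y_pred = [
    "normal", "normal",
    "partial_discharge", "low_energy_discharge",
    "low_energy_discharge", "low_energy_discharge",
    "low_temp_overheating", "low_temp_overheating",
]

per_class = f1_per_class(y_true, y_pred, modes)
macro_f1 = sum(per_class.values()) / len(per_class)
```

One partial discharge mistaken for a low energy discharge lowers the F1 of both classes, since it counts as a false negative for the first and a false positive for the second.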
A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System
This is the first in a series of posts that illustrate what our data team is up to, experimenting with, and building 'under the hood' at CitizenNet. He has been involved in web-scale machine learning and information retrieval for over 10 years. One of the first posts we published spoke at a high level of the technical problem CitizenNet is trying to solve. In essence, we are trying to predict what combinations of demographic and interest targets will be interested in some piece of content. On the CitizenNet platform, a user would create a project that would define (broadly) the target audience, the pieces of Facebook content they are looking to promote, and other campaign and financial information. Behind the scenes, a robust prediction system builds the targets for the project.
SleepMore: Inferring Sleep Duration at Scale via Multi-Device WiFi Sensing
Zakaria, Camellia, Yilmaz, Gizem, Mammen, Priyanka, Chee, Michael, Shenoy, Prashant, Balan, Rajesh
The availability of commercial wearable trackers equipped with features to monitor sleep duration and quality has enabled more useful sleep health monitoring applications and analyses. However, much research has reported the challenge of long-term user retention in sleep monitoring through these modalities. Since modern Internet users own multiple mobile devices, our work explores the possibility of employing ubiquitous mobile devices and passive WiFi sensing techniques to predict sleep duration as the fundamental measure for complementing long-term sleep monitoring initiatives. In this paper, we propose SleepMore, an accurate and easy-to-deploy sleep-tracking approach based on machine learning over the user's WiFi network activity. It first employs a semi-personalized random forest model with an infinitesimal jackknife variance estimation method to classify a user's network activity behavior into sleep and awake states at per-minute granularity. Through a moving average technique, the system uses these state sequences to estimate the user's nocturnal sleep period and its uncertainty rate. Uncertainty quantification enables SleepMore to overcome the impact of noisy WiFi data that can yield large prediction errors. We validate SleepMore using data from a month-long user study involving 46 college students and draw comparisons with the Oura Ring wearable. Beyond the college campus, we evaluate SleepMore on non-student users of different housing profiles. Our results demonstrate that SleepMore produces statistically indistinguishable sleep statistics from the Oura Ring baseline for predictions made within a 5% uncertainty rate. These errors range between 15-28 minutes for determining sleep time and 7-29 minutes for determining wake time, proving statistically significant improvements over prior work. Our in-depth analysis explains the sources of errors.
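The step from per-minute sleep/awake states to a sleep period can be sketched with a moving average followed by a search for the longest run above a threshold. The window size, the 0.5 threshold, and the toy state sequence are assumptions, and the uncertainty estimate described in the abstract is left out.

```python
def moving_average(values, window):
    """Centered moving average; the window is truncated at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def longest_sleep_period(states, window=5, threshold=0.5):
    """Return (start, end) minute indices of the longest smoothed asleep run.

    `end` is exclusive, so the period covers minutes start .. end - 1.
    """
    smoothed = moving_average(states, window)
    best = (0, 0)
    start = None
    for i, v in enumerate(smoothed + [0.0]):  # sentinel closes a trailing run
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

# Toy per-minute classifier output: 1 = asleep, 0 = awake, with isolated noise.
states = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0]
period = longest_sleep_period(states)
```

Smoothing absorbs the isolated awake reading at minute 8 and ignores the isolated asleep readings at minutes 2 and 15, so a single noisy classification does not split or extend the estimated sleep period.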
Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction
Yang, Grace, Cao, Ming, Jiang, Lavender Y., Liu, Xujin C., Cheung, Alexander T. M., Weiss, Hannah, Kurland, David, Cho, Kyunghyun, Oermann, Eric K.
Traditional evaluation metrics for classification in natural language processing such as accuracy and area under the curve fail to differentiate between models with different predictive behaviors despite their similar performance metrics. We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level to provide insights into disparities in their decision-making logic. We assess the sensitivity score on a set of representative words in the test set using two classifiers trained for hospital readmission classification with similar performance statistics. Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores. The results indicate that the language model's sensitivity score aligns better with the professionals than the XGBoost classifier on TF-IDF embeddings, which suggests that XGBoost uses some spurious features. Overall, this metric offers a novel perspective on assessing models' robustness by quantifying their discrepancy with professional opinions. Our code is available on GitHub (https://github.com/nyuolab/Model_Sensitivity).
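Comparing clinicians and classifiers by rank correlation can be sketched with Spearman's coefficient, computed as the Pearson correlation of the ranks. The abstract does not say which rank correlation is used or how sensitivity scores are computed, so Spearman's coefficient and the per-word scores below are illustrative assumptions, not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical sensitivity scores for the same five words.
clinician = [0.9, 0.7, 0.4, 0.2, 0.1]
model_a = [0.8, 0.6, 0.5, 0.1, 0.3]
model_b = [0.2, 0.9, 0.1, 0.8, 0.3]

rho_a = spearman(clinician, model_a)
rho_b = spearman(clinician, model_b)
```

A higher coefficient means the model orders the words by importance more like the clinicians do, even when two models have similar accuracy.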