Ensemble Learning
Optimizing Ensemble Weights and Hyperparameters of Machine Learning Models for Regression Problems
Shahhosseini, Mohsen, Hu, Guiping, Pham, Hieu
Aggregating multiple learners through an ensemble of models aims to make better predictions by capturing the underlying distribution more accurately. Different ensembling methods, such as bagging, boosting and stacking/blending, have been studied and adopted extensively in research and practice. While bagging and boosting intend to reduce variance and bias, respectively, blending approaches target both by finding the optimal way to combine base learners to find the best trade-off between bias and variance. In blending, ensembles are created from weighted averages of multiple base learners. In this study, a systematic approach is proposed to find the optimal weights to create these ensembles for bias-variance tradeoff using cross-validation for regression problems (Cross-validated Optimal Weighted Ensemble (COWE)). Furthermore, it is known that tuning hyperparameters of each base learner inside the ensemble weight optimization process can produce better performing ensembles. To this end, a nested algorithm based on bi-level optimization that considers tuning hyperparameters as well as finding the optimal weights to combine ensembles (Cross-validated Optimal Weighted Ensemble with Internally Tuned Hyperparameters (COWE-ITH)) was proposed. The algorithm is shown to be generalizable to real data sets though analyses with ten publicly available data sets. The prediction accuracies of COWE-ITH and COWE have been compared to base learners and the state-of-art ensemble methods. The results show that COWE-ITH outperforms other benchmarks as well as base learners in 9 out of 10 data sets.
Random Forests for Store Forecasting at Walmart Scale
The SMART Forecasting team at Walmart Labs is tasked with providing demand forecasts for over 70 million store-item combinations every week! For example, just how much of every type of ginger needs to go to every Walmart store in the U.S., every week for the next 52 weeks, with the goal of improving in stocks and reducing food waste. Our algorithm strategy was to build a suite of machine learning models and deploy them at scale to generate bespoke solutions for (oh so many!) store-item-week combinations. Random Forests would be part of this suite. We went through the traditional model development workflow of data discovery, identifying demand drivers, feature engineering, training, cross validation and testing.
Gradient Boosting Machine: A Survey
He, Zhiyuan, Lin, Danchen, Lau, Thomas, Wu, Mike
Proposed by Freund and Schapire ( 1997), boosting is a general issue of constructing an extremely accurate prediction with numerous roughly accurate pred ictions. Addressed by Friedman ( 2001, 2002) and Natekin and Knoll ( 2013), the Gradient Boosting Machines (GBM) seeks to build predictive models through back-fittings and no n-parametric regressions. Instead of building a single model, the GBM starts by generatin g an initial model and constantly fits new models through loss function minimization to prod uce the most precise model ( Natekin and Knoll, 2013). This survey concentrates on the mathematical derivations of the gradient boosting algorithms. In Section 2, we analyze the optimization methods for par ametric and nonparametric models. Section 3 covers the definitions of different typ es of loss functions. In Section 4, we present different types of boosting algorithms, while in Section 5, we explore the combination of boosting algorithms and ranking algorithms to ran k the real-world data.
Predicting Eating Events in Free Living Individuals -- A Technical Report
Wang, Jiayi, Yang, Jiue-An, Nakandala, Supun, Kumar, Arun, Jankowska, Marta M.
This technical report records the experiments of applying multiple machine learning algorithms for predicting eating and food purchasing behaviors of free-living individuals. Data was collected with accelerometer, global positioning system (GPS), and body-worn cameras called SenseCam over a one week period in 81 individuals from a variety of ages and demographic backgrounds. These data were turned into minute-level features from sensors as well as engineered features that included time (e.g., time since last eating) and environmental context (e.g., distance to nearest grocery store). Algorithms include Logistic Regression, RBF-SVM, Random Forest, and Gradient Boosting. Our results show that the Gradient Boosting model has the highest mean accuracy score (0.7289) for predicting eating events before 0 to 4 minutes. For predicting food purchasing events, the RBF-SVM model (0.7395) outperforms others. For both prediction models, temporal and spatial features were important contributors to predicting eating and food purchasing events.
Adaptive Ensemble of Classifiers with Regularization for Imbalanced Data Classification
Wang, Chen, Yu, Qin, Luo, Ruisen, Hui, Dafeng, Zhou, Kai, Yu, Yanmei, Sun, Chao, Gong, Xiaofeng
Dynamic ensembling of classifiers is an effective approach in processing label-imbalanced classifications. However, in dynamic ensemble methods, the combination of classifiers is usually determined by the local competence and conventional regularization methods are difficult to apply, leaving the technique prone to overfitting. In this paper, focusing on the binary label-imbalanced classification field, a novel method of Adaptive Ensemble of classifiers with Regularization (AER) has been proposed. The method deals with the overfitting problem from a perspective of implicit regularization. Specifically, it leverages the properties of Stochastic Gradient Descent (SGD) to obtain the solution with the minimum norm to achieve regularization, and interpolates ensemble weights via the global geometry of data to further prevent overfitting. The method enjoys a favorable time and memory complexity, and theoretical proofs show that algorithms implemented with AER paradigm have time and memory complexities upper-bounded by their original implementations. Furthermore, the proposed AER method is tested with a specific implementation based on Gradient Boosting Machine (XGBoost) on the three datasets: UCI Bioassay, KEEL Abalone19, and a set of GMM-sampled artificial dataset. Results show that the proposed AER algorithm can outperform the major existing algorithms based on multiple metrics, and Mcnemar's tests are applied to validate performance superiorities. To summarize, this work complements regularization for dynamic ensemble methods and develops an algorithm superior in grasping both the global and local geometry of data to alleviate overfitting in imbalanced data classification.
Gradient Boosting Survival Tree with Applications in Credit Scoring
Bai, Miaojun, Zheng, Yan, Shen, Yun
Credit scoring (Thomas et al., 2002) plays a vital role in the field of consumer finance. Survival analysis (Banasik et al., 1999) provides an advanced solution to the credit-scoring problem by quantifying the probability of survival time. In order to deal with highly heterogeneous industrial data collected in Chinese market of consumer finance, we propose a nonparametric ensemble tree model called gradient boosting survival tree (GBST) that extends the survival tree models (Gordon and Olshen, 1985; Ishwaran et al., 2008) with a gradient boosting algorithm (Friedman, 2001). The survival tree ensemble is learned by minimizing the negative log-likelihood in an additive manner. The proposed model optimizes the survival probability simultaneously for each time period, which can reduce the overall error significantly. Finally, as a test of the applicability, we apply the GBST model to quantify the credit risk with large-scale real market data. The results show that the GBST model outperforms the existing survival models measured by the concordance index (C-index), Kolmogorov-Smirnov (KS) index, as well as by the area under the receiver operating characteristic curve (AUC) of each time period.
A real-time iterative machine learning approach for temperature profile prediction in additive manufacturing processes
Paul, Arindam, Mozaffar, Mojtaba, Yang, Zijiang, Liao, Wei-keng, Choudhary, Alok, Cao, Jian, Agrawal, Ankit
--Additive Manufacturing (AM) is a manufacturing paradigm that builds three-dimensional objects from a computer-aided design model by successively adding material layer by layer . AM has become very popular in the past decade due to its utility for fast prototyping such as 3D printing as well as manufacturing functional parts with complex geometries using processes such as laser metal deposition that would be difficult to create using traditional machining. As the process for creating an intricate part for an expensive metal such as Titanium is prohibitive with respect to cost, computational models are used to simulate the behavior of AM processes before the experimental run. However, as the simulations are computationally costly and time-consuming for predicting multiscale multi-physics phenomena in AM, physics-informed data-driven machine-learning systems for predicting the behavior of AM processes are immensely beneficial. Such models accelerate not only multiscale simulation tools but also empower real-time control systems using in-situ data. In this paper, we design and develop essential components of a scientific framework for developing a data-driven model-based real-time control system. Finite element methods are employed for solving time-dependent heat equations and developing the database. The proposed framework uses extremely randomized trees - an ensemble of bagged decision trees as the regression algorithm iteratively using temperatures of prior voxels and laser information as inputs to predict temperatures of subsequent voxels. The models achieve mean absolute percentage errors below 1% for predicting temperature profiles for AM processes. Additive Manufacturing (AM) is a modern manufacturing approach in which digital 3D design data is used to build parts by sequentially depositing layers of materials [1]. AM techniques are becoming very popular compared to traditional approaches because of their success in building complicated designs, fast prototyping, and low-volume or one-of-a-kind productions across many industries. Direct Metal Deposition (DMD) [2] is an AM technology where various materials such as steel or Titanium are used to develop the finished product.
Comparison of Artificial Intelligence Techniques for Project Conceptual Cost Prediction
Developing a reliable parametric cost model at the conceptual stage of the project is crucial for projects managers and decision-makers. Existing methods, such as probabilistic and statistical algorithms have been developed for project cost prediction. However, these methods are unable to produce accurate results for conceptual cost prediction due to small and unstable data samples. Artificial intelligence (AI) and machine learning (ML) algorithms include numerous models and algorithms for supervised regression applications. Therefore, a comparison analysis for AI models is required to guide practitioners to the appropriate model. The study focuses on investigating twenty artificial intelligence (AI) techniques which are conducted for cost modeling such as fuzzy logic (FL) model, artificial neural networks (ANNs), multiple regression analysis (MRA), case-based reasoning (CBR), hybrid models, and ensemble methods such as scalable boosting trees (XGBoost). Field canals improvement projects (FCIPs) are used as an actual case study to analyze the performance of the applied ML models. Out of 20 AI techniques, the results showed that the most accurate and suitable method is XGBoost with 9.091% and 0.929 based on Mean Absolute Percentage Error (MAPE) and adjusted R2. Nonlinear adaptability, handling missing values and outliers, model interpretation and uncertainty have been discussed for the twenty developed AI models. Keywords: Artificial intelligence, Machine learning, ensemble methods, XGBoost, evolutionary fuzzy rules generation, Conceptual cost, and parametric cost model.
Imbalance-XGBoost: Leveraging Weighted and Focal Losses for Binary Label-Imbalanced Classification with XGBoost
Wang, Chen, Deng, Chengyuan, Wang, Suzhen
The paper presents Imbalance-XGBoost, a Python package that combines the powerful XGBoost software with weighted and focal losses to tackle binary label-imbalanced classification tasks. Though a small-scale program in terms of size, the package is, to the best of the authors' knowledge, the first of its kind which provides an integrated implementation for the two losses on XGBoost and brings a general-purpose extension on XGBoost for label-imbalanced scenarios. In this paper, the design and usage of the package are described with exemplar code listings, and its convenience to be integrated into Python-driven Machine Learning projects is illustrated. Furthermore, as the first- and second-order derivatives of the loss functions are essential for the implementations, the algebraic derivation is discussed and it can be deemed as a separate algorithmic contribution. The performances of the algorithms implemented in the package are empirically evaluated on Parkinson's disease classification data set, and multiple state-of-the-art performances have been observed. Given the scalable nature of XGBoost, the package has great potentials to be applied to real-life binary classification tasks, which are usually of large-scale and label-imbalanced.
Climate-driven statistical models as effective predictors of local dengue incidence in Costa Rica: A Generalized Additive Model and Random Forest approach
Vásquez, Paola, Loría, Antonio, Sanchez, Fabio, Barboza, Luis A.
Climate has been an important factor in shaping the distribution and incidence of dengue cases in tropical and subtropical countries. In Costa Rica, a tropical country with distinctive micro-climates, dengue has been endemic since its introduction in 1993, inflicting substantial economic, social, and public health repercussions. Using the number of dengue reported cases and climate data from 2007-2017, we fitted a prediction model applying a Generalized Additive Model (GAM) and Random Forest (RF) approach, which allowed us to retrospectively predict dengue occurrence in five climatological diverse municipalities around the country.