Goto

Collaborating Authors

 Ensemble Learning


SnapBoost: A Heterogeneous Boosting Machine

arXiv.org Machine Learning

Modern gradient boosting software frameworks, such as XGBoost and LightGBM, implement Newton descent in a functional space. At each boosting iteration, their goal is to find the base hypothesis, selected from some base hypothesis class, that is closest to the Newton descent direction in a Euclidean sense. Typically, the base hypothesis class is fixed to be all binary decision trees up to a given depth. In this work, we study a Heterogeneous Newton Boosting Machine (HNBM) in which the base hypothesis class may vary across boosting iterations. Specifically, at each boosting iteration, the base hypothesis class is chosen, from a fixed set of subclasses, by sampling from a probability distribution. We derive a global linear convergence rate for the HNBM under certain assumptions, and show that it agrees with existing rates for Newton's method when the Newton direction can be perfectly fitted by the base hypothesis at each boosting iteration. We then describe a particular realization of a HNBM, SnapBoost, that, at each boosting iteration, randomly selects between either a decision tree of variable depth or a linear regressor with random Fourier features. We describe how SnapBoost is implemented, with a focus on the training complexity. Finally, we present experimental results, using OpenML and Kaggle datasets, that show that SnapBoost is able to achieve better generalization loss than competing boosting frameworks, without taking significantly longer to tune.


Boosting Algorithms for Delivery Time Prediction in Transportation Logistics

arXiv.org Machine Learning

Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operation and advanced information systems. A variety of solutions exist for short-term travel time predictions such as solutions that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. We show in this paper the applicability and usefulness of travel time i.e. delivery time prediction for postal services. We investigate several methods such as linear regression models and tree based ensembles such as random forest, bagging, and boosting, that allow to predict delivery time by conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and catboost, have a higher performance in terms of accuracy and runtime efficiency than other baselines such as linear regression models, bagging regressor and random forest.


A Sequential Modelling Approach for Indoor Temperature Prediction and Heating Control in Smart Buildings

arXiv.org Artificial Intelligence

The rising availability of large volume data has enabled a wide application of statistical Machine Learning (ML) algorithms in the domains of Cyber-Physical Systems (CPS), Internet of Things (IoT) and Smart Building Networks (SBN). This paper proposes a learning-based framework for sequentially applying the data-driven statistical methods to predict indoor temperature and yields an algorithm for controlling building heating system accordingly. This framework consists of a two-stage modelling effort: in the first stage, an univariate time series model (AR) was employed to predict ambient conditions; together with other control variables, they served as the input features for a second stage modelling where an multivariate ML model (XGBoost) was deployed. The models were trained with real world data from building sensor network measurements, and used to predict future temperature trajectories. Experimental results demonstrate the effectiveness of the modelling approach and control algorithm, and reveal the promising potential of the data-driven approach in smart building applications over traditional dynamics-based modelling methods. By making wise use of IoT sensory data and ML algorithms, this work contributes to efficient energy management and sustainability in smart buildings.


Phishing Detection Using Machine Learning Techniques

arXiv.org Artificial Intelligence

The Internet has become an indispensable part of our life, However, It also has provided opportunities to anonymously perform malicious activities like Phishing. Phishers try to deceive their victims by social engineering or creating mock-up websites to steal information such as account ID, username, password from individuals and organizations. Although many methods have been proposed to detect phishing websites, Phishers have evolved their methods to escape from these detection methods. One of the most successful methods for detecting these malicious activities is Machine Learning. This is because most Phishing attacks have some common characteristics which can be identified by machine learning methods. In this paper, we compared the results of multiple machine learning methods for predicting phishing websites.


PostgreSQL and Machine Learning

#artificialintelligence

I will show you how to apply Machine Learning algorithms on data from the PostgreSQL database to get insights and predictions. I will use an Automated Machine Learning (AutoML) supervised. It is an open-source python package. Thanks to AutoML I will get quick access to many ML algorithms: Decision Tree, Logistic Regression, Random Forest, Xgboost, Neural Network. The AutoML will handle feature engineering as well.


Ensemble Machine Learning in Python : Adaboost, XGBoost

#artificialintelligence

Let's say you want to take one of the very important decision in your life, it will be a choosing your career or choosing your life partner. Do you think that you can depend on a just one person advice. Advice from the one person can be highly biased also. The best way you can go ahead by asking and taking guidance from multiple people which reduce the bias. Same thing apply on machine learning world also while predicting some class or predicting any continuous value for regression problem, why you should rely on a one model only.


StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics

arXiv.org Machine Learning

In machine learning (ML), ensemble methods such as bagging, boosting, and stacking are widely-established approaches that regularly achieve top-notch predictive performance. Stacking (also called "stacked generalization") is an ensemble method that combines heterogeneous base models, arranged in at least one layer, and then employs another metamodel to summarize the predictions of those models. Although it may be a highly-effective approach for increasing the predictive performance of ML, generating a stack of models from scratch can be a cumbersome trial-and-error process. This challenge stems from the enormous space of available solutions, with different sets of data instances and features that could be used for training, several algorithms to choose from, and instantiations of these algorithms using diverse parameters (i.e., models) that perform differently according to various metrics. In this work, we present a knowledge generation model, which supports ensemble learning with the use of visualization, and a visual analytics system for stacked generalization. Our system, StackGenVis, assists users in dynamically adapting performance metrics, managing data instances, selecting the most important features for a given data set, choosing a set of top-performant and diverse algorithms, and measuring the predictive performance. In consequence, our proposed tool helps users to decide between distinct models and to reduce the complexity of the resulting stack by removing overpromising and underperforming models. The applicability and effectiveness of StackGenVis are demonstrated with two use cases: a real-world healthcare data set and a collection of data related to sentiment/stance detection in texts. Finally, the tool has been evaluated through interviews with three ML experts.


Kaggle forecasting competitions: An overlooked learning opportunity

arXiv.org Machine Learning

Competitions play an invaluable role in the field of forecasting, as exemplified through the recent M4 competition. The competition received attention from both academics and practitioners and sparked discussions around the representativeness of the data for business forecasting. Several competitions featuring real-life business forecasting tasks on the Kaggle platform has, however, been largely ignored by the academic community. We believe the learnings from these competitions have much to offer to the forecasting community and provide a review of the results from six Kaggle competitions. We find that most of the Kaggle datasets are characterized by higher intermittence and entropy than the M-competitions and that global ensemble models tend to outperform local single models. Furthermore, we find the strong performance of gradient boosted decision trees, increasing success of neural networks for forecasting, and a variety of techniques for adapting machine learning models to the forecasting task.


Black-box Mixed-Variable Optimisation using a Surrogate Model that Satisfies Integer Constraints

arXiv.org Machine Learning

A challenging problem in both engineering and computer science is that of minimising a function for which we have no mathematical formulation available, that is expensive to evaluate, and that contains continuous and integer variables, for example in automatic algorithm configuration. Surrogate-based algorithms are very suitable for this type of problem, but most existing techniques are designed with only continuous or only discrete variables in mind. Mixed-Variable ReLU-based Surrogate Modelling (MVRSM) is a surrogate-based algorithm that uses a linear combination of rectified linear units, defined in such a way that (local) optima satisfy the integer constraints. This method outperforms the state of the art on several synthetic benchmarks with up to 238 continuous and integer variables, and achieves competitive performance on two real-life benchmarks: XGBoost hyperparameter tuning and Electrostatic Precipitator optimisation.


Randomized Gradient Boosting Machine

arXiv.org Machine Learning

Gradient Boosting Machine (GBM) introduced by Friedman is a powerful supervised learning algorithm that is very widely used in practice---it routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In spite of the usefulness of GBM in practice, our current theoretical understanding of this method is rather limited. In this work, we propose Randomized Gradient Boosting Machine (RGBM) which leads to substantial computational gains compared to GBM, by using a randomization scheme to reduce search in the space of weak-learners. We derive novel computational guarantees for RGBM. We also provide a principled guideline towards better step-size selection in RGBM that does not require a line search. Our proposed framework is inspired by a special variant of coordinate descent that combines the benefits of randomized coordinate descent and greedy coordinate descent; and may be of independent interest as an optimization algorithm. As a special case, our results for RGBM lead to superior computational guarantees for GBM. Our computational guarantees depend upon a curious geometric quantity that we call Minimal Cosine Angle, which relates to the density of weak-learners in the prediction space. On a series of numerical experiments on real datasets, we demonstrate the effectiveness of RGBM over GBM in terms of obtaining a model with good training and/or testing data fidelity with a fraction of the computational cost.