Ensemble Learning
Types of Machine Learning Algorithms and their use - Volrum
GBM is a boosting algorithm used when we have plenty of data and want high predictive power. Boosting is an ensemble technique that combines the predictions of several base estimators to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. Boosting algorithms often perform well in data science competitions such as Kaggle, AV Hackathon, and CrowdAnalytix. Another classic gradient boosting algorithm that's known to be the decisive choice between winning and losing in some Kaggle competitions.
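The idea of combining weak predictors into a strong one can be sketched in a few lines. The following is a minimal illustration, not the implementation of any particular GBM library: it assumes squared-error loss (so each round fits the current residuals), a single numeric feature, and decision stumps as the weak learners. The function names and the toy data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the one-split stump on a 1-D feature that minimizes squared error."""
    best = None
    # Exclude the largest x so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mse(model, xs, ys):
    return mean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data: a curved target that one stump cannot fit.
xs = list(range(10))
ys = [x * x for x in xs]

baseline_mse = mean((y - mean(ys)) ** 2 for y in ys)
mse_5_rounds = mse(fit_boosted_stumps(xs, ys, n_rounds=5), xs, ys)
mse_50_rounds = mse(fit_boosted_stumps(xs, ys, n_rounds=50), xs, ys)
```

Each stump on its own is a poor predictor; the training error shrinks as more shrunken stumps are added. Production libraries add regularization, deeper trees, and other loss functions on top of this basic loop.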
r/MachineLearning - [D] What happens when you pit an XGBoost model against a scorecard?
Anyone have any thoughts on when it's best to use ML vs. scorecards? This blog compares predicted probabilities vs. observed proportions at the feature/predictor level. The example finds that the XGBoost model consistently underestimates good credit risk across all bins of this predictor, while the risk scorecard shows less discrepancy between the estimated and observed outcomes.
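A comparison of predicted probabilities against observed proportions per bin of a predictor might look roughly like the sketch below. It is only an illustration of the kind of check described, not the blog's code: the feature, bin edge, predictions, and outcomes are all made-up values, and `calibration_by_bin` is a hypothetical helper.

```python
from collections import defaultdict


def calibration_by_bin(feature_values, predicted_probs, outcomes, bin_edges):
    """Return {bin index: (mean predicted prob, observed proportion, count)}."""
    totals = defaultdict(lambda: [0.0, 0, 0])  # [sum of predictions, positives, rows]
    for value, prob, outcome in zip(feature_values, predicted_probs, outcomes):
        # Bin index = number of edges the value has reached or passed.
        bin_index = sum(value >= edge for edge in bin_edges)
        totals[bin_index][0] += prob
        totals[bin_index][1] += outcome
        totals[bin_index][2] += 1
    return {
        index: (prob_sum / count, positives / count, count)
        for index, (prob_sum, positives, count) in sorted(totals.items())
    }


# Hypothetical data: applicant age, model probability of "good", actual outcome.
ages = [22, 25, 27, 29, 41, 45, 52, 60]
predicted_good = [0.5, 0.6, 0.5, 0.6, 0.7, 0.8, 0.7, 0.8]
actual_good = [1, 1, 0, 1, 1, 1, 1, 1]

report = calibration_by_bin(ages, predicted_good, actual_good, bin_edges=[40])
for index, (mean_pred, observed, count) in report.items():
    print(f"bin {index}: predicted={mean_pred:.2f} observed={observed:.2f} n={count}")
```

In this toy data the mean prediction sits below the observed proportion in both bins, which is the pattern of underestimation the post describes.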
Mixed-Integer Convex Nonlinear Optimization with Gradient-Boosted Trees Embedded
Mistry, Miten, Letsios, Dimitrios, Krennrich, Gerhard, Lee, Robert M., Misener, Ruth
Decision trees usefully represent sparse, high dimensional and noisy data. Having learned a function from this data, we may want to thereafter integrate the function into a larger decision-making problem, e.g., for picking the best chemical process catalyst. We study a large-scale, industrially-relevant mixed-integer nonlinear nonconvex optimization problem involving both gradient-boosted trees and penalty functions mitigating risk. This mixed-integer optimization problem with convex penalty terms broadly applies to optimizing pre-trained regression tree models. Decision makers may wish to optimize discrete models to repurpose legacy predictive models, or they may wish to optimize a discrete model that particularly well-represents a data set. We develop several heuristic methods to find feasible solutions, and an exact, branch-and-bound algorithm leveraging structural properties of the gradient-boosted trees and penalty functions. We computationally test our methods on a concrete mixture design instance and a chemical catalysis industrial instance.
Example of Random Forest application in Finance : Option Pricing
Let's assume we know how much a Tesla share will cost in 2W. Our 'only' unknown is the future option value (Y_T), given all the information we have at t = 2W. In other words, if you are two weeks in the future, what is the expected value of your portfolio, made of this one American option? You have information at 2W and you want to predict the option value at 1M. Beforehand, we need to simulate multiple scenarios for the Tesla share price. For model simplicity, we suppose the Tesla share price follows a geometric Brownian motion path with mean r (the risk-free rate) and volatility Sigma = 20% (we refer interested readers to stochastic processes theory).
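The scenario-simulation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's code: it uses the closed-form solution of geometric Brownian motion, S_T = S_0 * exp((r - Sigma^2/2) * T + Sigma * sqrt(T) * Z) with Z standard normal, and the share price at 2W (300.0), the risk-free rate (1%), and the horizon from 2W to 1M (taken as 2/52 years) are hypothetical values. Only Sigma = 20% comes from the text.

```python
import math
import random


def simulate_gbm_terminal_prices(start_price, rate, volatility, horizon_years,
                                 n_scenarios, seed=0):
    """Draw terminal prices from the exact geometric Brownian motion solution."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * volatility ** 2) * horizon_years
    diffusion = volatility * math.sqrt(horizon_years)
    return [
        start_price * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        for _ in range(n_scenarios)
    ]


# Hypothetical inputs, except the 20% volatility stated in the text.
price_at_2w = 300.0
risk_free_rate = 0.01
sigma = 0.20
horizon = 2 / 52  # assumed gap between 2W and 1M, in years

scenarios = simulate_gbm_terminal_prices(
    price_at_2w, risk_free_rate, sigma, horizon, n_scenarios=10_000
)
average_price = sum(scenarios) / len(scenarios)
```

The simulated prices would then serve as inputs for estimating the option value; pricing the American option itself needs more than this step.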
Monotonicity constraints in machine learning
In practical machine learning and data science tasks, an ML model is often used to quantify a global, semantically meaningful relationship between two or more values. For example, a hotel chain might want to use ML to optimize its pricing strategy and use a model to estimate the likelihood of a room being booked at a given price and day of the week. For a relationship like this, the assumption is that, all other things being equal, a cheaper price is preferred by a user, so demand is higher at a lower price. However, what might easily happen is that upon building the model, the data scientist discovers that the model is behaving unexpectedly: for example, the model predicts that on Tuesdays, the clients would rather pay $110 than $100 for a room! The reason is that while there is an expected monotonic relationship between price and the likelihood of booking, the model is unable to (fully) capture it, due to noise in the data and confounders in it.
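One simple way to spot the kind of unexpected behavior described above is to scan a model's predictions along the price axis and flag places where predicted demand goes up as price goes up. The sketch below only detects violations; it does not enforce a constraint. The helper name and the predicted probabilities are hypothetical.

```python
def find_monotonicity_violations(prices, predicted_demand, tolerance=0.0):
    """Return (lower_price, higher_price) pairs where predicted demand rises with price."""
    ordered = sorted(zip(prices, predicted_demand))
    violations = []
    # Compare each price point with the next more expensive one.
    for (low_price, low_pred), (high_price, high_pred) in zip(ordered, ordered[1:]):
        if high_price > low_price and high_pred > low_pred + tolerance:
            violations.append((low_price, high_price))
    return violations


# Hypothetical booking probabilities predicted for a Tuesday.
tuesday_prices = [90, 100, 110, 120]
tuesday_predictions = [0.62, 0.48, 0.53, 0.35]

violations = find_monotonicity_violations(tuesday_prices, tuesday_predictions)
print(violations)  # [(100, 110)]
```

Here the model rates a booking at $110 as more likely than at $100, which contradicts the assumed relationship. Several gradient boosting libraries let the user declare a monotonic constraint per feature at training time, which avoids the problem rather than just detecting it.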
Is rotation forest the best classifier for problems with continuous features?
Bagnall, A., Bostrom, A., Cawley, G., Flynn, M., Large, J., Lines, J.
Rotation forest is a tree-based ensemble that performs transforms on subsets of attributes prior to constructing each tree. We present an empirical comparison of classifiers for problems with only real-valued features. We evaluate classifiers from three families of algorithms: support vector machines; tree-based ensembles; and neural networks. We compare classifiers on unseen data based on the quality of the decision rule (using classification error), the ability to rank cases (area under the receiver operator curve), and the probability estimates (using negative log likelihood). We conclude that, in answer to the question posed in the title, yes, rotation forest is significantly more accurate on average than competing techniques when compared on three distinct sets of datasets. The same pattern of results is observed when tuning classifiers on the train data using a grid search. We investigate why rotation forest does so well by testing whether the characteristics of the data can be used to differentiate classifier performance. We assess the impact of the design features of rotation forest through an ablative study that transforms random forest into rotation forest. We identify the major limitation of rotation forest as its scalability, particularly in number of attributes. To overcome this problem we develop a model to predict the train time of the algorithm and hence propose a contract version of rotation forest where a run time cap is set a priori. We demonstrate that on large problems rotation forest can be made an order of magnitude faster without significant loss of accuracy and that there is no real benefit (on average) from tuning the ensemble. We conclude that without any domain knowledge to indicate an algorithm preference, rotation forest should be the default algorithm of choice for problems with continuous attributes.
Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms
Anghel, Andreea, Papandreou, Nikolaos, Parnell, Thomas, De Palma, Alessandro, Pozidis, Haralampos
Gradient boosted decision trees (GBDTs) have seen widespread adoption in academia, industry and competitive data science due to their state-of-the-art performance in a wide variety of machine learning tasks. In this paper, we present an extensive empirical comparison of XGBoost, LightGBM and CatBoost, three popular GBDT algorithms, to aid the data science practitioner in the choice from the multitude of available implementations. Specifically, we evaluate their behavior on four large-scale datasets with varying shapes, sparsities and learning tasks, in order to evaluate the algorithms' generalization performance, training times (on both CPU and GPU) and their sensitivity to hyper-parameter tuning. In our analysis, we first make use of a distributed grid-search to benchmark the algorithms on fixed configurations, and then employ a state-of-the-art algorithm for Bayesian hyper-parameter optimization to fine-tune the models. Many powerful techniques in machine learning involve constructing a strong learner from a number of weak learners. One such approach, known as bagging, combines the predictions of a large number of weak learners, each using a different bootstrap sample of the training data set [1]. It has been shown that such an approach can reduce variance and produce a strong learner. An alternative approach, known as boosting, involves iteratively training a sequence of weak learners, whereby the training examples for the next learner are weighted according to the success of the previously constructed learners.
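The bagging procedure described above (many learners, each trained on a different bootstrap sample, with predictions combined) can be sketched as follows. This is a minimal illustration and not code from the paper: the base learner is a 1-nearest-neighbor regressor, chosen here only because it is short and sensitive to the training sample, and the data and function names are hypothetical.

```python
import random
from statistics import mean


def fit_nearest_neighbor(sample):
    """Base learner: predict the target of the closest training point."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def fit_bagged_ensemble(data, n_learners=25, seed=0):
    """Train each learner on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        # Bootstrap: draw len(data) rows with replacement.
        bootstrap = [rng.choice(data) for _ in data]
        learners.append(fit_nearest_neighbor(bootstrap))
    return lambda x: mean(learner(x) for learner in learners)


# Hypothetical noisy data around the line y = 2x.
noise = random.Random(1)
data = [(float(x), 2.0 * x + noise.gauss(0.0, 1.0)) for x in range(10)]

single_learner = fit_nearest_neighbor(data)
bagged_learner = fit_bagged_ensemble(data, n_learners=25, seed=0)

single_prediction = single_learner(4.4)
bagged_prediction = bagged_learner(4.4)
```

The single learner returns the target of exactly one noisy training point, while the bagged prediction averages over whichever nearby points each bootstrap sample happened to contain, which is where the variance reduction comes from.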
Perturb and Combine to Identify Influential Spreaders in Real-World Networks
Tixier, Antoine J. -P., Rossi, Maria-Evgenia G., Malliaros, Fragkiskos D., Read, Jesse, Vazirgiannis, Michalis
Recent research has shown that graph degeneracy algorithms, which decompose a network into a hierarchy of nested subgraphs of decreasing size and increasing density, are very effective at detecting the good spreaders in a network. However, it is also known that degeneracy-based decompositions of a graph are unstable to small perturbations of the network structure. In Machine Learning, the performance of unstable classification and regression methods, such as fully-grown decision trees, can be greatly improved by using Perturb and Combine (P&C) strategies such as bagging (bootstrap aggregating). Therefore, we propose a P&C procedure for networks that (1) creates many perturbed versions of a given graph, (2) applies a node scoring function separately to each graph (such as a degeneracy-based one), and (3) combines the results. We conduct real-world experiments on the tasks of identifying influential spreaders in large social networks, and influential words (keywords) in small word co-occurrence networks. We use the k-core, generalized k-core, and PageRank algorithms as our vertex scoring functions. In each case, using the aggregated scores brings significant improvements compared to using the scores computed on the original graphs. Finally, a bias-variance analysis suggests that our P&C procedure works mainly by reducing bias, and that therefore, it should be capable of improving the performance of all vertex scoring functions, not only unstable ones.
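The three-step procedure in the abstract (perturb the graph, score the nodes of each perturbed graph, combine the scores) can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the perturbation model (drop each edge with a fixed probability, then add the same number of random edges), the use of the mean as the combiner, and the toy graph are all assumptions. The scoring function is the plain k-core number, one of the three mentioned in the abstract.

```python
import random


def core_numbers(nodes, edges):
    """k-core decomposition by repeatedly peeling a minimum-degree node."""
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {node: len(adjacent) for node, adjacent in neighbors.items()}
    remaining = set(nodes)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])  # core level never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for other in neighbors[node]:
            if other in remaining:
                degree[other] -= 1
    return core


def perturb_edges(nodes, edges, rng, removal_probability=0.1):
    """Assumed perturbation: drop edges at random, then add as many random ones."""
    kept = {edge for edge in sorted(edges) if rng.random() >= removal_probability}
    n_to_add = len(edges) - len(kept)
    while n_to_add > 0:
        u, v = rng.sample(nodes, 2)
        candidate = (min(u, v), max(u, v))
        if candidate not in kept:
            kept.add(candidate)
            n_to_add -= 1
    return kept


def perturb_and_combine(nodes, edges, n_rounds=50, seed=0):
    """Average the node scores over many perturbed copies of the graph."""
    rng = random.Random(seed)
    totals = {node: 0.0 for node in nodes}
    for _ in range(n_rounds):
        scores = core_numbers(nodes, perturb_edges(nodes, edges, rng))
        for node, score in scores.items():
            totals[node] += score
    return {node: total / n_rounds for node, total in totals.items()}


# Hypothetical toy graph: a 4-clique (nodes 0-3) with a short tail (3-4-5).
nodes = [0, 1, 2, 3, 4, 5]
edges = {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}

original_scores = core_numbers(nodes, edges)
aggregated_scores = perturb_and_combine(nodes, edges)
```

On the original graph the scores take only a few integer values, so many nodes tie; the aggregated scores are averages and therefore give a finer-grained ranking.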
A
Data science is an immensely powerful tool in our data-driven world. Call me idealistic, but I believe this tool should be used for more than getting people to click on ads or spend more time consumed by social media. Not only do we get to improve our data science skills in the most effective manner - through practice on real-world data - but we also get the reward of working on a problem with social benefits. The full code is available as a Jupyter Notebook both on Kaggle (where it can be run in the browser with no downloads required) and on GitHub. This is an active Kaggle competition and a great project to get started with machine learning or to work on some new skills. The Costa Rican Household Poverty Level Prediction challenge is a data science for good machine learning competition currently running on Kaggle.
Finding Dory in the Crowd: Detecting Social Interactions using Multi-Modal Mobile Sensing
Katevas, Kleomenis, Hänsel, Katrin, Clegg, Richard, Leontiadis, Ilias, Haddadi, Hamed, Tokarchuk, Laurissa
Remembering our day-to-day social interactions is challenging even if you aren't a blue, memory-challenged fish. The ability to automatically detect and remember these types of interactions is not only beneficial for individuals interested in their behavior in crowded situations, but also of interest to those who analyze crowd behavior. Currently, detecting social interactions is often performed using a variety of methods including ethnographic studies, computer vision techniques and manual annotation-based data analysis. However, mobile phones offer easier means for data collection that is easy to analyze and can preserve the user's privacy. In this work, we present a system for detecting stationary social interactions inside crowds, leveraging multi-modal mobile sensing data such as Bluetooth Smart (BLE), accelerometer and gyroscope. To inform the development of such a system, we conducted a study with 24 participants, where we asked them to socialize with each other for 45 minutes. We built a machine learning system based on gradient-boosted trees that predicts both 1:1 and group interactions with 77.8% precision and 86.5% recall, a 30.2% performance increase compared to a proximity-based approach. By utilizing a community-detection-based method, we further detected the various group formations that exist within the crowd. Using mobile phone sensors already carried by the majority of people in a crowd makes our approach particularly well suited to real-life analysis of crowd behavior and influence strategies.