 Ensemble Learning


A guide for using the Wavelet Transform in Machine Learning

#artificialintelligence

In a previous blog post we saw how Signal Processing techniques can be used to classify time-series and signals. A very short summary of that post: we can use the Fourier Transform to transform a signal from the time domain to the frequency domain. The peaks in the frequency spectrum indicate the frequencies that occur most often in the signal. The larger and sharper a peak is, the more prevalent that frequency is in the signal. The location (frequency value) and height (amplitude) of the peaks in the frequency spectrum can then be used as input for classifiers like Random Forest or Gradient Boosting.
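The peak-extraction step summarized above can be sketched as follows. This is a minimal illustration, not the original post's code: the function name `fft_peak_features`, the toy two-tone signal, and the simple "larger than both neighbours" peak rule are all assumptions made for the example.

```python
import numpy as np

def fft_peak_features(signal, sample_rate, n_peaks=3):
    """Return (frequencies, amplitudes) of the n_peaks strongest spectral peaks.

    Hypothetical helper for illustration; a peak is any bin that is
    larger than both of its neighbours.
    """
    n = len(signal)
    # One-sided amplitude spectrum (scaled so a unit sine gives amplitude 1).
    spectrum = np.abs(np.fft.rfft(signal)) * 2.0 / n
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)

    # Find local maxima among the interior bins.
    interior = np.arange(1, len(spectrum) - 1)
    is_peak = (spectrum[interior] > spectrum[interior - 1]) & (
        spectrum[interior] > spectrum[interior + 1]
    )
    peak_idx = interior[is_peak]

    # Keep the strongest peaks, then order them by frequency.
    strongest = peak_idx[np.argsort(spectrum[peak_idx])[::-1][:n_peaks]]
    strongest = np.sort(strongest)
    return freqs[strongest], spectrum[strongest]

# Toy signal: a 5 Hz tone (amplitude 1.0) plus a 12 Hz tone (amplitude 0.5).
sample_rate = 100.0
t = np.arange(200) / sample_rate
signal = 1.0 * np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

peak_freqs, peak_amps = fft_peak_features(signal, sample_rate, n_peaks=2)
# One feature vector per signal: peak locations followed by peak heights.
features = np.concatenate([peak_freqs, peak_amps])
```

Stacking one such `features` vector per signal gives a feature matrix that could be passed to a Random Forest or Gradient Boosting classifier.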


A note on the consistency of the random forest algorithm

arXiv.org Machine Learning

Nowadays, the algorithm is acknowledged to be easy to use and to perform very well in general, even in problems involving many predictor variables (see for instance Biau and Scornet (2016) or the introduction to Scornet, Biau and Vert (2015)), so well, indeed, that several authors have posed and studied the question of their consistency (see Scornet, Biau and Vert (2015) and the earlier references provided by them). Consistent nonparametric statistical predictors have been known for a long time (e.g. Nadaraya (1964), Watson (1964), Stone (1977), Devroye and Wagner (1980)), but they converge very slowly and their computer implementations tend to be slow, especially when they involve many variables. In view of their comparative accuracy and high speed of implementation, random forests would become even more attractive if they were shown to be consistent under general data-generating mechanisms. Besides, consistency is almost indispensable in applications of statistical prediction to the estimation of 'causal effects' based on observational data (e.g.


Predicting movie revenue with AdaBoost, XGBoost and LightGBM

#artificialintelligence

Marvel's Avengers: Endgame recently dethroned Avatar as the highest-grossing movie in history, and while there was no doubt that this movie would be very successful, I want to understand what makes any given movie a success. I am using data from The Movie Database provided through Kaggle. The data set is split into a train and a test set, with the train set containing 3,000 movies and the test set 4,398. There are 22 features in both the train and test set, including budget, genres, belongs_to_collection, runtime, keywords and more. The train data set also contains the target variable revenue.


NGBoost: Natural Gradient Boosting for Probabilistic Prediction

arXiv.org Machine Learning

We present Natural Gradient Boosting (NGBoost), an algorithm which brings probabilistic prediction capability to gradient boosting in a generic way. Predictive uncertainty estimation is crucial in many applications such as healthcare and weather forecasting. Probabilistic prediction, which is the approach where the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify those uncertainties. Gradient Boosting Machines have been widely successful in prediction tasks on structured input data, but a simple boosting solution for probabilistic prediction of real-valued outputs is yet to be made. NGBoost is a gradient boosting approach which uses the Natural Gradient to address technical challenges that make generic probabilistic prediction hard with existing gradient boosting methods. Our approach is modular with respect to the choice of base learner, probability distribution, and scoring rule. We show empirically on several regression datasets that NGBoost provides competitive predictive performance of both uncertainty estimates and traditional metrics.
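As a rough sketch of what "natural gradient" means here: the notation below is introduced for illustration and does not appear in the abstract. Let \theta be the parameters of the predicted distribution P_\theta, y the observed outcome, and S(\theta, y) the scoring rule. The natural gradient rescales the ordinary gradient by the inverse of the metric induced by the scoring rule; for the log score, that metric is the Fisher information.

```latex
\tilde{\nabla} S(\theta, y) \;\propto\; \mathcal{I}_S(\theta)^{-1}\, \nabla_\theta S(\theta, y),
\qquad
\mathcal{I}(\theta) = \mathbb{E}_{y \sim P_\theta}\!\left[ \nabla_\theta \log P_\theta(y)\, \nabla_\theta \log P_\theta(y)^{\top} \right]
\quad \text{(log score)}.
```

Because of this rescaling, the update direction does not depend on how the distribution is parametrized.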


How to train Boosted Trees models in TensorFlow

#artificialintelligence

Tree ensemble methods such as gradient boosted decision trees and random forests are among the most popular and effective machine learning tools available when working with structured data. Tree ensemble methods are fast to train, work well without a lot of tuning, and do not require large datasets to train on. In TensorFlow, gradient boosted trees are available using the tf.estimator API, which also supports deep neural networks, wide-and-deep models, and more. For boosted trees, regression with pre-defined mean squared error loss (BoostedTreesRegressor) and classification with cross entropy loss (BoostedTreesClassifier) are supported.


microsoft/LightGBM

#artificialintelligence

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. For further details, please refer to Features. Benefitting from these advantages, LightGBM is widely used in many winning solutions of machine learning competitions. Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, parallel experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.


Random forest model identifies serve strength as a key predictor of tennis match outcome

arXiv.org Machine Learning

Tennis is a popular sport worldwide, boasting millions of fans and numerous national and international tournaments. Like many sports, tennis has benefitted from the popularity of rigorous record-keeping of game and player information, as well as the growth of machine learning methods for use in sports analytics. Of particular interest to bettors and betting companies alike is potential use of sports records to predict tennis match outcomes prior to match start. We compiled, cleaned, and used the largest database of tennis match information to date to predict match outcome using fairly simple machine learning methods. Using such methods allows for rapid fit and prediction times to readily incorporate new data and make real-time predictions. We were able to predict match outcomes with upwards of 80% accuracy, much greater than predictions using betting odds alone, and identify serve strength as a key predictor of match outcome. By combining prediction accuracies from three models, we were able to nearly recreate a probability distribution based on average betting odds from betting companies, which indicates that betting companies are using similar information to assign odds to matches. These results demonstrate the capability of relatively simple machine learning models to quite accurately predict tennis match outcomes.


A Guide to XGBoost in Python

#artificialintelligence

In this article, we will take a look at various aspects of the XGBoost library. XGBoost is one of the most reliable machine learning libraries when dealing with huge datasets. In my previous article, I gave a brief introduction to XGBoost and how to use it. This article mainly explores many of the useful features of XGBoost. When using machine learning libraries, it is not only about building state-of-the-art models.


NFL Bet Predictor: Random Forest (Machine Learning Model) Week 5 Picks

#artificialintelligence

Our Random Forest model predicts a 66% probability of the OVER 41 points hitting with odds from Westgate in this matchup. The expected value is 30 with a 103 Diff. Check out all the betting info for the Jacksonville Jaguars vs Carolina Panthers on our matchup page. Our Random Forest model predicts a 79% probability of the Indianapolis Colts keeping it within the 5.5 points being offered at the Westgate. The expected value is 50 with a 303 Diff.


The Complete Guide to Decision Trees

#artificialintelligence

Bagging (or Bootstrap Aggregation) is used when the goal is to reduce the variance of a DT. Variance here refers to the fact that DTs can be quite unstable: small variations in the data might result in a completely different Tree being generated. The idea of Bagging is to address this by creating, in parallel, random subsets of the training data, where every observation has the same probability of appearing in a new subset. Next, each subset is used to train a DT, resulting in an ensemble of different DTs. Finally, the predictions of those DTs are averaged, which produces more robust performance than a single DT.
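The three steps above (resample, train, average) can be sketched as follows. This is a minimal illustration under stated assumptions: the subsets are drawn with replacement (the usual bootstrap), the base learner is a depth-1 regression tree on a single feature standing in for a full DT, and all function names and the toy data are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree: returns (threshold, left_mean, right_mean)."""
    best = (np.inf, x[0], y.mean(), y.mean())
    for thr in np.unique(x)[:-1]:  # every candidate split leaves both sides non-empty
        left, right = y[x <= thr], y[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    thr, left_mean, right_mean = stump
    return np.where(x <= thr, left_mean, right_mean)

def fit_bagged(x, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        stumps.append(fit_stump(x[idx], y[idx]))
    return stumps

def predict_bagged(stumps, x):
    """Average the predictions of all trees in the ensemble."""
    return np.mean([predict_stump(s, x) for s in stumps], axis=0)

# Toy data: a noisy step function with the jump at x = 0.5.
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=200)
y = (x > 0.5).astype(float) + rng.normal(0, 0.1, size=200)

ensemble = fit_bagged(x, y)
preds = predict_bagged(ensemble, np.array([0.1, 0.9]))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; averaging smooths out those differences, which is the variance reduction the text describes.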