 Ensemble Learning


Choosing features for random forests algorithm

@machinelearnbot

There are many ways to choose features from a given dataset, and it is always a challenge to pick the ones with which a particular algorithm will work best. Here I will consider data from monitoring the performance of physical exercises with wearable accelerometers, for example wrist bands. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. In this project, researchers used data from accelerometers on the belt, forearm, arm, and dumbbell of a few participants. They were asked to perform barbell lifts correctly, marked as "A", and incorrectly with four typical mistakes, marked as "B", "C", "D" and "E".


Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python

#artificialintelligence

If you have been using GBM as a 'black box' till now, maybe it's time for you to open it and see how it actually works! This article is inspired by Owen Zhang's (Chief Product Officer at DataRobot and Kaggle Rank 3) approach shared at NYC Data Science Academy. He delivered a two-hour talk, and I intend to condense it and present the most valuable nuggets here. Boosting algorithms play a crucial role in dealing with the bias-variance trade-off. Unlike bagging algorithms, which only control for high variance in a model, boosting controls both aspects (bias and variance) and is considered to be more effective.
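As a rough illustration of the boosting idea mentioned above, here is a minimal sketch of gradient boosting for squared-error regression, written in plain Python. It is not code from the article or from Owen Zhang's talk: the function names, the toy data, and the use of one-split "stumps" as weak learners are all assumptions made for this example. The parameter names `n_estimators` and `learning_rate` are chosen only to be self-explanatory.

```python
# Toy gradient boosting with decision stumps (illustrative names throughout).

def fit_stump(xs, residuals):
    """Pick the single threshold on a 1-D feature that minimizes squared error."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds sit halfway between consecutive distinct values,
    # so both sides of every split are non-empty.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Deterministic toy data: y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

if __name__ == "__main__":
    for rounds in (1, 10, 100):
        model = fit_gbm(xs, ys, n_estimators=rounds)
        print(rounds, "rounds, training MSE:", mse(model, xs, ys))
```

Each round fits a very simple (high-bias) learner to what the ensemble still gets wrong, so the training error shrinks as rounds are added. The learning rate limits how much any single stump can move the prediction, which is one of the knobs a tuning guide like this one would typically cover.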


The Shape of the Trees in Gradient Boosting Machines

#artificialintelligence

Our CEO and founder, Dr. Dan Steinberg, recently wrote about gradient boosting machines. Gradient boosting machines are a powerful machine learning technique and have been deployed with great success over the years in Kaggle competitions. However, the specifics of the construction and core ideas of gradient boosting machines can remain a bit murky. For a more detailed look at the shapes and sizes of the trees formed in gradient boosting machines, read the discussion on Dr. Steinberg's blog:


Telstra Network Disruption, Winner's Interview: 1st place, Mario Filho

#artificialintelligence

Telstra Network Disruptions challenged Kagglers to predict the severity of service disruptions on their network. Using a dataset of features from their service logs, participants were tasked with predicting if a disruption was a momentary glitch or a total interruption of connectivity. Mario Filho, a self-taught data scientist, took first place in his first "solo win". In this blog, he shares a high-level view of his approach. My background in machine learning is completely "self-taught". It all began in 2012 when I decided to learn Calculus on my own through the videos from an MIT class.


dmlc/xgboost

#artificialintelligence

This page contains a curated list of examples, tutorials, and blogs about XGBoost use cases. It is inspired by awesome-MXNet, awesome-php and awesome-machine-learning. Please send a pull request if you find things that belong here. This is a list of short code examples introducing different functionalities of the xgboost packages. Most of the examples in this section are based on the CLI or Python version.


Lost in a random forest: Using Big Data to study rare events

#artificialintelligence

Sudden, broad-scale shifts in public opinion about social problems are relatively rare. Until recently, social scientists were forced to conduct post-hoc case studies of such unusual events that ignore the broader universe of possible shifts in public opinion that do not materialize. The vast amount of data that has recently become available via social media sites such as Facebook and Twitter, as well as the mass digitization of qualitative archives, provides an unprecedented opportunity for scholars to avoid such selection on the dependent variable. Yet the sheer scale of these new data creates a new set of methodological challenges. Conventional linear models, for example, minimize the influence of rare events as "outliers", especially within analyses of large samples.


Walmart Kaggle: Trip Type Classification

@machinelearnbot

They took the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. The post was based on their fourth in-class project (due after the 8th week of the program). Walmart uses trip type classification to segment its shoppers and their store visits in order to improve the shopping experience. Walmart's trip types are created from a combination of existing customer insights and purchase history data. The purpose of the Kaggle competition is to use only the purchase data provided to derive Walmart's classification labels.


XGboost Archives - The Big Data Blog

#artificialintelligence

We learn more from code, and from great code. Not necessarily always the first-ranked solution, because we also learn what separates a stellar solution from a merely good one. I will post solutions I come upon so we can all learn to become better! I collected the following source code and interesting discussions from Kaggle-hosted competitions for learning purposes. Not all competitions are listed because I am collecting them manually, and some competitions are missing because no one shared a solution.


XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow

#artificialintelligence

XGBoost is a library designed and optimized for tree boosting. The gradient boosted trees model was originally proposed by Friedman et al. By embracing multithreading and introducing regularization, XGBoost delivers higher computational performance and more accurate predictions. More than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost (incomplete list). XGBoost provides native interfaces for C, R, Python, Julia and Java users.


XGBoost: A Scalable Tree Boosting System

#artificialintelligence

Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system.