Ensemble Learning
Shehroz Khan's answer to What are some machine learning algorithms I can learn without calculus? - Quora
Calculus is not a prerequisite for learning many ML algorithms, such as KNN, Naive Bayes, Decision Trees, Random Forest, boosting, and similar methods. However, if you want to go the route of neural nets or deep learning, you need to learn calculus, because algorithms such as backpropagation use it heavily. Gradient boosting and similar algorithms also require calculus knowledge.
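To illustrate why a method like KNN needs no calculus, here is a minimal sketch using only the Python standard library. The function name knn_predict and the toy data are hypothetical, chosen for illustration; the only operations are distance computation, sorting, and counting.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for this sketch): two well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
```

There is no training step and no gradient anywhere: prediction is a lookup over stored examples.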
dmlc/xgboost
This plugin currently works with the CLI version and Python version. The maximum number of nodes needed for a given tree depth d is 2^(d+1) - 1. The maximum number of nodes on any given level is 2^d. Data is stored in a sparse format. For example, missing values produced by one hot encoding are not stored.
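The node counts can be checked with a few lines of Python. This sketch assumes a full binary tree whose root sits at depth 0, so that a tree of depth d has at most 2^(d+1) - 1 nodes in total and at most 2^d nodes on its deepest level; the function names are hypothetical and not part of xgboost.

```python
def max_nodes_in_tree(depth: int) -> int:
    """Total node count of a full binary tree of the given depth (root at depth 0)."""
    return 2 ** (depth + 1) - 1


def max_nodes_on_level(level: int) -> int:
    """Node count on one level of a full binary tree (root is level 0)."""
    return 2 ** level


for d in range(4):
    print(d, max_nodes_on_level(d), max_nodes_in_tree(d))
```

Summing the per-level counts from level 0 to level d gives the total, which is why the two formulas agree.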
Walmart Competition: Trip Type Classification
They took the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. The post was based on their fourth in-class project (due after the 8th week of the program). Walmart uses trip type classification to segment its shoppers and their store visits to better improve the shopping experience. Walmart's trip types are created from a combination of existing customer insights and purchase history data. The purpose of the Kaggle competition is to use only the purchase data provided to derive Walmart's classification labels.
Quant Trading using Machine Learning - Udemy
Prerequisites: Working knowledge of Python is necessary if you want to run the source code that is provided. Basic knowledge of machine learning, especially ML classification techniques, would be helpful but it's not mandatory. Taught by a Stanford-educated, ex-Googler and an IIT, IIM - educated ex-Flipkart lead analyst. This team has decades of practical experience in quant trading, analytics and e-commerce. Completely Practical: This course has just enough theory to get you started with both Quant Trading and Machine Learning.
Random Forests in Python
This post originally appeared on the Yhat blog. Yhat is a Brooklyn-based company whose goal is to make data science applicable for developers, data scientists, and businesses alike. Yhat provides a software platform for deploying and managing predictive algorithms as REST APIs, while eliminating the painful engineering obstacles associated with production environments like testing, versioning, scaling and security. Random forest can be applied to modeling customer acquisition, retention, and churn, or to predicting outcomes in patients. Random forest is capable of regression and classification. It can handle a large number of features, and it's helpful for estimating which of your variables are important in the underlying data being modeled.
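As a rough sketch of how variable importance can be estimated with a random forest, the example below uses scikit-learn, which is an assumption here since the excerpt does not name a library. The synthetic dataset and all parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for this sketch): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one non-negative score per feature, summing to 1.
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, score in ranked:
    print(f"feature {index}: {score:.3f}")
```

For a regression target, RandomForestRegressor from the same module exposes the same feature_importances_ attribute.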
Want to Win at Kaggle? Pay Attention to Your Ensembles.
The Kaggle competitions are like formula racing for data science. Winners edge out competitors at the fourth decimal place and, like Formula 1 race cars, not many of us would mistake them for daily drivers. The amount of time devoted and the sometimes extreme techniques wouldn't be appropriate in a data science production environment, but like paddle shifters and exotic suspensions, some of those improvements find their way into day-to-day life. Ensembles, or teams of predictive models working together, have been the core strategy for winning at Kaggle. They've been around for a long time but they are getting better.
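The idea of models working as a team can be shown with the simplest ensemble, a majority vote. The sketch below is a toy illustration, not a Kaggle technique from the article: the predictions are made up, and the three hypothetical models are deliberately given non-overlapping mistakes, which is the idealized case where voting helps most.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote per sample."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up labels and predictions; each model is wrong on two different samples.
truth = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1, 0, 1]
model_b = [1, 1, 1, 1, 0, 0, 0, 0]
model_c = [0, 0, 1, 1, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("vote", ensemble)]:
    print(name, accuracy(preds, truth))
```

When the models' errors are correlated instead, the vote gains much less, which is why diversity among ensemble members matters.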
rushter/MLAlgorithms
A collection of minimal and clean implementations of machine learning algorithms. This project targets people who want to learn the internals of ML algorithms or implement them from scratch. The code is much easier to follow than the optimized libraries and easier to play with. All algorithms are implemented in Python, using numpy, scipy and autograd.
When Does Deep Learning Work Better Than SVMs or Random Forests?
Guest blog by Sebastian Raschka, originally posted here. If we tackle a supervised learning problem, my advice is to start with the simplest hypothesis space first. I.e., try a linear model such as logistic regression. If this doesn't work "well" (i.e., it doesn't meet our expectation or performance criterion that we defined earlier), I would move on to the next experiment. I would say that random forests are probably THE "worry-free" approach - if such a thing exists in ML: There are no real hyperparameters to tune (maybe except for the number of trees; typically, the more trees we have the better).
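The workflow described here, starting with a linear model and then moving to a random forest, might look like the following sketch. It assumes scikit-learn and a synthetic dataset, neither of which comes from the post; the number of trees and the fold count are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for this sketch).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Simplest hypothesis space first, then the "worry-free" random forest.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing both models against a performance criterion fixed in advance keeps the decision to move on to a more complex model honest.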
Data Scientists Automated and Unemployed by 2025!
In a recent poll the question was raised "Will Data Scientists be replaced by software, and if so, when?" Are we really just grist for the AI mill? As part of the broader digital technology revolution, we data scientists regard ourselves as part of the solution, not part of the problem. But as part of this fast-moving industry built on identifying and removing pain points, it's possible to see that we are actually part of the problem. Seen as a good news / bad news story it goes like this. The good news is that advanced predictive analytics are gaining acceptance and penetration at an ever-expanding rate.
One Class Splitting Criteria for Random Forests
Goix, Nicolas, Drougard, Nicolas, Brault, Romain, Chiapino, Maël
Random Forests (RFs) are strong machine learning tools for classification and regression. However, they remain supervised algorithms, and no extension of RFs to the one-class setting has been proposed, except for techniques based on second-class sampling. This work fills this gap by proposing a natural methodology to extend standard splitting criteria to the one-class setting, structurally generalizing RFs to one-class classification. An extensive benchmark of seven state-of-the-art anomaly detection algorithms is also presented. This empirically demonstrates the relevance of our approach.