Decision Tree Learning
Random Forests in Python
This post originally appeared on the Yhat blog. Yhat is a Brooklyn based company whose goal is to make data science applicable for developers, data scientists, and businesses alike. Yhat provides a software platform for deploying and managing predictive algorithms as REST APIs, while eliminating the painful engineering obstacles associated with production environments like testing, versioning, scaling and security. It can be used to on customer acquisition, retention, and churn or to in patients. Random forest is capable of regression and classification. It can handle a large number of features, and it's helpful for estimating which of your variables are important in the underlying data being modeled.
How to Implement Bagging From Scratch With Python - Machine Learning Mastery
Decision trees are a simple and powerful predictive modeling technique, but they suffer from high-variance. This means that trees can get very different results given different training data. A technique to make decision trees more robust and to achieve better performance is called bootstrap aggregation or bagging for short. In this tutorial, you will discover how to implement the bagging procedure with decision trees from scratch with Python. How to apply bagging to your own predictive modeling problems.
A New Method for Classification of Datasets for Data Mining
Vijendra, Singh, Parashar, Hemjyotsana, Vasudeva, Nisha
Humans have been manually extracting patterns from data for centuries, but the increasing volume of data in modern times has called for more automated approaches. Information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. A variety of information collected in digital form in databases and in flat files.
Top Data Mining Algorithms Identified by IEEE & Related Python Resources
C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ...,x_{p,i}), where the x_j represent attributes or features of the sample, as well as the class in which s_i falls.
Logarithmic Time One-Against-Some
Daume, Hal III, Karampatziakis, Nikos, Langford, John, Mineiro, Paul
We create a new online reduction of multiclass classification to binary classification for which training and prediction time scale logarithmically with the number of classes. Compared to previous approaches, we obtain substantially better statistical performance for two reasons: First, we prove a tighter and more complete boosting theorem, and second we translate the results more directly into an algorithm. We show that several simple techniques give rise to an algorithm that can compete with one-against-all in both space and predictive power while offering exponential improvements in speed when the number of classes is large.
Auditing Black-box Models for Indirect Influence
Adler, Philip, Falk, Casey, Friedler, Sorelle A., Rybeck, Gabriel, Scheidegger, Carlos, Smith, Brandon, Venkatasubramanian, Suresh
Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models, or asserting that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the dataset, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all. Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures.
Machine learning as a service ? Might lose sleep over this !
This post is'not' intended to teach people how to use popular predictive modelling APIs for free. Although, to your surprise, this isn't a far fetched possibility. Trained Machine learning models are basically a function that maps feature vectors to the output variable. Upon querying with a test instance, the model predicts an outcome, assigning probability scores to all the possible classes. Google, Amazon etc provides public facing APIs to train predictive models on the subscriber's data, the model can further be used for prediction purposes .
Titanic: Machine Learning from Disaster
If you're new to data science and machine learning, or looking for a simple intro to the Kaggle competitions platform, this is the best place to start. Continue reading below the competition description to discover a number of tutorials, benchmark models, and more. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
When Does Deep Learning Work Better Than SVMs or Random Forests?
Guest blog by Sebastian Raschka, originally posted here. If we tackle a supervised learning problem, my advice is to start with the simplest hypothesis space first. I.e., try a linear model such as logistic regression. If this doesn't work "well" (i.e., it doesn't meet our expectation or performance criterion that we defined earlier), I would move on to the next experiment. I would say that random forests are probably THE "worry-free" approach - if such a thing exists in ML: There are no real hyperparameters to tune (maybe except for the number of trees; typically, the more trees we have the better).
Machine Learning with Talend - Getting Started
Decision trees are used extensively in machine learning because they are easy to use, easy to interpret, and easy to operationalize. KD Nuggets, one of the most respected sites for data science and machine learning, recently published an article that identified decision trees as a "top 10" algorithm for machine learning. If you are new to machine learning, some of these concepts may be unfamiliar. The goal of this blog is to provide you with the basics of decision trees using Talend and Apache Spark. If you want to learn more about advanced analytics, please see the references section below.(2)