Goto

Collaborating Authors

 Decision Tree Learning


Fundamentals of Decision Trees in Machine Learning

@machinelearnbot

A tree has many analogies in real life, and turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. If you're working towards an understanding of machine learning, it's important to know how to work with decision trees. This course covers the essentials of machine learning, including predictive analytics and working with decision trees. In this course, we'll explore several popular tree algorithms and learn how to use reverse engineering to identify specific variables.


A Simple and Effective Model-Based Variable Importance Measure

arXiv.org Machine Learning

In the era of "big data", it is becoming more of a challenge to not only build state-of-the-art predictive models, but also gain an understanding of what's really going on in the data. For example, it is often of interest to know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome. Some modern algorithms---like random forests and gradient boosted decision trees---have a natural way of quantifying the importance or relative influence of each feature. Other algorithms---like naive Bayes classifiers and support vector machines---are not capable of doing so and model-free approaches are generally used to measure each predictor's importance. In this paper, we propose a standardized, model-based approach to measuring predictor importance across the growing spectrum of supervised learning algorithms. Our proposed method is illustrated through both simulated and real data examples. The R code to reproduce all of the figures in this paper is available in the supplementary materials.


Top 10 Machine Learning Algorithms for Beginners

#artificialintelligence

The study of ML algorithms has gained immense traction post the Harvard Business Review article terming a'Data Scientist' as the'Sexiest job of the 21st century'. So, for those starting out in the field of ML, we decided to do a reboot of our immensely popular Gold blog The 10 Algorithms Machine Learning Engineers need to know -- albeit this post is targeted towards beginners. ML algorithms are those that can learn from data and improve from experience, without human intervention. Learning tasks may include learning the function that maps the input to the output, learning the hidden structure in unlabeled data; or'instance-based learning', where a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory. 'Instance-based learning' does not create an abstraction from specific instances. Supervised learning can be explained as follows: use labeled training data to learn the mapping function from the input variables (X) to the output variable (Y).


Opinion Fraud Detection via Neural Autoencoder Decision Forest

arXiv.org Artificial Intelligence

Online reviews play an important role in influencing buyers' daily purchase decisions. However, fake and meaningless reviews, which cannot reflect users' genuine purchase experience and opinions, widely exist on the Web and pose great challenges for users to make right choices. Therefore,it is desirable to build a fair model that evaluates the quality of products by distinguishing spamming reviews. We present an end-to-end trainable unified model to leverage the appealing properties from Autoencoder and random forest. A stochastic decision tree model is implemented to guide the global parameter learning process. Extensive experiments were conducted on a large Amazon review dataset. The proposed model consistently outperforms a series of compared methods.


Introduction to Machine Learning for non-developers

#artificialintelligence

There are several types of predictive models. These models usually have several input columns and one target or outcome column, which is the variable to be predicted. So basically, a model performs mapping between inputs and an output, finding-mysteriously, sometimes-the relationships between the input variables in order to predict any other variable. As you may notice, it has some commonalities with a human being who reads the environment processes the information and performs a certain action. It's about becoming familiar with one of the most-used predictive models: Random Forest (official algorithm site), implemented in R, one of the most-used models due to its simplicity in tuning and robustness across many different types of data.


Fighting Accounting Fraud Through Forensic Data Analytics

arXiv.org Machine Learning

Accounting fraud is a global concern representing a significant threat to the financial system stability due to the resulting diminishing of the market confidence and trust of regulatory authorities. Several tricks can be used to commit accounting fraud, hence the need for non-static regulatory interventions that take into account different fraudulent patterns. Accordingly, this study aims to improve the detection of accounting fraud via the implementation of several machine learning methods to better differentiate between fraud and non-fraud companies, and to further assist the task of examination within the riskier firms by evaluating relevant financial indicators. Out-of-sample results suggest there is a great potential in detecting falsified financial statements through statistical modelling and analysis of publicly available accounting information. The proposed methodology can be of assistance to public auditors and regulatory agencies as it facilitates auditing processes, and supports more targeted and effective examinations of accounting reports.


Wavelet Decomposition of Gradient Boosting

arXiv.org Machine Learning

In this paper we introduce a significant improvement to the popular tree-based Stochastic Gradient Boosting algorithm using a wavelet decomposition of the trees. This approach is based on harmonic analysis and approximation theoretical elements, and as we show through extensive experimentation, our wavelet based method generally outperforms existing methods, particularly in difficult scenarios of class unbalance and mislabeling in the training data.


Complete Analysis of a Random Forest Model

arXiv.org Machine Learning

Random forests have become an important tool for improving accuracy in regression problems since their popularization by (Breiman, 2001) and others. In this paper, we revisit a random forest model originally proposed by (Breiman, 2004) and later studied by (Biau, 2012), where a feature is selected at random and the split occurs at the midpoint of the block containing the chosen feature. If the regression function is sparse and depends only on a small, unknown subset of $ S $ out of $ d $ features, we show that given $ n $ observations, this random forest model outputs a predictor that has a mean-squared prediction error of order $ \left(n\sqrt{\log^{S-1} n}\right)^{-\frac{1}{S\log2+1}} $. When $ S \leq \lfloor 0.72 d \rfloor $, this rate is better than the minimax optimal rate $ n^{-\frac{2}{d+2}} $ for $ d $-dimensional, Lipschitz function classes. As a consequence of our analysis, we show that the variance of the forest decays with the depth of the tree at a rate that is independent of the ambient dimension, even when the trees are fully grown. In particular, if $ \ell_{avg} $ (resp. $ \ell_{max} $) is the average (resp. maximum) number of observations per leaf node, we show that the variance of this forest is $ \Theta\left(\ell^{-1}_{avg}(\sqrt{\log n})^{-(S-1)}\right) $, which for the case of $ S = d $, is similar in form to the lower bound $ \Omega\left(\ell^{-1}_{max}(\log n)^{-(d-1)}\right) $ of (Lin and Jeon, 2006) for any random forest model with a nonadaptive splitting scheme. We also show that the bias is tight for any linear model with nonzero parameter vector. Thus, we completely characterize the fundamental limits of this random forest model. Our new analysis also implies that better theoretical performance can be achieved if the trees are grown less aggressively (i.e., grown to a shallower depth) than previous work would otherwise recommend.


Predicting animal adoption with Random Forest, SVM

@machinelearnbot

Joanne Lin, a student at Thinkful's data science bootcamp, decided to jump in and find insights that can help shelters get more pets rescued.


Comparison of Classical and Nonlinear Models for Short-Term Electricity Price Prediction

arXiv.org Machine Learning

Electricity is bought and sold in wholesale markets at prices that fluctuate significantly. Short-term forecasting of electricity prices is an important endeavor because it helps electric utilities control risk and because it influences competitive strategy for generators. As the "smart grid" grows, short-term price forecasts are becoming an important input to bidding and control algorithms for battery operators and demand response aggregators. While the statistics and machine learning literature offers many proposed methods for electricity price prediction, there is no consensus supporting a single best approach. We test two contrasting machine learning approaches for predicting electricity prices, regression decision trees and recurrent neural networks (RNNs), and compare them to a more traditional ARIMA implementation. We conduct the analysis on a challenging dataset of electricity prices from ERCOT, in Texas, where price fluctuation is especially high. We find that regression decision trees in particular achieves high performance compared to the other methods, suggesting that regression trees should be more carefully considered for electricity price forecasting.