Decision Tree Learning
Neural Regression Trees
Memon, Shahan Ali, Zhao, Wenbo, Raj, Bhiksha, Singh, Rita
Regression-via-Classification (RvC) is the process of converting a regression problem to a classification one. Current approaches for RvC use ad-hoc discretization strategies and are suboptimal. We propose a neural regression tree model for RvC. In this model, we employ a joint optimization framework where we learn optimal discretization thresholds while simultaneously optimizing the features for each node in the tree. We empirically show the validity of our model by testing it on two challenging regression tasks where we establish the state of the art.
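To make the RvC idea concrete, here is a minimal sketch of the generic pipeline: discretize the continuous target into bins, treat the bin index as a class label, and map a predicted class back to a representative value. The thresholds are hand-picked, which is the kind of ad hoc discretization the abstract criticizes, so this is not the proposed neural regression tree. The function names and toy values are assumptions for illustration.

```python
from bisect import bisect_right


def discretize(y, thresholds):
    """Return the index of the bin that the continuous target y falls into."""
    return bisect_right(thresholds, y)


def bin_representatives(ys, thresholds):
    """Mean training target per bin (None for an empty bin)."""
    sums = [0.0] * (len(thresholds) + 1)
    counts = [0] * (len(thresholds) + 1)
    for y in ys:
        k = discretize(y, thresholds)
        sums[k] += y
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def decode(predicted_class, representatives):
    """Turn a predicted class index back into a regression estimate."""
    return representatives[predicted_class]


# Toy targets and hand-picked thresholds (both hypothetical).
targets = [21.0, 25.0, 34.0, 38.0, 52.0, 60.0]
thresholds = [30.0, 45.0]

labels = [discretize(y, thresholds) for y in targets]   # class labels for a classifier
reps = bin_representatives(targets, thresholds)          # one value per class
estimate = decode(1, reps)                               # estimate if class 1 is predicted
```

Any classifier can be trained on `labels`; the quality of the final estimate then depends heavily on where the thresholds sit, which is the motivation for learning them jointly.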
Example of Random Forest application in Finance: Option Pricing
Let's assume we know how much a Tesla share costs in two weeks (2W). Our 'only' unknown is the future option value (Y_T), given all the information we have at t = 2W. In other words, if you were two weeks in the future, what would be the expected value of your portfolio, which consists of this one American option? You have information at 2W and you want to predict the option value at 1M. Beforehand, we need to simulate multiple scenarios for the Tesla share price. For model simplicity, we suppose the Tesla share price follows a geometric Brownian motion with drift r (the risk-free rate) and volatility sigma = 20% (we refer interested readers to the theory of stochastic processes).
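The scenario simulation step might look like the following sketch, which draws share prices at the two-week horizon from the exact geometric Brownian motion solution S_T = S_0 * exp((r - sigma^2 / 2) * T + sigma * sqrt(T) * Z). Only the 20% volatility comes from the text; the starting price, the risk-free rate, the number of scenarios, and the function name are assumptions chosen for illustration.

```python
import math
import random


def simulate_gbm_terminal(s0, r, sigma, horizon_years, n_paths, seed=0):
    """Sample terminal prices of a geometric Brownian motion with drift r."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * horizon_years
    diffusion = sigma * math.sqrt(horizon_years)
    return [
        s0 * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        for _ in range(n_paths)
    ]


# Assumed inputs: s0 and r are made up; sigma = 20% is taken from the text.
prices = simulate_gbm_terminal(
    s0=300.0, r=0.02, sigma=0.20, horizon_years=2 / 52, n_paths=10000
)
mean_price = sum(prices) / len(prices)
```

Each simulated price is one scenario at 2W; a regression model such as a random forest could then be fitted on these scenarios to predict the option value at 1M.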
How to visualize decision tree
The scikit tree does a good job of representing the tree structure, but we have a few quibbles. The colors aren't the best and it's not immediately obvious why some of the nodes are colored and some aren't. If the colors represent predicted class for this classifier, one would think just the leaves would be colored because only leaves have predictions. The count of samples of the various target classes in each node is somewhat useful, but a histogram would be even better. A target class color legend would be nice.
Introduction to Machine Learning for Coders: Launch · fast.ai
The course, recorded at the University of San Francisco as part of the Masters of Science in Data Science curriculum, covers the most important practical foundations for modern machine learning. There are 12 lessons, each of which is around two hours long--a list of all the lessons along with a screenshot from each is at the end of this post. There are some excellent machine learning courses already, most notably the wonderful Coursera course from Andrew Ng. But that course is showing its age now, particularly since it uses Matlab for coursework. This new course uses modern tools and libraries, including python, pandas, scikit-learn, and pytorch.
An overview of feature selection strategies
Feature selection and engineering are the most important factors that affect the success of predictive modeling. This remains true even today despite the success of deep learning, which comes with automatic feature engineering. Parsimonious and interpretable models provide simple insights into business problems and are therefore deemed very valuable. Furthermore, on many occasions the underlying size and structure of the data being analyzed may not allow the use of complex models that have many parameters to tune. For example, in clinical settings where the number of samples is usually much lower than the number of features one could extract (e.g.
The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark
This post attempts to consolidate information on tree algorithms and their implementations in Scikit-learn and Spark. In particular, it was written to provide clarification on how feature importance is calculated. There are many great resources online discussing how decision trees and random forests are created and this post is not intended to be that. Although it includes short definitions for context, it assumes the reader has a grasp on these concepts and wishes to know how the algorithms are implemented in Scikit-learn and Spark. Decision trees learn how to best split the dataset into smaller and smaller subsets to predict the target value.
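As a rough illustration of impurity-based feature importance, the sketch below computes Gini impurity from class counts, scores each split by its weighted impurity decrease, sums those scores per feature, and normalizes them to sum to one. This mirrors the general recipe used for tree-based importances; the toy tree, the feature names `x0` and `x1`, and the helper names are assumptions, not code from either library.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def node_importance(parent, left, right, n_total):
    """Weighted impurity decrease achieved by splitting one node."""
    n_p, n_l, n_r = sum(parent), sum(left), sum(right)
    return (n_p * gini(parent) - n_l * gini(left) - n_r * gini(right)) / n_total


def feature_importances(splits, n_total):
    """Sum node importances per feature, then normalize to sum to 1."""
    raw = {}
    for feature, parent, left, right in splits:
        raw[feature] = raw.get(feature, 0.0) + node_importance(
            parent, left, right, n_total
        )
    norm = sum(raw.values())
    return {f: v / norm for f, v in raw.items()}


# Hypothetical two-split tree on 10 samples: (feature, parent, left, right counts).
splits = [
    ("x0", [5, 5], [5, 1], [0, 4]),  # root split
    ("x1", [5, 1], [5, 0], [0, 1]),  # split of the root's left child
]
importances = feature_importances(splits, n_total=10)
```

In this toy tree the root split on `x0` removes twice as much weighted impurity as the split on `x1`, so `x0` receives two thirds of the total importance.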
Mobility Mode Detection Using WiFi Signals
Kalatian, Arash, Farooq, Bilal
We utilize Wi-Fi communications from smartphones to predict their mobility mode, i.e. walking, biking and driving. Wi-Fi sensors were deployed at four strategic locations in a closed loop on streets in downtown Toronto. A deep neural network (Multilayer Perceptron) and three decision-tree-based classifiers (Decision Tree, Bagged Decision Tree and Random Forest) are developed. Results show that the best prediction accuracy is achieved by the Multilayer Perceptron, with 86.52% correct predictions of mobility modes.
Decision Trees for Classification: A Machine Learning Algorithm
No statistical knowledge is required to read and interpret decision trees. Their graphical representation is very intuitive, and users can easily relate it to their hypotheses. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relationships between two or more variables. With the help of decision trees, we can identify the features that have the most power to predict the target variable. For example, when we are working on a problem with information spread across hundreds of variables, a decision tree will help identify the most significant ones. Less data cleaning required: Decision trees require less data cleaning than some other modeling techniques. To a fair degree, they are not influenced by outliers and missing values. Data type is not a constraint: They can handle both numerical and categorical variables.
Perturb and Combine to Identify Influential Spreaders in Real-World Networks
Tixier, Antoine J. -P., Rossi, Maria-Evgenia G., Malliaros, Fragkiskos D., Read, Jesse, Vazirgiannis, Michalis
Recent research has shown that graph degeneracy algorithms, which decompose a network into a hierarchy of nested subgraphs of decreasing size and increasing density, are very effective at detecting the good spreaders in a network. However, it is also known that degeneracy-based decompositions of a graph are unstable to small perturbations of the network structure. In Machine Learning, the performance of unstable classification and regression methods, such as fully-grown decision trees, can be greatly improved by using Perturb and Combine (P&C) strategies such as bagging (bootstrap aggregating). Therefore, we propose a P&C procedure for networks that (1) creates many perturbed versions of a given graph, (2) applies a node scoring function separately to each graph (such as a degeneracy-based one), and (3) combines the results. We conduct real-world experiments on the tasks of identifying influential spreaders in large social networks, and influential words (keywords) in small word co-occurrence networks. We use the k-core, generalized k-core, and PageRank algorithms as our vertex scoring functions. In each case, using the aggregated scores brings significant improvements compared to using the scores computed on the original graphs. Finally, a bias-variance analysis suggests that our P&C procedure works mainly by reducing bias, and that therefore, it should be capable of improving the performance of all vertex scoring functions, not only unstable ones.
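The three steps of the procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the perturbation model (independent random edge deletion), the combination rule (a plain mean), and all function names are chosen here for brevity, and the scoring function is a simple quadratic-time k-core peeling routine.

```python
import random


def core_numbers(nodes, edges):
    """k-core number of every node, by repeatedly peeling a min-degree node."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(adj[v]) for v in nodes}
    remaining, core, k = set(nodes), {}, 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def perturb(edges, drop_prob, rng):
    """Step 1 (assumed model): delete each edge independently with drop_prob."""
    return [e for e in edges if rng.random() >= drop_prob]


def perturb_and_combine(nodes, edges, score_fn, n_rounds=50, drop_prob=0.1, seed=0):
    """Steps 2 and 3: score every perturbed graph, then average per node."""
    rng = random.Random(seed)
    totals = {v: 0.0 for v in nodes}
    for _ in range(n_rounds):
        scores = score_fn(nodes, perturb(edges, drop_prob, rng))
        for v in nodes:
            totals[v] += scores[v]
    return {v: t / n_rounds for v, t in totals.items()}


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
original = core_numbers(nodes, edges)
aggregated = perturb_and_combine(nodes, edges, core_numbers, drop_prob=0.2)
```

The aggregated scores are real-valued, so they can break ties between nodes that share the same integer core number on the original graph.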
Proximity Forest: An effective and scalable distance-based classifier for time series
Lucas, Benjamin, Shifaz, Ahmed, Pelletier, Charlotte, O'Neill, Lachlan, Zaidi, Nayyar, Goethals, Bart, Petitjean, Francois, Webb, Geoffrey I.
Research into the classification of time series has made enormous progress in the last decade. The UCR time series archive has played a significant role in challenging and guiding the development of new learners for time series classification. The largest dataset in the UCR archive holds only 10 thousand time series, which may explain why the primary research focus has been on creating algorithms that have high accuracy on relatively small datasets. This paper introduces Proximity Forest, an algorithm that learns accurate models from datasets with millions of time series, and classifies a time series in milliseconds. The models are ensembles of highly randomized Proximity Trees. Whereas conventional decision trees branch on attribute values (and usually perform poorly on time series), Proximity Trees branch on the proximity of time series to one exemplar time series or another, allowing us to leverage the decades of work into developing relevant measures for time series. Proximity Forest gains both efficiency and accuracy by stochastic selection of both exemplars and similarity measures. Our work is motivated by recent time series applications that provide orders of magnitude more time series than the UCR benchmarks. Our experiments demonstrate that Proximity Forest is highly competitive on the UCR archive: it ranks among the most accurate classifiers while being significantly faster. We demonstrate on a 1M time series Earth observation dataset that Proximity Forest retains this accuracy on datasets that are many orders of magnitude greater than those in the UCR repository, while learning its models at least 100,000 times faster than current state of the art models Elastic Ensemble and COTE.
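A single randomized proximity split can be sketched as follows: pick a similarity measure at random, pick exemplar series at random, and send every series down the branch of its nearest exemplar. This is only an illustration of the branching idea under several assumptions: one exemplar is drawn per class, the candidate measures are limited to Euclidean distance and a basic dynamic time warping, only one candidate split is drawn, and all names are hypothetical.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw(a, b):
    """Dynamic time warping distance with squared pointwise cost, no window."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            curr.append(cost + min(prev[j], prev[j - 1], curr[j - 1]))
        prev = curr
    return prev[-1] ** 0.5


def random_split(series, labels, measures, rng):
    """Draw a measure and one exemplar per class, then branch by proximity."""
    measure = rng.choice(measures)
    exemplars = {}
    for label in sorted(set(labels)):
        candidates = [s for s, l in zip(series, labels) if l == label]
        exemplars[label] = rng.choice(candidates)
    branches = {label: [] for label in exemplars}
    for idx, s in enumerate(series):
        nearest = min(exemplars, key=lambda lab: measure(s, exemplars[lab]))
        branches[nearest].append(idx)
    return measure, exemplars, branches


# Toy data: two flat series and two ramps (hypothetical).
series = [[0, 0, 0, 0], [0, 0.1, 0, 0], [0, 1, 2, 3], [0, 1, 2, 3.2]]
labels = ["flat", "flat", "ramp", "ramp"]
measure, exemplars, branches = random_split(
    series, labels, [euclidean, dtw], random.Random(0)
)
```

A full tree would apply such splits recursively to each branch until the branches are pure, and a forest would combine the votes of many such trees.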