Goto

Collaborating Authors

 Decision Tree Learning


RFCDE: Random Forests for Conditional Density Estimation

arXiv.org Machine Learning

Random forests is a common non-parametric regression technique which performs well for mixed-type data and irrelevant covariates, while being robust to monotonic variable transformations. Existing random forest implementations target regression or classification. We introduce the RFCDE package for fitting random forest models optimized for nonparametric conditional density estimation, including joint densities for multiple responses. This enables analysis of conditional probability distributions which is useful for propagating uncertainty and of joint distributions that describe relationships between multiple responses and covariates. RFCDE is released under the MIT open-source license and can be accessed at https://github.com/tpospisi/rfcde . Both R and Python versions, which call a common C++ library, are available.


Decision Tree Design for Classification in Crowdsourcing Systems

arXiv.org Machine Learning

In recent work on classification in crowdsourcing systems, complex questions are often replaced by a set of simpler binary questions (microtasks) to enhance classification performance [1]-[4]. This is especially helpful in situations where crowd workers lack expertise for responding to complex questions directly. Each worker is given the entire set of questions in a batch mode and the workers provide their responses in the form of a vector. These binary questions can be posted as "microtasks" on crowdsourcing platforms like Amazon Mechanical Turk [5]. To improve classification performance in crowdsourcing systems, most of the works in the literature focus on enhancing the quality of individual tests, by designing fusion rules to combine decisions from heterogeneous workers [1]-[4], [6], [7], and by investigating the assignment of different tests to different workers depending upon their skill level [8], [9].


Gradient Boosting vs Random Forest โ€“ Abolfazl Ravanshad โ€“ Medium

#artificialintelligence

In this post, I am going to compare two popular ensemble methods, Random Forests (RM) and Gradient Boosting Machine (GBM). GBM and RF both are ensemble learning methods and predict (regression or classification) by combining the outputs from individual trees (we assume tree-based GBM or GBT). They have all the strengths and weaknesses of the ensemble methods mentioned in my previous post. So, here we compare them only with respect to each other. GBM and RF differ in the way the trees are built: the order and the way the results are combined.


Handling Missing Values using Decision Trees with Branch-Exclusive Splits

arXiv.org Machine Learning

In this article we propose a new decision tree construction algorithm. The proposed approach allows the algorithm to interact with some predictors that are only defined in subspaces of the feature space. One way to utilize this new tool is to create or use one of the predictors to keep track of missing values. This predictor can later be used to define the subspace where predictors with missing values are available for the data partitioning process. By doing so, this new classification tree can handle missing values for both modelling and prediction. The algorithm is tested against simulated and real data. The result is a classification procedure that efficiently handles missing values and produces results that are more accurate and more interpretable than most common procedures.


Decision Tree Classification models to predict employee turnover

#artificialintelligence

In this project I have attempted to create supervised learning models to assist in classifying certain employee data. I pre-processed the data by removing one outlier and producing new features in Excel as the data set was small at 1056 rows. Some categorical features were also converted to numeric values in Excel. For example, Gender was originally "M" or "F", which was converted to 0 and 1 respectively. I also removed employee number as it provides no value as a feature and could compromise privacy.


MetaBags: Bagged Meta-Decision Trees for Regression

arXiv.org Machine Learning

Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale, whereas in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. MetaBags is a meta-learning algorithm that learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. A set of meta-decision trees are learned using different types of meta-features, specially created for this purpose - to then be bagged at meta-level. This procedure is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. The proposed method and meta-features are designed in such a way that they enable good predictive performance even in subregions of space which are not adequately represented in the available training data. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on synthetic, open and real-world application datasets. The obtained results show that our method significantly outperforms existing state-of-the-art approaches.


LEARNING PATH: R: Machine Learning Algorithms with R

@machinelearnbot

Are you interested to explore advanced algorithm concepts such as random forest vector machine, K- nearest, and more through real-world examples? Packt's Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it. Machine learning and data science are some of the top buzzwords in the technical world today. Machine learning - the application and science of algorithms that makes sense of data, is the most exciting field of all the computer sciences! It explores the study and construction of algorithms that can learn from and make predictions on data.


Data Mining with Rattle Udemy

@machinelearnbot

Data Mining with Rattle is a unique course that instructs with respect to both the concepts of data mining, as well as to the "hands-on" use of a popular, contemporary data mining software tool, "Data Miner," also known as the'Rattle' package in R software. Rattle is a popular GUI-based software tool which'fits on top of' R software. The course focuses on life-cycle issues, processes, and tasks related to supporting a'cradle-to-grave' data mining project. These include: data exploration and visualization; testing data for random variable family characteristics and distributional assumptions; transforming data by scale or by data type; performing cluster analyses; creating, analyzing and interpreting association rules; and creating and evaluating predictive models that may utilize: regression; generalized linear modeling (GLMs); decision trees; recursive partitioning; random forests; boosting; and/or support vector machine (SVM) paradigms. It is both a conceptual and a practical course as it teaches and instructs about data mining, and provides ample demonstrations of conducting data mining tasks using the Rattle R package. The course is ideal for undergraduate students seeking to master additional'in-demand' analytical job skills to offer a prospective employer.


Asynchronous Parallel Sampling Gradient Boosting Decision Tree

arXiv.org Machine Learning

With the development of big data technology, Gradient Boosting Decision Tree, i.e. GBDT, becomes one of the most important machine learning algorithms for its accurate output. However, the training process of GBDT needs a lot of computational resources and time. In order to accelerate the training process of GBDT, the asynchronous parallel sampling gradient boosting decision tree, abbr. asynch-SGBDT is proposed in this paper. Via introducing sampling, we adapt the numerical optimization process of traditional GBDT training process into stochastic optimization process and use asynchronous parallel stochastic gradient descent to accelerate the GBDT training process. Meanwhile, the theoretical analysis of asynch-SGBDT is provided by us in this paper. Experimental results show that GBDT training process could be accelerated by asynch-SGBDT. Our asynchronous parallel strategy achieves an almost linear speedup, especially for high-dimensional sparse datasets.


Create Your Own Sophisticated Model with Neural Networks

@machinelearnbot

Scikit-learn has evolved as a robust library for Machine Learning applications in Python with support for a wide range of Supervised and Unsupervised Learning Algorithms. With this course you will learn the Decision Tree algorithms and Ensemble Models to build Random Forest, Regression Analysis. You will focus on Decision Trees and Ensemble Algorithms. Moving forward, you learn to use scikit-learn to classify text and Multiclass with scikit-learn. You will explore various algorithms for classification.