Goto

Collaborating Authors

 Decision Tree Learning


Lossless (and Lossy) Compression of Random Forests

arXiv.org Machine Learning

Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.


A Text Classification Application: Poet Detection from Poetry

arXiv.org Machine Learning

With the widespread use of the internet, the size of the text data increases day by day. Poems can be given as an example of the growing text. In this study, we aim to classify poetry according to poet. Firstly, data set consisting of three different poetry of poets written in English have been constructed. Then, text categorization techniques are implemented on it. Chi-Square technique are used for feature selection. In addition, five different classification algorithms are tried. These algorithms are Sequential minimal optimization, Naive Bayes, C4.5 decision tree, Random Forest and k-nearest neighbors. Although each classifier showed very different results, over the 70% classification success rate was taken by sequential minimal optimization technique.


Machine Learning and Credit Risk Analytics

#artificialintelligence

In the last few years, new statistical algorithms have become very popular. Traditional scorecards were based on one decision tree, or "logistic regression." The newer algorithms represent a combination of hundreds of decision trees instead of one single tree. These algorithms also provide much more accurate predictions compared to traditional methods. The current hype around machine learning methods typically revolves around these algorithms in particular: random forests, XgBoost, and deep learning.


What is a Decision Tree in Machine Learning? โ€“ Hacker Noon

#artificialintelligence

Decision trees, as the name implies, are trees of decisions. You have a question, usually a yes or no (binary; 2 options) question with two branches (yes and no) leading out of the tree. You can get more options than 2, but for this article, we're only using 2 options. Trees are weird in computer science. Instead of growing from a root upwards, they grow downwards.


A Statistical Approach to Adult Census Income Level Prediction

arXiv.org Machine Learning

The prominent inequality of wealth and income is a huge concern especially in the United States. The likelihood of diminishing poverty is one valid reason to reduce the world's surging level of economic inequality. The principle of universal moral equality ensures sustainable development and improve the economic stability of a nation. Governments in different countries have been trying their best to address this problem and provide an optimal solution. This study aims to show the usage of machine learning and data mining techniques in providing a solution to the income equality problem. The UCI Adult Dataset has been used for the purpose. Classification has been done to predict whether a person's yearly income in US falls in the income category of either greater than 50K Dollars or less equal to 50K Dollars category based on a certain set of attributes. The Gradient Boosting Classifier Model was deployed which clocked the highest accuracy of 88.16%, eventually breaking the benchmark accuracy of existing works.


On PAC-Bayesian Bounds for Random Forests

arXiv.org Machine Learning

Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the effect of averaging out of errors of individual classifiers when taking the majority vote. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach based on PAC-Bayesian C-bounds takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set coming at the cost of a smaller training set gave better performance guarantees, but worse performance in most experiments.


Comparative Evaluation of Tree-Based Ensemble Algorithms for Short-Term Travel Time Prediction

arXiv.org Artificial Intelligence

Disseminating accurate travel time information to road users helps achieve traffic equilibrium and reduce traffic congestion. The deployment of Connected Vehicles technology will provide unique opportunities for the implementation of travel time prediction models. The aim of this study is twofold: (1) estimate travel times in the freeway network at five-minute intervals using Basic Safety Messages (BSM); (2) develop an eXtreme Gradient Boosting (XGB) model for short-term travel time prediction on freeways. The XGB tree-based ensemble prediction model is evaluated against common tree-based ensemble algorithms and the evaluations are performed at five-minute intervals over a 30-minute horizon. BSMs generated by the Safety Pilot Model Deployment conducted in Ann Arbor, Michigan, were used. Nearly two billion messages were processed for providing travel time estimates for the entire freeway network. A Combination of grid search and five-fold cross-validation techniques using the travel time estimates were used for developing the prediction models and tuning their parameters. About 9.6 km freeway stretch was used for evaluating the XGB together with the most common tree-based ensemble algorithms. The results show that XGB is superior to all other algorithms, followed by the Gradient Boosting. XGB travel time predictions were accurate and consistent with variations during peak periods, with mean absolute percentage error in prediction about 5.9% and 7.8% for 5-minute and 30-minute horizons, respectively. Additionally, through applying the developed models to another 4.7 km stretch along the eastbound segment of M-14, the XGB demonstrated its considerable advantages in travel time prediction during congested and uncongested conditions.


Assessing the Stability of Interpretable Models

arXiv.org Artificial Intelligence

Interpretable classification models are built with the purpose of providing a comprehensible description of the decision logic to an external oversight agent. When considered in isolation, a decision tree, a set of classification rules, or a linear model, are widely recognized as human-interpretable. However, such models are generated as part of a larger analytical process, which, in particular, comprises data collection and filtering. Selection bias in data collection or in data pre-processing may affect the model learned. Although model induction algorithms are designed to learn to generalize, they pursue optimization of predictive accuracy. It remains unclear how interpretability is instead impacted. We conduct an experimental analysis to investigate whether interpretable models are able to cope with data selection bias as far as interpretability is concerned.


Machine learning predicts World Cup winner

#artificialintelligence

The random-forest technique has emerged in recent years as a powerful way to analyze large data sets while avoiding some of the pitfalls of other data-mining methods. It is based on the idea that some future event can be determined by a decision tree in which an outcome is calculated at each branch by reference to a set of training data. However, decision trees suffer from a well-known problem. In the latter stages of the branching process, decisions can become severely distorted by training data that is sparse and prone to huge variation at this kind of resolution, a problem known as overfitting. The random-forest approach is different.


Gradient Boosting Decision trees: XGBoost vs LightGBM

#artificialintelligence

Gradient boosting decision trees is the state of the art for structured data problems. Two modern algorithms that make gradient boosted tree models are XGBoost and LightGBM. In this article I'll summarize their introductory papers for each algorithm's approach. Gradient Boosting Decision Trees (GBDT) are currently the best techniques for building predictive models from structured data. Unlike models for analyzing images (for that you want to use a deep learning model), structured data problems can be solved very well with a lot of decision trees.