Goto

Collaborating Authors

 Ensemble Learning


Fr\'echet random forests

arXiv.org Machine Learning

Random forests are a statistical learning method widely used in many areas of scientific research essentially for its ability to learn complex relationship between input and output variables and also its capacity to handle high-dimensional data. However, data are increasingly complex with repeated measures of omics, images leading to shapes, curves... Random forests method is not specifically tailored for them. In this paper, we introduce Fr\'echet trees and Fr\'echet random forests, which allow to manage data for which input and output variables take values in general metric spaces (which can be unordered). To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. Then, random forests out-of-bag error and variable importance score are naturally adapted. Finally, the method is studied in the special case of regression on curve shapes, both within a simulation study and a real dataset from an HIV vaccine trial.


Inverse boosting pruning trees for depression detection on Twitter

arXiv.org Machine Learning

Depression is one of the most common mental health disorders, and a large number of depression people commit suicide each year. Potential depression sufferers do not consult psychological doctors because they feel ashamed or are unaware of any depression, which may result in severe delay of diagnosis and treatment. In the meantime, evidence shows that social media data provides valuable clues about physical and mental health conditions. In this paper, we argue that it is feasible to identify depression at an early stage by mining online social behaviours. Our approach, which is innovative to the practice of depression detection, does not rely on the extraction of numerous or complicated features to achieve accurate depression detection. Instead, we propose a novel classifier, namely, Inverse Boosting Pruning Trees (IBPT), which demonstrates a strong classification ability on a publicly accessible dataset with 7862 Twitter users. To comprehensively evaluate the classification capability of the IBPT, we use three real datasets from the UCI machine learning repository and the IBPT still obtains the best classification results against several state of the arts techniques. The results manifest that our proposed framework is promising for identifying social networks' users with depression.


Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting

arXiv.org Machine Learning

Abstract: Predicting traffic incident duration is a major challenge for many traffic centres around the world. Most research studies focus on predicting the incident duration on motorways rather than arterial roads, due to a high network complexity and lack of data. In this paper we propose a bi-level framework for predicting the accident duration on arterial road networks in Sydney, based on operational requirements of incident clearance target which is less than 45 minutes. Using incident baseline information, we first deploy a classification method using various ensemble tree models in order to predict whether a new incident will be cleared in less than 45min or not. If the incident was classified as short-term, then various regression models are developed for predicting the actual incident duration in minutes by incorporating various traffic flow features. After outlier removal and intensive model hyper-parameter tuning through randomized search and cross-validation, we show that the extreme gradient boost approach outperformed all models, including the gradient-boosted decision-trees by almost 53%. Finally, we perform a feature importance evaluation for incident duration prediction and show that the best prediction results are obtained when leveraging the real-time traffic flow in vicinity road sections to the reported accident location. Initial methods used to predict the incident duration were 1. Introduction Bayesian classifiers [5], discrete choice models (DCM) [6], probabilistic distribution analyses [7], and the hazard-based Traffic congestion is a major concern for many cities duration models (HBDM) [8].


The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial

arXiv.org Machine Learning

In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances where we make use of the Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples which are $K$-fold and leave-one-out cross validations. We briefly introduce generalized cross validation and then move on to regularization where we use the SURE again. We work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and it is explained as both an additive model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents from overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting.


Best-scored Random Forest Classification

arXiv.org Machine Learning

We propose an algorithm named best-scored random forest for binary classification problems. The terminology "best-scored" means to select the one with the best empirical performance out of a certain number of purely random tree candidates as each single tree in the forest. In this way, the resulting forest can be more accurate than the original purely random forest. From the theoretical perspective, within the framework of regularized empirical risk minimization penalized on the number of splits, we establish almost optimal convergence rates for the proposed best-scored random trees under certain conditions which can be extended to the best-scored random forest. In addition, we present a counterexample to illustrate that in order to ensure the consistency of the forest, every dimension must have the chance to be split. In the numerical experiments, for the sake of efficiency, we employ an adaptive random splitting criterion. Comparative experiments with other state-of-art classification methods demonstrate the accuracy of our best-scored random forest.


Asymptotic Distributions and Rates of Convergence for Random Forests and other Resampled Ensemble Learners

arXiv.org Machine Learning

Random forests remain among the most popular off-the-shelf supervised learning algorithms. Despite their well-documented empirical success, however, until recently, few theoretical results were available to describe their performance and behavior. In this work we push beyond recent work on consistency and asymptotic normality by establishing rates of convergence for random forests and other supervised learning ensembles. We develop the notion of generalized U-statistics and show that within this framework, random forest predictions remain asymptotically normal for larger subsample sizes than previously established. We also provide Berry-Esseen bounds in order to quantify the rate at which this convergence occurs, making explicit the roles of the subsample size and the number of trees in determining the distribution of random forest predictions.


HDI-Forest: Highest Density Interval Regression Forest

arXiv.org Machine Learning

By seeking the narrowest prediction intervals (PIs) that satisfy the specified coverage probability requirements, the recently proposed quality-based PI learning principle can extract high-quality PIs that better summarize the predictive certainty in regression tasks, and has been widely applied to solve many practical problems. Currently, the state-of-the-art quality-based PI estimation methods are based on deep neural networks or linear models. In this paper, we propose Highest Density Interval Regression Forest (HDI-Forest), a novel quality-based PI estimation method that is instead based on Random Forest. HDI-Forest does not require additional model training, and directly reuses the trees learned in a standard Random Forest model. By utilizing the special properties of Random Forest, HDI-Forest could efficiently and more directly optimize the PI quality metrics. Extensive experiments on benchmark datasets show that HDI-Forest significantly outperforms previous approaches, reducing the average PI width by over 30\% while achieving the same or better coverage probability.


sabiha90/Random-Forest-Explainability-Pipeline

#artificialintelligence

This toolkit serves to execute RFEX 2.0 "pipeline" e.g. a set of steps to produce information which comprises RFEX 2.0 summary namely information to enhance explainability of Random Forest classifier. It comes with the synthetically generated test database which helps to demonstrate how RFEX 2.0 works. Wth this toolkit users can also use their own data to generate RFEX 2.0 summary. Background of the RFEX 2.0 method, as well as the description and access to the synthetic test database convenient to test and demonstrate can be found in TR 18.01 at cs.sfsu.edu Users are strongly advised to read the above report before using this toolkit.



Enterprise AI: Diving into Machine Learning

#artificialintelligence

Data in the real world, of course, isn't as simple as it is in the previous example. There are always complexities and nuances to data. To stick with our housing market example, the value of houses might also be influenced by dwelling type, lot size, recent upgrades, proximity to a neighborhood park and intangible variables like curbside appeal. And, in the real world, houses wouldn't all be in the same neighborhood, so your machine learning model must also consider the ZIP code for the property. To consider this wider range of variables, we need to dig deeper into the data scientist's toolbox and pull out some more sophisticated machine learning methods, including random forests and gradient boosting.