 Decision Tree Learning


Variable Selection with Random Survival Forest and Bayesian Additive Regression Tree for Survival Data

arXiv.org Machine Learning

In this paper we utilize a survival analysis methodology incorporating Bayesian additive regression trees to account for nonlinear and additive covariate effects. We compare the performance of Bayesian additive regression trees, Cox proportional hazards and random survival forest models for censored survival data, using simulation studies and a survival analysis of breast cancer based on the U.S. SEER database for the year 2005. In the simulation studies, we compare the three models across varying sample sizes and censoring rates on the basis of bias and prediction accuracy. In the survival analysis of breast cancer, we retrospectively analyze a subset of 1500 patients with invasive ductal carcinoma, a common form of breast cancer mostly affecting older women. The predictive potential of the three models is then compared using some widely used performance assessment measures from the survival literature.


Free Book: A Comprehensive Guide to Machine Learning (Berkeley University)

#artificialintelligence

This is not the same book as The Math of Machine Learning, also published by the same department at Berkeley, in 2018, and also authored by Garret Thomas. I hope they will add sections on Ensemble Methods (combining multiple techniques), cross-validation, and feature selection, and then it will cover pretty much everything that the beginner should know. Other popular free books, all written by top experts in their fields, include Foundations of Data Science published by Microsoft's ML Research Lab in 2018, and Statistics: New Foundations, Toolbox, and Machine Learning Recipes published by Data Science Central in 2019.


Silas: High Performance, Explainable and Verifiable Machine Learning

arXiv.org Machine Learning

Hadrien Bride, Zhe Hou (Griffith University, Nathan, Brisbane, Australia); Jie Dong (Dependable Intelligence Pty Ltd, Brisbane, Australia); Jin Song Dong (National University of Singapore, Singapore); Ali Mirjalili (Griffith University, Nathan, Brisbane, Australia). Preprint submitted to Elsevier, October 4, 2019. arXiv:1910.01382v1.
Abstract: This paper introduces a new classification tool named Silas, which is built to provide a more transparent and dependable data analytics service. A focus of Silas is on providing a formal foundation of decision trees in order to support logical analysis and verification of learned prediction models. This paper describes the distinct features of Silas: the Model Audit module formally verifies the prediction model against user specifications, the Enforcement Learning module trains prediction models that are guaranteed correct, and the Model Insight and Prediction Insight modules reason about the prediction model and explain the decision-making of predictions. We also discuss implementation details, ranging from programming paradigm to memory management, that help achieve high-performance computation.
1. Introduction. Machine learning has enjoyed great success in many research areas and industries, including entertainment [1], self-driving cars [2], banking [3], medical diagnosis [4], and shopping [5], among many others. However, the wide adoption of machine learning [...] The ramifications of the black-box approach are multifold. First, it may lead to unexpected results that are only observable after the deployment of the algorithm. For instance, Amazon's Alexa offered porn to a child [6], a self-driving car had a deadly accident [7], etc. Some of these accidents result in lawsuits or even lost lives, the cost of which is immeasurable. Second, it prevents adoption in some applications and industries where an explanation is mandatory or certain specifications must be satisfied. For example, in some countries, it is required by law to give a reason why a loan application is rejected. In recent years, eXplainable AI (XAI) has been gaining attention, and there is a surge of interest in studying how prediction models work and how to provide formal guarantees for the models. A common theme in this space is to use statistical methods to analyse prediction models.


Using Machine Learning in Venture Capital

#artificialintelligence

I have already (partially) reviewed previous studies in which data have been shown to help identify signals that are relevant to assessing the success potential of a startup. Even though the list is quite comprehensive, each study usually tends to look at one single factor and a couple of different success scenarios (namely, acquisition and IPO). In our work, we tried to take a more holistic view and use over 120,000 companies to spot signals not only for acquisitions and IPOs but also to compute the probability of raising a subsequent round of funding or shutting the startup down. In the same fashion as backtesting, we created a time-aware approach: we analyzed companies that were no more than four years old by 2015 and tried to predict their success in the following three years. We also used more than a hundred variables as possible explanatory indicators of success, as well as five different models: Support Vector Machines (SVM); Decision Trees (DT); Random Forests (RF); Extremely Randomized Trees (ERT); and Gradient Tree Boosting (GTB).


Lead Data Scientist ai-jobs.net

#artificialintelligence

CenturyLink (NYSE: CTL) is the second largest U.S. communications provider to global enterprise customers. With customers in more than 60 countries and an intense focus on the customer experience, CenturyLink strives to be the world's best networking company by solving customers' increased demand for reliable and secure connections. The company also serves as its customers' trusted partner, helping them manage increased network and IT complexity and providing managed network and cyber security solutions that help protect their business. Key member of the Ops Transformation team who will support our Field Operations and Service Assurance teams with insights gained from analyzing company data. The ideal candidate must have strong experience using a variety of data mining/data analysis methods, using a variety of data tools, building and implementing models, using/creating algorithms and delivering solutions to drive KPI improvements.


Affordable Uplift: Supervised Randomization in Controlled Experiments

arXiv.org Machine Learning

Customer scoring models are the core of scalable direct marketing. Uplift models provide an estimate of the incremental benefit from a treatment that is used for operational decision-making. Training and monitoring of uplift models require experimental data. However, the collection of data under randomized treatment assignment is costly, since random targeting deviates from an established targeting policy. To increase the cost-efficiency of experimentation and facilitate frequent data collection and model training, we introduce supervised randomization. It is a novel approach that integrates existing scoring models into randomized trials to target relevant customers, while ensuring consistent estimates of treatment effects through correction for active sample selection. An empirical Monte Carlo study shows that data collection under supervised randomization is cost-efficient, while downstream uplift models perform competitively.
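The idea of correcting for score-dependent treatment assignment can be illustrated with inverse propensity weighting. The sketch below is an assumption about one standard way to make such a correction, not the estimator from the paper; the functions `propensity`, `simulate`, `naive_effect` and `ipw_effect`, the targeting rule, and the simulated outcome model are all hypothetical.

```python
import random

def propensity(score):
    # Hypothetical targeting rule: customers with a higher score are treated
    # more often, but everyone keeps a nonzero chance of landing in either arm.
    return 0.2 + 0.6 * score

def simulate(n, true_effect, seed=0):
    """Simulate (propensity, treated, outcome) rows under score-based targeting."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        score = rng.random()  # stand-in for an existing scoring model, in [0, 1]
        p = propensity(score)
        treated = rng.random() < p
        # The baseline outcome rises with the score, so treated customers
        # differ from controls even without any treatment effect.
        outcome = 2.0 * score + (true_effect if treated else 0.0) + rng.gauss(0, 0.1)
        rows.append((p, treated, outcome))
    return rows

def naive_effect(rows):
    """Difference in raw means; biased because assignment depends on the score."""
    treated = [y for p, w, y in rows if w]
    control = [y for p, w, y in rows if not w]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ipw_effect(rows):
    """Normalized inverse-propensity-weighted difference in means."""
    t_num = sum(y / p for p, w, y in rows if w)
    t_den = sum(1 / p for p, w, y in rows if w)
    c_num = sum(y / (1 - p) for p, w, y in rows if not w)
    c_den = sum(1 / (1 - p) for p, w, y in rows if not w)
    return t_num / t_den - c_num / c_den

rows = simulate(50_000, true_effect=1.0)
print("naive:", round(naive_effect(rows), 3))
print("ipw:  ", round(ipw_effect(rows), 3))
```

Because the assignment probabilities are known by design, reweighting each observation by the inverse of its probability recovers an estimate close to the simulated effect, while the unweighted comparison overstates it.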


Locally Constant Networks

arXiv.org Machine Learning

Abstract: We show how neural models can be used to realize piece-wise constant functions such as decision trees. Our approach builds on ReLU networks that are piece-wise linear and hence their associated gradients with respect to the inputs are locally constant. We formally establish the equivalence between the classes of locally constant networks and decision trees. Moreover, we highlight several advantageous properties of locally constant networks, including how they realize decision trees with parameter sharing across branching / leaves. Indeed, only M neurons suffice to implicitly model an oblique decision tree with 2^M leaf nodes. The neural representation also enables us to adopt many tools developed for deep networks (e.g., DropConnect (Wan et al., 2013)) while implicitly training decision trees. We demonstrate that our method outperforms alternative techniques for training oblique decision trees in the context of molecular property classification and regression tasks. 1 Introduction: Decision trees (Breiman et al., 1984) employ a series of simple decision nodes, arranged in a tree, to transparently capture how the predicted outcome is reached. Functionally, such tree-based models, including random forest (Breiman, 2001), realize piece-wise constant functions. Beyond their status as de facto interpretable models, they have also persisted as the state-of-the-art models in some tabular (Sandulescu & Chiru, 2016) and chemical datasets (Wu et al., 2018). Deep neural models, in contrast, are highly flexible and continuous, and demonstrably effective in practice, though they lack transparency. We merge these two contrasting views by introducing a new family of neural models that implicitly learn and represent oblique decision trees. Prior work has attempted to generalize classic decision trees by extending coordinate-wise cuts to be weighted, linear classifications.
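The claim that a ReLU network has locally constant input gradients can be checked on a toy example. The sketch below is a simplified illustration with a single hidden layer of two neurons, not the architecture from the paper; the weights `W`, `b`, `v` and the function names are hypothetical.

```python
# Toy network f(x) = sum_i v[i] * relu(W[i] . x + b[i]) with two hidden neurons.
W = [[1.0, 0.0], [0.0, 1.0]]   # one weight vector per hidden neuron
b = [0.0, 0.0]
v = [1.0, -2.0]                # output weights

def activation_pattern(x):
    """Return a tuple of 0/1 flags saying which hidden neurons are active at x."""
    return tuple(
        int(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b_i > 0)
        for w, b_i in zip(W, b)
    )

def input_gradient(x):
    """Gradient of f with respect to x; it depends on x only via the pattern."""
    grad = [0.0] * len(x)
    for active, w, v_i in zip(activation_pattern(x), W, v):
        if active:
            for j, w_j in enumerate(w):
                grad[j] += v_i * w_j
    return tuple(grad)

def distinct_gradients(points):
    """Collect the set of gradients observed over a collection of inputs."""
    return {input_gradient(p) for p in points}

grid = [(x1, x2) for x1 in (-1.5, -0.5, 0.5, 1.5) for x2 in (-1.5, -0.5, 0.5, 1.5)]
print(sorted(distinct_gradients(grid)))
```

With M = 2 neurons the grid yields at most 2^2 = 4 distinct gradients, one per activation pattern, so the gradient itself behaves like the output of a small oblique decision tree whose splits are the neurons' hyperplanes.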


Residual Networks Behave Like Boosting Algorithms

arXiv.org Machine Learning

We show that a Residual Network (ResNet) is equivalent to boosting feature representation, without any modification to the underlying ResNet training algorithm. A regret bound based on Online Gradient Boosting theory is proved and suggests that ResNet could achieve Online Gradient Boosting regret bounds through neural network architectural changes with the addition of a shrinkage parameter in the identity skip-connections and using residual modules with max-norm bounds. Through this relation between ResNet and Online Boosting, novel feature representation boosting algorithms can be constructed based on altering residual modules. We demonstrate this through proposing decision tree residual modules to construct a new boosted decision tree algorithm and demonstrating generalization error bounds for both approaches; relaxing constraints within the BoostResNet algorithm to allow it to be trained in an out-of-core manner. We evaluate convolution ResNet with and without shrinkage modifications to demonstrate its efficacy, and demonstrate that our online boosted decision tree algorithm is comparable to state-of-the-art offline boosted decision tree algorithms without the drawback of offline approaches.


Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

arXiv.org Machine Learning

Linear algebra algorithms are used widely in a variety of domains, e.g., machine learning, numerical physics and video game graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop assigned to a thread as one task. In this paper, we study the application of supervised learning models to predict the chunk-size which yields maximum performance on multiple parallel linear algebra operations using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and test sets by measuring the performance of the application with different chunk-sizes for multiple linear algebra operations: vector addition, matrix-vector multiplication, matrix-matrix addition and matrix-matrix multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision-tree-based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations.
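The step of turning performance measurements into supervised training examples might look like the following minimal sketch. It assumes each run records a throughput figure per candidate chunk-size; the function names, the feature tuple, and the numbers are made up for illustration and do not come from the paper.

```python
def best_chunk_size(measurements):
    """Return the chunk-size with the highest measured performance."""
    return max(measurements, key=measurements.get)

def build_training_set(runs):
    """Map (operation, problem size) features to the best-performing chunk-size."""
    return [(features, best_chunk_size(perf)) for features, perf in runs.items()]

# Illustrative, invented measurements: {features: {chunk_size: throughput}}.
runs = {
    ("vector_addition", 1_000): {16: 1.0, 64: 3.0, 256: 2.0},
    ("matrix_vector_multiplication", 500): {16: 2.5, 64: 2.0, 256: 1.0},
}
for features, label in build_training_set(runs):
    print(features, "->", label)
```

The resulting (features, label) pairs are the kind of data a classifier such as a decision tree could then be trained on.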


What Data Scientists should know about Multi-output and Multi-label Training

#artificialintelligence

Multi-output learning is multivariate in nature, and the multiple outputs may have complex interactions that are handled by structured inference. The output values have diverse data types, depending on the type of ML problem. For example, in multi-output pattern recognition problems, each instance in the dataset has two or more output values (nominal or real-valued), i.e., the output value is a vector rather than a scalar. In this blog, we discuss a Mixed/Multi-target RandomForest model that supports multi-output problems with multiple classification outputs, multiple regression outputs, as well as arbitrary joint classification-regression outputs. Further, the algorithm provides support for mixed-task multi-task learning, i.e., it is possible to train the model on any number of classification tasks and regression tasks simultaneously. The Random Forest predictor lets each individual ensemble member vote for the most probable output according to its learned decision rule.
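How an ensemble might combine per-member predictions for a mixed output vector can be sketched as follows. This is an illustrative assumption, not the model described in the blog: classification outputs are combined by majority vote and regression outputs by averaging, and the function `aggregate` and the output kinds `"class"` and `"reg"` are hypothetical names.

```python
from collections import Counter
from statistics import mean

def aggregate(member_predictions, output_kinds):
    """Combine one output vector per ensemble member into a single prediction.

    member_predictions: list of tuples, one tuple per ensemble member.
    output_kinds: "class" or "reg" for each position in the output vector.
    """
    combined = []
    for k, kind in enumerate(output_kinds):
        column = [pred[k] for pred in member_predictions]
        if kind == "class":
            # Majority vote over the members' predicted labels.
            combined.append(Counter(column).most_common(1)[0][0])
        else:
            # Average the members' real-valued predictions (an assumption).
            combined.append(mean(column))
    return combined

votes = [("spam", 1.0), ("ham", 2.0), ("spam", 3.0)]
print(aggregate(votes, ["class", "reg"]))
```

Each position of the output vector is handled according to its own type, which is what allows one ensemble to serve joint classification-regression targets.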