Decision Tree Learning
Opening the Black Box: Visualising Machine Learning Algorithms
These days machine learning is all the hype. Unfortunately, these algorithms are usually considered rather hard to interpret, leaving business stakeholders feeling queasy. I've seen analytics teams use these powerful tools to build exceptionally good models only to have them thrown in the scrap heap. People just didn't get them. And if they don't get them, they don't trust them.
Binarsity: a penalization for one-hot encoded features
Alaya, Mokhtar Z., Bussy, Simon, Gaïffas, Stéphane, Guilloux, Agathe
This paper deals with the problem of large-scale linear supervised learning in settings where a large number of continuous features are available. We propose to combine the well-known trick of one-hot encoding of continuous features with a new penalization called binarsity. In each group of binary features coming from the one-hot encoding of a single raw continuous feature, this penalization uses total-variation regularization together with an extra linear constraint to avoid collinearity within groups. Non-asymptotic oracle inequalities for generalized linear models are proposed, and numerical experiments illustrate the good performances of our approach on several datasets. It is also noteworthy that our method has a numerical complexity comparable to standard $\ell_1$ penalization.
Relief-Based Feature Selection: Introduction and Review
Urbanowicz, Ryan J., Meeker, Melissa, LaCava, William, Olson, Randal S., Moore, Jason H.
Feature selection plays a critical role in data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that strike an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
ForestHash: Semantic Hashing With Shallow Random Forests and Tiny Convolutional Networks
Qiu, Qiang, Lezama, Jose, Bronstein, Alex, Sapiro, Guillermo
Hash codes are efficient data representations for coping with the ever growing amounts of data. In this paper, we introduce a random forest semantic hashing scheme that embeds tiny convolutional neural networks (CNN) into shallow random forests, with near-optimal information-theoretic code aggregation among trees. We start with a simple hashing scheme, where random trees in a forest act as hashing functions by setting `1' for the visited tree leaf, and `0' for the rest. We show that traditional random forests fail to generate hashes that preserve the underlying similarity between the trees, rendering the random forests approach to hashing challenging. To address this, we propose to first randomly group arriving classes at each tree split node into two groups, obtaining a significantly simplified two-class classification problem, which can be handled using a light-weight CNN weak learner. Such random class grouping scheme enables code uniqueness by enforcing each class to share its code with different classes in different trees. A non-conventional low-rank loss is further adopted for the CNN weak learners to encourage code consistency by minimizing intra-class variations and maximizing inter-class distance for the two random class groups. Finally, we introduce an information-theoretic approach for aggregating codes of individual trees into a single hash code, producing a near-optimal unique hash for each class. The proposed approach significantly outperforms state-of-the-art hashing methods for image retrieval tasks on large-scale public datasets, while performing at the level of other state-of-the-art image classification techniques while utilizing a more compact and efficient scalable representation. This work proposes a principled and robust procedure to train and deploy in parallel an ensemble of light-weight CNNs, instead of simply going deeper.
Using AI to Detect Electricity Theft - Rozee
Industrial NTL detection systems are still largely based on expert knowledge when deciding whether to carry out costly on-site inspections of customers. Electricity providers are reluctant to move to large-scale deployments of machine learning systems as the latter may suggest a large number of unnecessary inspections. Therefore, electricity providers want to understand why a specific customer was predicted to cause electricity theft or not. As a consequence, the models used should be interpretable, for example by using decision tree models rather than black box-like models such as deep learning. We have also recently proposed a method for visualizing prediction results at various granularity levels in a spatial hologram.
Deep learning for activity recognition
Human activity recognition (HAR) plays an important role in people's daily life by learning and identifying high-level knowledge about human activity from raw sensor inputs. Conventional pattern recognition approaches have made tremendous progress on HAR tasks by adopting machine learning algorithms such as decision tree, random forest or support vector machine, but the fast development and advancement of deep learning have overpass the accuracy of traditional machine learning results. This seminar is focused on Deep learning applied to HAR using wearable sensors. Current architectures used and how to implement them for achieving good results will be explained. Limitations and new challenges will be also discussed.
Tree-Structured Boosting: Connections Between Gradient Boosted Stumps and Full Decision Trees
Luna, José Marcio, Eaton, Eric, Ungar, Lyle H., Diffenderfer, Eric, Jensen, Shane T., Gennatas, Efstathios D., Wirth, Mateo, Simone, Charles B. II, Solberg, Timothy D., Valdes, Gilmer
Classification And Regression Tree (CART) analysis Breiman et al. [1984] is a well-established statistical learning technique, which has been adopted by numerous other fields for its model interpretability, scalability to large data sets, and connection to rule-based decision making Loh [2014]. CART builds a model by recursively partitioning the instance space, labeling each partition with either a predicted category (in the case of classification) or real-value (in the case of regression). Despite their widespread use, CART models often have lower predictive performance than other statistical learning models, such as kernel methods and ensemble techniques Caruana and Niculescu-Mizil [2006]. Among the latter, boosting methods were developed as a means to train an ensemble of weak learners (often CART models) iteratively into a high-performance predictive model, albeit with a loss of model interpretability. In particular, gradient boosting methods Friedman [2001] focus on iteratively optimizing an ensemble's prediction to increasingly match the labeled training data. Historically these two categories of approaches, CART and gradient boosting, have been studied separately, connected primarily through CART models being used as the weak learners in boosting. This paper investigates a deeper and surprising connection between full interaction models like CART and additive models like gradient boosting, showing that the resulting models exist upon a spectrum. In particular, this paper includes the following contributions: - We introduce tree-structured boosting (TSB) as a new mechanism for creating a hierarchical ensemble model that recursively partitions the instance space, forming a perfect binary tree of weak learners. Each path from the root node to a leaf represents the outcome of a gradient boosted stumps (GBS) ensemble for a particular partition of the instance space.
What is a decision tree and why should my chatbot use it?
The most effective way to discover the intent behind your customer's questions and provide the right answer is by using a decision tree. What are they and how do they work? When it comes to chatbots, businesses want to know one thing. The million dollar question for a market which will be worth billions within a few years is – can my virtual agent answer my customers' questions? Assuming your chatbot has robust natural language processing (NLP technology), the most effective way to do this is through decision trees. In the context of chatbots, a decision tree essentially helps them find the exact answer to your question.
ABC random forests for Bayesian parameter inference
Raynal, Louis, Marin, Jean-Michel, Pudlo, Pierre, Ribatet, Mathieu, Robert, Christian P., Estoup, Arnaud
This preprint has been reviewed and recommended by Peer Community In Evolutionary Biology (http://dx.doi.org/10.24072/pci.evolbiol.100036). Approximate Bayesian computation (ABC) has grown into a standard methodology that manages Bayesian inference for models associated with intractable likelihood functions. Most ABC implementations require the preliminary selection of a vector of informative statistics summarizing raw data. Furthermore, in almost all existing implementations, the tolerance level that separates acceptance from rejection of simulated parameter values needs to be calibrated. We propose to conduct likelihood-free Bayesian inferences about parameters with no prior selection of the relevant components of the summary statistics and bypassing the derivation of the associated tolerance level. The approach relies on the random forest methodology of Breiman (2001) applied in a (non parametric) regression setting. We advocate the derivation of a new random forest for each component of the parameter vector of interest. When compared with earlier ABC solutions, this method offers significant gains in terms of robustness to the choice of the summary statistics, does not depend on any type of tolerance level, and is a good trade-off in term of quality of point estimator precision and credible interval estimations for a given computing time. We illustrate the performance of our methodological proposal and compare it with earlier ABC methods on a Normal toy example and a population genetics example dealing with human population evolution. All methods designed here have been incorporated in the R package abcrf (version 1.7) available on CRAN.
Introduction to Machine Learning and Decision Trees - DATAVERSITY
Click to learn more about author Alejandro Correa Bahnsen. Almost everyone has heard the words "Machine Learning", but most people don't fully understand what they mean. Machine Learning isn't a single formula that is simply applied to a problem. There are many algorithms to choose from, each of which can be used to achieve different goals. This is the first in a series of articles that will introduce Machine Learning algorithms to help you understand how they work, and when to use each one.