Goto

Collaborating Authors

 Regression


Random Forests for Big Data

arXiv.org Machine Learning

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data but they also often include online data and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method allowing to consider in a single and versatile framework regression problems, as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment five variants on two massive datasets (15 and 120 millions of observations), a simulated one as well as real world data. One variant relies on subsampling while three others are related to parallel implementations of random forests and involve either various adaptations of bootstrap to Big Data or to "divide-and-conquer" approaches. The fifth variant relates on online learning of random forests. These numerical experiments lead to highlight the relative performance of the different variants, as well as some of their limitations.


Targeting Bayes factors with direct-path non-equilibrium thermodynamic integration

arXiv.org Machine Learning

Thermodynamic integration (TI) for computing marginal likelihoods is based on an inverse annealing path from the prior to the posterior distribution. In many cases, the resulting estimator suffers from high variability, which particularly stems from the prior regime. When comparing complex models with differences in a comparatively small number of parameters, intrinsic errors from sampling fluctuations may outweigh the differences in the log marginal likelihood estimates. In the present article, we propose a thermodynamic integration scheme that directly targets the log Bayes factor. The method is based on a modified annealing path between the posterior distributions of the two models compared, which systematically avoids the high variance prior regime. We combine this scheme with the concept of non-equilibrium TI to minimise discretisation errors from numerical integration. Results obtained on Bayesian regression models applied to standard benchmark data, and a complex hierarchical model applied to biopathway inference, demonstrate a significant reduction in estimator variance over state-of-the-art TI methods.


Loss function for Logistic Regression

#artificialintelligence

If we are doing a binary classification using logistic regression, we often use the cross entropy function as our loss function. Question: However, if we are doing linear regression, we often use squared-error as our loss function. Are there any specific reasons for using the cross entropy function instead of using squared-error or the classification error in logistic regression? I read somewhere that, if we use squared-error for binary classification, the resulting loss function would be non-convex. Is this the only reason reason, or is there any other deeper reason which I am missing?


Intuitive Machine Learning : Gradient Descent Simplified โ€“ YOU CANalytics

#artificialintelligence

They learn the same way as humans. Humans learn from experience and so do machines. For machines, experience is in the form of data. Machines use powerful algorithms to make sense of the data. They identify underlining patterns within the data to learn things about the world.


Ideas on interpreting machine learning

#artificialintelligence

For more on advances in machine learning, prediction, and technology, check out the Data science and advanced analytics sessions at Strata Hadoop World London, May 22-25, 2017. Early price ends April 7. You've probably heard by now that machine learning algorithms can use big data to predict whether a donor will give to a charity, whether an infant in a NICU will develop sepsis, whether a customer will respond to an ad, and on and on. Machine learning can even drive cars and predict elections. I believe it can, but these recent high-profile hiccups should leave everyone who works with data (big or not) and machine learning algorithms asking themselves some very hard questions: do I understand my data? Do I understand the model and answers my machine learning algorithm is giving me? And do I trust these answers? Unfortunately, the complexity that bestows the extraordinary predictive abilities on machine learning algorithms also makes the answers the algorithms produce hard to ...


Selective Harvesting over Networks

arXiv.org Machine Learning

Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights under a query budget. However, in most networks, nodes, topology and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario suffers from a tunnel vision effect: without recourse to independent sampling, the urge to query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of AS methods and standard classifiers. We find that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as an ensemble, but switching between classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried next. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. From these observations we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, matching or exceeding the performance of competing methods on seven real network datasets in our evaluation.


Tuning Free Orthogonal Matching Pursuit

arXiv.org Machine Learning

Orthogonal matching pursuit (OMP) is a widely used compressive sensing (CS) algorithm for recovering sparse signals in noisy linear regression models. The performance of OMP depends on its stopping criteria (SC). SC for OMP discussed in literature typically assumes knowledge of either the sparsity of the signal to be estimated $k_0$ or noise variance $\sigma^2$, both of which are unavailable in many practical applications. In this article we develop a modified version of OMP called tuning free OMP or TF-OMP which does not require a SC. TF-OMP is proved to accomplish successful sparse recovery under the usual assumptions on restricted isometry constants (RIC) and mutual coherence of design matrix. TF-OMP is numerically shown to deliver a highly competitive performance in comparison with OMP having \textit{a priori} knowledge of $k_0$ or $\sigma^2$. Greedy algorithm for robust de-noising (GARD) is an OMP like algorithm proposed for efficient estimation in classical overdetermined linear regression models corrupted by sparse outliers. However, GARD requires the knowledge of inlier noise variance which is difficult to estimate. We also produce a tuning free algorithm (TF-GARD) for efficient estimation in the presence of sparse outliers by extending the operating principle of TF-OMP to GARD. TF-GARD is numerically shown to achieve a performance comparable to that of the existing implementation of GARD.


Online Learning for Distribution-Free Prediction

arXiv.org Machine Learning

We develop an online learning method for prediction, which is important in problems with large and/or streaming data sets. We formulate the learning approach using a covariance-fitting methodology, and show that the resulting predictor has desirable computational and distribution-free properties: It is implemented online with a runtime that scales linearly in the number of samples; has a constant memory requirement; avoids local minima problems; and prunes away redundant feature dimensions without relying on restrictive assumptions on the data distribution. In conjunction with the split conformal approach, it also produces distribution-free prediction confidence intervals in a computationally efficient manner. The method is demonstrated on both real and synthetic datasets.


mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions

arXiv.org Machine Learning

We present mlrMBO, a flexible and comprehensive R toolbox for model-based optimization (MBO), also known as Bayesian optimization, which addresses the problem of expensive black-box optimization by approximating the given objective function through a surrogate regression model. It is designed for both single- and multi-objective optimization with mixed continuous, categorical and conditional parameters. Additional features include multi-point batch proposal, parallelization, visualization, logging and error-handling. mlrMBO is implemented in a modular fashion, such that single components can be easily replaced or adapted by the user for specific use cases, e.g., any regression learner from the mlr toolbox for machine learning can be used, and infill criteria and infill optimizers are easily exchangeable. We empirically demonstrate that mlrMBO provides state-of-the-art performance by comparing it on different benchmark scenarios against a wide range of other optimizers, including DiceOptim, rBayesianOptimization, SPOT, SMAC, Spearmint, and Hyperopt.


"Multicollinearity" a Problem or an Opportunity?

@machinelearnbot

Multicollinearity (Collinearity) is not a new term especially when dealing with multiple regression models. This phenomenon of relationship in between one response variable with the set of predictor variables also include models like classification and regression trees as well as neural networks. Collinearity is infamously famous for inflating the variance of at least one estimated regression coefficient, which can cause the model to predict erroneously and in a business setup it can have an unrepairable consequence. So, the next logical question is how to identify collinearity? In this article we will only talk about the Variance Inflation Factor(VIF) identification technique which is very useful for identify high multicollinearity among the predictor variables when working with MLR (Multiple Linear Regression Models).