arXiv.org Machine Learning
Kernel Regression by Mode Calculation of the Conditional Probability Distribution
The most direct way to express arbitrary dependencies in a dataset is to estimate the joint distribution and then apply the argmax function to obtain the mode of the corresponding conditional distribution. In practice this is difficult, because it requires global optimization of a complicated function: the joint distribution with the input variables held fixed. This article proposes a method for finding global maxima when the joint distribution is modeled by a kernel density estimate. Experiments show advantages and shortcomings of the resulting regression method in comparison with the standard Nadaraya-Watson regression technique, which approximates the optimum by the expectation value.
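A minimal sketch of the two estimators being compared, assuming Gaussian kernels, one-dimensional inputs and outputs, and a simple grid search standing in for the paper's global optimization:

    import numpy as np

    def gaussian_kernel(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    def nadaraya_watson(x0, X, Y, h):
        """Standard Nadaraya-Watson estimate: kernel-weighted mean of Y at x0."""
        w = gaussian_kernel((X - x0) / h)
        return np.sum(w * Y) / np.sum(w)

    def mode_regression(x0, X, Y, h, grid=None):
        """Mode of the kernel conditional density p(y | x0).
        A grid search stands in for the paper's global optimization."""
        if grid is None:
            grid = np.linspace(Y.min(), Y.max(), 500)
        wx = gaussian_kernel((X - x0) / h)              # weights from the input kernel
        dens = np.array([np.sum(wx * gaussian_kernel((y0 - Y) / h)) for y0 in grid])
        return grid[np.argmax(dens)]                    # argmax over y at fixed x0

    # toy example: Y is bimodal given X, so conditional mean and mode disagree
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, 400)
    Y = np.sign(rng.standard_normal(400)) + 0.1 * rng.standard_normal(400) + 0.2 * X
    print(nadaraya_watson(0.0, X, Y, h=0.2))   # near 0 (the conditional mean)
    print(mode_regression(0.0, X, Y, h=0.2))   # near +1 or -1 (a conditional mode)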
Random Forests: some methodological insights
Genuer, Robin, Poggi, Jean-Michel, Tuleau, Christine
This paper examines, from an experimental perspective, random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming known but sparse advice for using random forests and at proposing some complementary remarks, both for standard problems and for high-dimensional ones in which the number of variables hugely exceeds the sample size. The main contribution of this paper is twofold: to provide some insights about the behavior of the variable importance index based on random forests and, in addition, to investigate two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and tries to design a good prediction model. The strategy involves ranking the explanatory variables by the random forests importance score followed by a stepwise ascending variable introduction strategy.
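A rough illustration of the two-step strategy described above, using scikit-learn's random forest importance score; the forest sizes, cross-validation scoring, and stopping rule are simplified placeholders, not the procedures studied in the paper:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def rank_variables(X, y, n_estimators=500, random_state=0):
        """Rank variables by the random-forest importance score, most important first."""
        rf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
        rf.fit(X, y)
        return np.argsort(rf.feature_importances_)[::-1]

    def stepwise_selection(X, y, ranking, cv=5):
        """Introduce ranked variables one by one; keep the prefix with the best CV score."""
        best_score, best_k = -np.inf, 0
        for k in range(1, len(ranking) + 1):
            cols = ranking[:k]
            score = cross_val_score(
                RandomForestClassifier(n_estimators=200, random_state=0),
                X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_score, best_k = score, k
        return ranking[:best_k], best_score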
High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence
Ravikumar, Pradeep, Wainwright, Martin J., Raskutti, Garvesh, Yu, Bin
Given i.i.d. observations of a random vector $X \in \mathbb{R}^p$, we study the problem of estimating both its covariance matrix $\Sigma^*$ and its inverse covariance or concentration matrix $\Theta^* = (\Sigma^*)^{-1}$. We estimate $\Theta^*$ by minimizing an $\ell_1$-penalized log-determinant Bregman divergence; in the multivariate Gaussian case, this approach corresponds to $\ell_1$-penalized maximum likelihood, and the structure of $\Theta^*$ is specified by the graph of an associated Gaussian Markov random field. We analyze the performance of this estimator under high-dimensional scaling, in which the number of nodes in the graph $p$, the number of edges $s$, and the maximum node degree $d$ are allowed to grow as a function of the sample size $n$. In addition to the parameters $(p,s,d)$, our analysis identifies other key quantities that control the rates of convergence: (a) the $\ell_\infty$-operator norm of the true covariance matrix $\Sigma^*$; (b) the $\ell_\infty$-operator norm of the sub-matrix $\Gamma^*_{S S}$, where $S$ indexes the graph edges, and $\Gamma^* = (\Theta^*)^{-1} \otimes (\Theta^*)^{-1}$; (c) a mutual incoherence or irrepresentability measure on the matrix $\Gamma^*$; and (d) the rate of decay $1/f(n,\delta)$ of the probabilities $\{|\hat{\Sigma}^n_{ij}- \Sigma^*_{ij}| > \delta \}$, where $\hat{\Sigma}^n$ is the sample covariance based on $n$ samples. Our first result establishes consistency of our estimate $\hat{\Theta}$ in the elementwise maximum-norm. This in turn allows us to derive convergence rates in the Frobenius and spectral norms, with improvements upon existing results for graphs with maximum node degrees $d = o(\sqrt{s})$. In our second result, we show that with probability converging to one, the estimate $\hat{\Theta}$ correctly specifies the zero pattern of the concentration matrix $\Theta^*$.
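In the Gaussian case this estimator coincides with the graphical lasso, so a brief illustration of the $\ell_1$-penalized log-determinant fit can be given with scikit-learn; the chain-graph precision matrix and the regularization weight alpha below are placeholder choices, not values prescribed by the paper's theory:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    # sparse true precision matrix Theta* on a chain graph
    p = 10
    theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
    sigma = np.linalg.inv(theta)
    X = rng.multivariate_normal(np.zeros(p), sigma, size=500)

    model = GraphicalLasso(alpha=0.05).fit(X)     # l1-penalized log-det / Gaussian MLE
    theta_hat = model.precision_                  # estimate of Theta* = (Sigma*)^{-1}
    support = np.abs(theta_hat) > 1e-4            # recovered zero pattern (graph edges)
    print(np.max(np.abs(theta_hat - theta)))      # elementwise maximum-norm error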
Inference with Discriminative Posterior
Salojärvi, Jarkko, Puolamäki, Kai, Savia, Eerika, Kaski, Samuel
We study Bayesian discriminative inference given a model family $p(c, x, \theta)$ that is assumed to contain all our prior information but is still known to be incorrect. This falls in between "standard" Bayesian generative modeling and Bayesian regression, where the marginal $p(x, \theta)$ is known to be uninformative about $p(c|x, \theta)$. We give an axiomatic proof that the discriminative posterior is consistent for conditional inference; using the discriminative posterior is standard practice in classical Bayesian regression, but we show that it is theoretically justified for model families of joint densities as well. A practical benefit compared to Bayesian regression is that the standard methods for handling missing values in generative modeling can be extended to discriminative inference, which is useful when the amount of data is small. Compared to standard generative modeling, the discriminative posterior yields better conditional inference when the model family is incorrect. If the model family also contains the true model, the discriminative posterior gives the same result as standard Bayesian generative modeling. Practical computation is done with Markov chain Monte Carlo.
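A minimal sketch of the idea: the parameter is weighted by the conditional likelihood $\prod_i p(c_i|x_i,\theta)$ rather than the joint likelihood, and samples are drawn by Markov chain Monte Carlo. The toy model family (two unit-variance Gaussian class-conditionals with equal class priors), the random-walk Metropolis sampler, and the wide Gaussian prior are all simplified placeholders, not the constructions used in the paper:

    import numpy as np
    from scipy.stats import norm

    def conditional_log_lik(theta, x, c):
        """log prod_i p(c_i | x_i, theta) for two unit-variance Gaussian classes
        with means theta[0], theta[1] and equal class priors."""
        l0 = norm.logpdf(x, theta[0], 1.0)
        l1 = norm.logpdf(x, theta[1], 1.0)
        log_post1 = l1 - np.logaddexp(l0, l1)          # log p(c=1 | x, theta)
        return np.sum(np.where(c == 1, log_post1, np.log1p(-np.exp(log_post1))))

    def discriminative_posterior_samples(x, c, n_samples=5000, step=0.2, seed=0):
        """Random-walk Metropolis targeting prior(theta) * conditional likelihood."""
        rng = np.random.default_rng(seed)
        theta = np.zeros(2)
        cur = conditional_log_lik(theta, x, c) + np.sum(norm.logpdf(theta, 0, 10))
        out = []
        for _ in range(n_samples):
            prop = theta + step * rng.standard_normal(2)
            new = conditional_log_lik(prop, x, c) + np.sum(norm.logpdf(prop, 0, 10))
            if np.log(rng.uniform()) < new - cur:
                theta, cur = prop, new
            out.append(theta.copy())
        return np.array(out)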
A Multivariate Regression Approach to Association Analysis of Quantitative Trait Network
Kim, Seyoung, Sohn, Kyung-Ah, Xing, Eric P.
Many complex disease syndromes such as asthma consist of a large number of highly related, rather than independent, clinical phenotypes, raising a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. In this study, we propose a new statistical framework called graph-guided fused lasso (GFlasso) to address this issue in a principled way. Our approach explicitly represents the dependency structure among the quantitative traits as a network, and leverages this trait network to encode structured regularization in a multivariate regression model over the genotypes and traits, so that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most traditional methods examine each phenotype independently and combine the results afterwards, our approach analyzes all of the traits jointly in a single statistical framework and borrows information across correlated phenotypes to discover the genetic markers that perturb a subset of correlated traits jointly rather than a single trait. Using simulated datasets based on the HapMap consortium data and an asthma dataset, we compare the performance of our method with single-marker analysis and with other sparse regression methods, such as ridge regression and the lasso, that do not use any structural information in the traits. Our results show a significant advantage in detecting the true causal SNPs when the correlation pattern among traits is incorporated through our proposed methods.
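A rough sketch of the kind of penalized objective the abstract describes, assuming a squared-error loss, a lasso term, and a graph-guided fusion term over trait-network edges; the exact edge weighting and the optimization algorithm in GFlasso differ, so this only illustrates the structure:

    import numpy as np

    def gflasso_objective(B, X, Y, edges, lam1, lam2):
        """B: p x k coefficient matrix (p SNPs, k traits), X: n x p genotypes,
        Y: n x k traits. edges: list of (trait_s, trait_t, correlation r_st)
        taken from the trait network."""
        loss = 0.5 * np.sum((Y - X @ B) ** 2)            # squared-error loss over all traits
        lasso = lam1 * np.sum(np.abs(B))                 # sparsity across all coefficients
        fusion = lam2 * sum(np.sum(np.abs(B[:, s] - np.sign(r) * B[:, t]))
                            for s, t, r in edges)        # fuse coefficients of correlated traits
        return loss + lasso + fusion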
Improved Estimation of High-dimensional Ising Models
We consider the problem of jointly estimating the parameters as well as the structure of binary-valued Markov random fields, in contrast to earlier work that focuses on one of the two problems. We formulate the problem as the maximization of an $\ell_1$-regularized surrogate likelihood, which allows us to find a sparse solution. Our optimization technique efficiently incorporates the cutting-plane algorithm in order to obtain a tighter outer bound on the marginal polytope, which improves both the parameter estimates and the approximation to the marginals. On synthetic data, we compare our algorithm to other existing methods on the two estimation tasks. We analyze the method in the high-dimensional setting, where the number of dimensions $p$ is allowed to grow with the number of observations $n$. The rate of convergence of the estimate is shown to depend explicitly on the sparsity of the underlying graph.
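The surrogate-likelihood and cutting-plane machinery of the paper is not reproduced here. As a loose point of reference only, the sketch below shows node-wise $\ell_1$-regularized logistic regression, a simpler and different method that is commonly used for sparse Ising structure estimation; the penalty strength C is a placeholder:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ising_neighborhoods(X, C=0.1):
        """X: n x p matrix of {-1, +1} spins. Regress each node on the others with an
        l1 penalty; nonzero coefficients indicate candidate edges of the graph."""
        n, p = X.shape
        adj = np.zeros((p, p))
        for j in range(p):
            others = np.delete(np.arange(p), j)
            clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            clf.fit(X[:, others], X[:, j])
            adj[j, others] = clf.coef_.ravel()
        return (np.abs(adj) + np.abs(adj.T)) > 1e-6      # symmetrize to get the edge set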
Hierarchical structure and the prediction of missing links in networks
Clauset, Aaron, Moore, Cristopher, Newman, M. E. J.
Networks have in recent years emerged as an invaluable tool for describing and quantifying complex systems in many branches of science. Recent studies suggest that networks often exhibit hierarchical organization, where vertices divide into groups that further subdivide into groups of groups, and so forth over multiple scales. In many cases these groups are found to correspond to known functional units, such as ecological niches in food webs, modules in biochemical networks (protein interaction networks, metabolic networks, or genetic regulatory networks), or communities in social networks. Here we present a general technique for inferring hierarchical structure from network data and demonstrate that the existence of hierarchy can simultaneously explain and quantitatively reproduce many commonly observed topological properties of networks, such as right-skewed degree distributions, high clustering coefficients, and short path lengths. We further show that knowledge of hierarchical structure can be used to predict missing connections in partially known networks with high accuracy, and for more general network structures than competing techniques. Taken together, our results suggest that hierarchy is a central organizing principle of complex networks, capable of offering insight into many network phenomena.
Gibbs posterior for variable selection in high-dimensional classification and data mining
Jiang, Wenxin, Tanner, Martin A.
In the popular approach of "Bayesian variable selection" (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here: studying BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing that risk without modeling the data probabilistically. This can improve performance over the usual Bayesian approach, which depends on a probability model that may be misspecified. Conditions will be provided for achieving good risk performance, even in the presence of high dimensionality, when the number of candidate variables "$K$" can be much larger than the sample size "$n$." In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.
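A minimal sketch of a Gibbs posterior over variable subsets, assuming the risk is the training classification error of a plug-in logistic classifier; the inverse temperature, the subset-size prior, and the Metropolis moves are placeholder choices rather than the ones analyzed in the paper:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def risk(subset, X, y):
        """Empirical classification error using only the selected variables (y in {0, 1})."""
        if not subset:
            return np.mean(y != np.round(np.mean(y)))
        cols = sorted(subset)
        clf = LogisticRegression().fit(X[:, cols], y)
        return np.mean(clf.predict(X[:, cols]) != y)

    def gibbs_posterior_mcmc(X, y, n_iter=2000, temp=1.0, prior_pen=2.0, seed=0):
        """Metropolis sampler for the Gibbs posterior
        proportional to exp(-n * temp * risk(S)) * exp(-prior_pen * |S|)."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        S = set()
        cur = -n * temp * risk(S, X, y) - prior_pen * len(S)
        visits = np.zeros(p)
        for _ in range(n_iter):
            j = int(rng.integers(p))                  # propose flipping one variable in or out
            T = S ^ {j}
            new = -n * temp * risk(T, X, y) - prior_pen * len(T)
            if np.log(rng.uniform()) < new - cur:
                S, cur = T, new
            for k in S:
                visits[k] += 1
        return visits / n_iter                        # inclusion frequencies per variable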
Choice of neighbor order in nearest-neighbor classification
Hall, Peter, Park, Byeong U., Samworth, Richard J.
The $k$th-nearest neighbor rule is arguably the simplest and most intuitively appealing nonparametric classification procedure. However, application of this method is inhibited by lack of knowledge about its properties, in particular, about the manner in which it is influenced by the value of $k$; and by the absence of techniques for empirical choice of $k$. In the present paper we detail the way in which the value of $k$ determines the misclassification error. We consider two models, Poisson and Binomial, for the training samples. Under the first model, data are recorded in a Poisson stream and are "assigned" to one or other of the two populations in accordance with the prior probabilities. In particular, the total number of data in both training samples is a Poisson-distributed random variable. Under the Binomial model, however, the total number of data in the training samples is fixed, although again each data value is assigned in a random way. Although the values of risk and regret associated with the Poisson and Binomial models are different, they are asymptotically equivalent to first order, and also to the risks associated with kernel-based classifiers that are tailored to the case of two derivatives. These properties motivate new methods for choosing the value of $k$.
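The paper derives asymptotic expressions for the risk as a function of $k$; as a purely empirical complement, and not the paper's procedure, $k$ is often chosen by cross-validation, as in this brief sketch (the candidate range and fold count are arbitrary):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def choose_k(X, y, k_values=range(1, 31), cv=10):
        """Pick the number of neighbors with the best cross-validated accuracy."""
        scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
                  for k in k_values]
        return list(k_values)[int(np.argmax(scores))], scores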
Statistical Learning Theory: Models, Concepts, and Results
von Luxburg, Ulrike, Schoelkopf, Bernhard
Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms. In this article we attempt to give a gentle, non-technical overview of the key ideas and insights of statistical learning theory. We target a broad audience, not necessarily machine learning researchers. This paper can serve as a starting point for people who want an overview of the field before diving into technical details.