Statistical Learning
Hybrid Deterministic-Stochastic Methods for Data Fitting
Friedlander, Michael P., Schmidt, Mark
Many structured data-fitting applications require the solution of an optimization problem involving a sum over a potentially large number of measurements. Incremental gradient algorithms offer inexpensive iterations by sampling a subset of the terms in the sum. These methods can make great progress initially, but often slow as they approach a solution. In contrast, full-gradient methods achieve steady convergence at the expense of evaluating the full objective and gradient on each iteration. We explore hybrid methods that exhibit the benefits of both approaches. Rate-of-convergence analysis shows that by controlling the sample size in an incremental gradient algorithm, it is possible to maintain the steady convergence rates of full-gradient methods. We detail a practical quasi-Newton implementation based on this approach. Numerical experiments illustrate its potential benefits.
Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models
Flynn, Cheryl J., Hurvich, Clifford M., Simonoff, Jeffrey S.
It has been shown that AIC-type criteria are asymptotically efficient selectors of the tuning parameter in non-concave penalized regression methods under the assumption that the population variance is known or that a consistent estimator is available. We relax this assumption to prove that AIC itself is asymptotically efficient and we study its performance in finite samples. In classical regression, it is known that AIC tends to select overly complex models when the dimension of the maximum candidate model is large relative to the sample size. Simulation studies suggest that AIC suffers from the same shortcomings when used in penalized regression. We therefore propose the use of the classical corrected AIC (AICc) as an alternative and prove that it maintains the desired asymptotic properties. To broaden our results, we further prove the efficiency of AIC for penalized likelihood methods in the context of generalized linear models with no dispersion parameter. Similar results exist in the literature but only for a restricted set of candidate models. By employing results from the classical literature on maximum-likelihood estimation in misspecified models, we are able to establish this result for a general set of candidate models. We use simulations to assess the performance of AIC and AICc, as well as that of other selectors, in finite samples for both SCAD-penalized and Lasso regressions and a real data example is considered.
Feature Selection for Microarray Gene Expression Data using Simulated Annealing guided by the Multivariate Joint Entropy
González, Fernando, Belanche, Lluís A.
In cancer diagnosis, classification of the different tumor types is of great importance. An accurate prediction of different tumor types provides better treatment and toxicity minimization on patients. Traditional methods of tackling this situation are primarily based on morphological characteristics of tumorous tissue [1]. These conventional methods are reported to have several diagnosis limitations. In order to analyze the problem of cancer classification using gene expression data, more systematic approaches have been developed [2]. Pioneering work in cancer classification by gene expression using DNA microarray showed the possibility to help the diagnosis by means of Machine Learning or more generally Data Mining methods [3], which are now extensively used for this task [4]. However, in this setting gene expression data analysis entails a heavy computational consumption of resources, due to the extreme sparseness compared to standard data sets in classification tasks [5]. Typically, a gene expression data set may consist of dozens of observations but with thousands or even tens of thousands of genes.
An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering
Kearns, Michael, Mansour, Yishay, Ng, Andrew Y.
Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the ?soft' assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences of behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are, and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K ?means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method that we call posterior assignment, that is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.
Update Rules for Parameter Estimation in Bayesian Networks
Bauer, Eric, Koller, Daphne, Singer, Yoram
This paper re-examines the problem of parameter estimation in Bayesian networks with missing values and hidden variables from the perspective of recent work in on-line learning [Kivinen & Warmuth, 1994]. We provide a unified framework for parameter estimation that encompasses both on-line learning, where the model is continuously adapted to new data cases as they arrive, and the more traditional batch learning, where a pre-accumulated set of samples is used in a one-time model selection process. In the batch case, our framework encompasses both the gradient projection algorithm and the EM algorithm for Bayesian networks. The framework also leads to new on-line and batch parameter update schemes, including a parameterized version of EM. We provide both empirical and theoretical results indicating that parameterized EM allows faster convergence to the maximum likelihood parameters than does standard EM.
Centrality-constrained graph embedding
Baingana, Brian, Giannakis, Georgios B.
Visual rendering of graphs is a key task in the mapping of complex network data. Although most graph drawing algorithms emphasize aesthetic appeal, certain applications such as travel-time maps place more importance on visualization of structural network properties. The present paper advocates a graph embedding approach with centrality considerations to comply with node hierarchy. The problem is formulated as one of constrained multi-dimensional scaling (MDS), and it is solved via block coordinate descent iterations with successive approximations and guaranteed convergence to a KKT point. In addition, a regularization term enforcing graph smoothness is incorporated with the goal of reducing edge crossings. Experimental results demonstrate that the algorithm converges, and can be used to efficiently embed large graphs on the order of thousands of nodes.
Factoring nonnegative matrices with linear programs
Bittorf, Victor, Recht, Benjamin, Re, Christopher, Tropp, Joel A.
This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C such that X approximately equals CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes.
Sparse Multiple Kernel Learning with Geometric Convergence Rate
Jin, Rong, Yang, Tianbao, Mahdavi, Mehrdad
In this paper, we study the problem of sparse multiple kernel learning (MKL), where the goal is to efficiently learn a combination of a fixed small number of kernels from a large pool that could lead to a kernel classifier with a small prediction error. We develop an efficient algorithm based on the greedy coordinate descent algorithm, that is able to achieve a geometric convergence rate under appropriate conditions. The convergence rate is achieved by measuring the size of functional gradients by an empirical $\ell_2$ norm that depends on the empirical data distribution. This is in contrast to previous algorithms that use a functional norm to measure the size of gradients, which is independent from the data samples. We also establish a generalization error bound of the learned sparse kernel classifier using the technique of local Rademacher complexity.
Regression shrinkage and grouping of highly correlated predictors with HORSES
Jang, Woncheol, Lim, Johan, Lazar, Nicole A., Loh, Ji Meng, Yu, Donghyeon
Identifying homogeneous subgroups of variables can be challenging in high dimensional data analysis with highly correlated predictors. We propose a new method called Hexagonal Operator for Regression with Shrinkage and Equality Selection, HORSES for short, that simultaneously selects positively correlated variables and identifies them as predictive clusters. This is achieved via a constrained least-squares problem with regularization that consists of a linear combination of an L_1 penalty for the coefficients and another L_1 penalty for pairwise differences of the coefficients. This specification of the penalty function encourages grouping of positively correlated predictors combined with a sparsity solution. We construct an efficient algorithm to implement the HORSES procedure. We show via simulation that the proposed method outperforms other variable selection methods in terms of prediction error and parsimony. The technique is demonstrated on two data sets, a small data set from analysis of soil in Appalachia, and a high dimensional data set from a near infrared (NIR) spectroscopy study, showing the flexibility of the methodology.
Distribution-Free Distribution Regression
Poczos, Barnabas, Rinaldo, Alessandro, Singh, Aarti, Wasserman, Larry
'Distribution regression' refers to the situation where a response Y depends on a covariate P where P is a probability distribution. The model is Y f(P) µ where f is an unknown regression function and µ is a random error. Typically, we do not observe P directly, but rather, we observe a sample from P. In this paper we develop theory and methods for distribution-free versions of distribution regression. This means that we do not make distributional assumptions about the error term µ and covariate P. We prove that when the effective dimension is small enough (as measured by the doubling dimension), then the excess prediction risk converges to zero with a polynomial rate.