Genre
Optimal classification in sparse Gaussian graphic model
Fan, Yingying, Jin, Jiashun, Yao, Zhigang
Consider a two-class classification problem where the number of features is much larger than the sample size. The features are masked by Gaussian noise with mean zero and covariance matrix $\Sigma$, where the precision matrix $\Omega=\Sigma^{-1}$ is unknown but is presumably sparse. The useful features, also unknown, are sparse and each contributes weakly (i.e., rare and weak) to the classification decision. By obtaining a reasonably good estimate of $\Omega$, we formulate the setting as a linear regression model. We propose a two-stage classification method where we first select features by the method of Innovated Thresholding (IT), and then use the retained features and Fisher's LDA for classification. In this approach, a crucial problem is how to set the threshold of IT. We approach this problem by adapting the recent innovation of Higher Criticism Thresholding (HCT). We find that when useful features are rare and weak, the limiting behavior of HCT is essentially just as good as the limiting behavior of ideal threshold, the threshold one would choose if the underlying distribution of the signals is known (if only). Somewhat surprisingly, when $\Omega$ is sufficiently sparse, its off-diagonal coordinates usually do not have a major influence over the classification decision. Compared to recent work in the case where $\Omega$ is the identity matrix [Proc. Natl. Acad. Sci. USA 105 (2008) 14790-14795; Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 (2009) 4449-4470], the current setting is much more general, which needs a new approach and much more sophisticated analysis. One key component of the analysis is the intimate relationship between HCT and Fisher's separation. Another key component is the tight large-deviation bounds for empirical processes for data with unconventional correlation structures, where graph theory on vertex coloring plays an important role.
Hypothesis Testing for Automated Community Detection in Networks
Bickel, Peter J., Sarkar, Purnamrita
Community detection in networks is a key exploratory tool with applications in a diverse set of areas, ranging from finding communities in social and biological networks to identifying link farms in the World Wide Web. The problem of finding communities or clusters in a network has received much attention from statistics, physics and computer science. However, most clustering algorithms assume knowledge of the number of clusters k. In this paper we propose to automatically determine k in a graph generated from a Stochastic Blockmodel. Our main contribution is twofold; first, we theoretically establish the limiting distribution of the principal eigenvalue of the suitably centered and scaled adjacency matrix, and use that distribution for our hypothesis test. Secondly, we use this test to design a recursive bipartitioning algorithm. Using quantifiable classification tasks on real world networks with ground truth, we show that our algorithm outperforms existing probabilistic models for learning overlapping clusters, and on unlabeled networks, we show that we uncover nested community structure.
Analyzing Evolutionary Optimization in Noisy Environments
Qian, Chao, Yu, Yang, Zhou, Zhi-Hua
Many optimization tasks have to be handled in noisy environments, where we cannot obtain the exact evaluation of a solution but only a noisy one. For noisy optimization tasks, evolutionary algorithms (EAs), a kind of stochastic metaheuristic search algorithm, have been widely and successfully applied. Previous work mainly focuses on empirical studying and designing EAs for noisy optimization, while, the theoretical counterpart has been little investigated. In this paper, we investigate a largely ignored question, i.e., whether an optimization problem will always become harder for EAs in a noisy environment. We prove that the answer is negative, with respect to the measurement of the expected running time. The result implies that, for optimization tasks that have already been quite hard to solve, the noise may not have a negative effect, and the easier a task the more negatively affected by the noise. On a representative problem where the noise has a strong negative effect, we examine two commonly employed mechanisms in EAs dealing with noise, the re-evaluation and the threshold selection strategies. The analysis discloses that the two strategies, however, both are not effective, i.e., they do not make the EA more noise tolerant. We then find that a small modification of the threshold selection allows it to be proven as an effective strategy for dealing with the noise in the problem.
A survey on independence-based Markov networks learning
Name Reference Comments KS Koller and Sahami (1996) - Not Sound - The first one of this type - Requires specifying MB size in advance GS Margaritis and Thrun (2000) - Sound in theory - Proposed to learn Bayesian network via the induction of neighbors of each variable - First proved such kind of algorithm - Works in two phases: grow and shrink IAMB and its variants Tsamardinos et al (2003) - Sound in theory - Actually variant of GS - Simple to implement - Time efficient - Very poor on data efficiency - IAMB's variants achieve better performance on data efficiency than IAMB HITON-PC/MB Aliferis et al (2003) - Not sound - Another trial to make use of the topology information to enhance data efficiency - Data efficiency comparable to IAMB - Much slower compared to IAMB Fast-IAMB Yaramakala and Margaritis (2005) - Sound in theory - No fundamental difference as compared to IAMB - Adds candidates more greedily to speed up the learning - Still poor on data efficiency performance MMPC/MB Tsamardinos et al (2006) - Not sound - The first to make use of the underling topology information - Much more data efficient compared to IAMB - Much slower compared to IAMB PCMB Peรฑa et al (2007) - Sound in theory - Data efficient by making use of topology information - Poor on time efficiency - Distinguish spouses from parents/children - Distinguish some children from parents/children IPC-MB Fu and Desmarais (2008) - Sound in theory - Most data efficient compared with previous algorithms - Much faster than PCMB on computing - Distinguish spouses from parents/children - Distinguish some children from parents/children - Best tradeoff among this family of algorithms
Stochastic gradient descent on Riemannian manifolds
Stochastic gradient descent is a simple approach to find the local minima of a cost function whose evaluations are corrupted by noise. In this paper, we develop a procedure extending stochastic gradient descent algorithms to the case where the function is defined on a Riemannian manifold. We prove that, as in the Euclidian case, the gradient descent algorithm converges to a critical point of the cost function. The algorithm has numerous potential applications, and is illustrated here by four examples. In particular a novel gossip algorithm on the set of covariance matrices is derived and tested numerically.
Nonparametric Bayes dynamic modeling of relational data
Durante, Daniele, Dunson, David B.
Symmetric binary matrices representing relations among entities are commonly collected in many areas. Our focus is on dynamically evolving binary relational matrices, with interest being in inference on the relationship structure and prediction. We propose a nonparametric Bayesian dynamic model, which reduces dimensionality in characterizing the binary matrix through a lower-dimensional latent space representation, with the latent coordinates evolving in continuous time via Gaussian processes. By using a logistic mapping function from the probability matrix space to the latent relational space, we obtain a flexible and computational tractable formulation. Employing P\`olya-Gamma data augmentation, an efficient Gibbs sampler is developed for posterior computation, with the dimension of the latent space automatically inferred. We provide some theoretical results on flexibility of the model, and illustrate performance via simulation experiments. We also consider an application to co-movements in world financial markets.
Domain Adaptation of Majority Votes via Perturbed Variation-based Label Transfer
We tackle the PAC-Bayesian Domain Adaptation (DA) problem. This arrives when one desires to learn, from a source distribution, a good weighted majority vote (over a set of classifiers) on a different target distribution. In this context, the disagreement between classifiers is known crucial to control. In non-DA supervised setting, a theoretical bound - the C-bound - involves this disagreement and leads to a majority vote learning algorithm: MinCq. In this work, we extend MinCq to DA by taking advantage of an elegant divergence between distribution called the Perturbed Varation (PV). Firstly, justified by a new formulation of the C-bound, we provide to MinCq a target sample labeled thanks to a PV-based self-labeling focused on regions where the source and target marginal distributions are closer. Secondly, we propose an original process for tuning the hyperparameters. Our framework shows very promising results on a toy problem.
Near-Optimal Entrywise Sampling for Data Matrices
Achlioptas, Dimitris, Karnin, Zohar, Liberty, Edo
We consider the problem of selecting non-zero entries of a matrix $A$ in order to produce a sparse sketch of it, $B$, that minimizes $\|A-B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing $n$ observations over $m$ attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding $A$. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.
From Maxout to Channel-Out: Encoding Information on Sparse Pathways
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
Sigma Point Belief Propagation
Meyer, Florian, Hlinka, Ondrej, Hlawatsch, Franz
The sigma point (SP) filter, also known as unscented Kalman filter, is an attractive alternative to the extended Kalman filter and the particle filter. Here, we extend the SP filter to nonsequential Bayesian inference corresponding to loopy factor graphs. We propose sigma point belief propagation (SPBP) as a low-complexity approximation of the belief propagation (BP) message passing scheme. SPBP achieves approximate marginalizations of posterior distributions corresponding to (generally) loopy factor graphs. It is well suited for decentralized inference because of its low communication requirements. For a decentralized, dynamic sensor localization problem, we demonstrate that SPBP can outperform nonparametric (particle-based) BP while requiring significantly less computations and communications.