Statistical Learning
Optimal Discriminant Functions Based On Sampled Distribution Distance for Modulation Classification
Urriza, Paulo, Rebeiz, Eric, Cabric, Danijela
In this letter, we derive the optimal discriminant functions for modulation classification based on the sampled distribution distance. The proposed method classifies various candidate constellations using a low complexity approach based on the distribution distance at specific testpoints along the cumulative distribution function. This method, based on the Bayesian decision criteria, asymptotically provides the minimum classification error possible given a set of testpoints. Testpoint locations are also optimized to improve classification performance. The method provides significant gains over existing approaches that also use the distribution of the signal features.
No More Pesky Learning Rates
Schaul, Tom, Zhang, Sixin, LeCun, Yann
The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.
Clustering validity based on the most similarity
Namayandeh, Raheleh, Didehvar, Farzad, Shojaei, Zahra
One basic requirement of many studies is the necessity of classifying data. Clustering is a proposed method for summarizing networks. Clustering methods can be divided into two categories named model-based approaches and algorithmic approaches. Since the most of clustering methods depend on their input parameters, it is important to evaluate the result of a clustering algorithm with its different input parameters, to choose the most appropriate one. There are several clustering validity techniques based on inner density and outer density of clusters that represent different metrics to choose the most appropriate clustering independent of the input parameters. According to dependency of previous methods on the input parameters, one challenge in facing with large systems, is to complete data incrementally that effects on the final choice of the most appropriate clustering. Those methods define the existence of high intensity in a cluster, and low intensity among different clusters as the measure of choosing the optimal clustering. This measure has a tremendous problem, not availing all data at the first stage. In this paper, we introduce an efficient measure in which maximum number of repetitions for various initial values occurs.
Layer-wise learning of deep generative models
Arnold, Ludovic, Ollivier, Yann
When using deep, multi-layered architectures to build generative models of data, it is difficult to train all layers at once. We propose a layer-wise training procedure admitting a performance guarantee compared to the global optimum. It is based on an optimistic proxy of future performance, the best latent marginal. We interpret auto-encoders in this setting as generative models, by showing that they train a lower bound of this criterion. We test the new learning procedure against a state of the art method (stacked RBMs), and find it to improve performance. Both theory and experiments highlight the importance, when training deep architectures, of using an inference model (from data to hidden variables) richer than the generative model (from hidden variables to data).
MAD-Bayes: MAP-based Asymptotic Derivations from Bayes
Broderick, Tamara, Kulis, Brian, Jordan, Michael I.
The classical mixture of Gaussians model is related to K-means via small-variance asymptotics: as the covariances of the Gaussians tend to zero, the negative log-likelihood of the mixture of Gaussians model approaches the K-means objective, and the EM algorithm approaches the K-means algorithm. Kulis & Jordan (2012) used this observation to obtain a novel K-means-like algorithm from a Gibbs sampler for the Dirichlet process (DP) mixture. We instead consider applying small-variance asymptotics directly to the posterior in Bayesian nonparametric models. This framework is independent of any specific Bayesian inference algorithm, and it has the major advantage that it generalizes immediately to a range of models beyond the DP mixture. To illustrate, we apply our framework to the feature learning setting, where the beta process and Indian buffet process provide an appropriate Bayesian nonparametric prior. We obtain a novel objective function that goes beyond clustering to learn (and penalize new) groupings for which we relax the mutual exclusivity and exhaustivity assumptions of clustering. We demonstrate several other algorithms, all of which are scalable and simple to implement. Empirical results demonstrate the benefits of the new framework.
A consistent clustering-based approach to estimating the number of change-points in highly dependent time-series
Khaleghi, Azaden, Ryabko, Daniil
The problem of change-point estimation is considered under a general framework where the data are generated by unknown stationary ergodic process distributions. In this context, the consistent estimation of the number of change-points is provably impossible. However, it is shown that a consistent clustering method may be used to estimate the number of change points, under the additional constraint that the correct number of process distributions that generate the data is provided. This additional parameter has a natural interpretation in many real-world applications. An algorithm is proposed that estimates the number of change-points and locates the changes. The proposed algorithm is shown to be asymptotically consistent; its empirical evaluations are provided.
Joint variable and rank selection for parsimonious estimation of high-dimensional matrices
Bunea, Florentina, She, Yiyuan, Wegkamp, Marten H.
We propose dimension reduction methods for sparse, high-dimensional multivariate response regression models. Both the number of responses and that of the predictors may exceed the sample size. Sometimes viewed as complementary, predictor selection and rank reduction are the most popular strategies for obtaining lower-dimensional approximations of the parameter matrix in such models. We show in this article that important gains in prediction accuracy can be obtained by considering them jointly. We motivate a new class of sparse multivariate regression models, in which the coefficient matrix has low rank and zero rows or can be well approximated by such a matrix. Next, we introduce estimators that are based on penalized least squares, with novel penalties that impose simultaneous row and rank restrictions on the coefficient matrix. We prove that these estimators indeed adapt to the unknown matrix sparsity and have fast rates of convergence. We support our theoretical results with an extensive simulation study and two data analyses.
An Evaluation of Structural Parameters for Probabilistic Reasoning: Results on Benchmark Circuits
Fattah, Yousri El, Dechter, Rina
Many algorithms for processing probabilistic networks are dependent on the topological properties of the problem's structure. Such algorithms (e.g., clustering, conditioning) are effective only if the problem has a sparse graph captured by parameters such as tree width and cycle-cut set size. In this paper we initiate a study to determine the potential of structure-based algorithms in real-life applications. We analyze empirically the structural properties of problems coming from the circuit diagnosis domain. Specifically, we locate those properties that capture the effectiveness of clustering and conditioning as well as of a family of conditioning+clustering algorithms designed to gradually trade space for time. We perform our analysis on 11 benchmark circuits widely used in the testing community. We also report on the effect of ordering heuristics on tree-clustering and show that, on our benchmarks, the well-known max-cardinality ordering is substantially inferior to an ordering called min-degree.
Entailment in Probability of Thresholded Generalizations
A nonmonotonic logic of thresholded generalizations is presented. Given propositions A and B from a language L and a positive integer k, the thresholded generalization A=>B{k} means that the conditional probability P(B|A) falls short of one by no more than c*d^k. A two-level probability structure is defined. At the lower level, a model is defined to be a probability function on L. At the upper level, there is a probability distribution over models. A definition is given of what it means for a collection of thresholded generalizations to entail another thresholded generalization. This nonmonotonic entailment relation, called "entailment in probability", has the feature that its conclusions are "probabilistically trustworthy" meaning that, given true premises, it is improbable that an entailed conclusion would be false. A procedure is presented for ascertaining whether any given collection of premises entails any given conclusion. It is shown that entailment in probability is closely related to Goldszmidt and Pearl's System-Z^+, thereby demonstrating that the conclusions of System-Z^+ are probabilistically trustworthy.
The trace norm constrained matrix-variate Gaussian process for multitask bipartite ranking
Koyejo, Oluwasanmi, Lee, Cheng, Ghosh, Joydeep
We propose a novel hierarchical model for multitask bipartite ranking. The proposed approach combines a matrix-variate Gaussian process with a generative model for task-wise bipartite ranking. In addition, we employ a novel trace constrained variational inference approach to impose low rank structure on the posterior matrix-variate Gaussian process. The resulting posterior covariance function is derived in closed form, and the posterior mean function is the solution to a matrix-variate regression with a novel spectral elastic net regularizer. Further, we show that variational inference for the trace constrained matrix-variate Gaussian process combined with maximum likelihood parameter estimation for the bipartite ranking model is jointly convex. Our motivating application is the prioritization of candidate disease genes. The goal of this task is to aid the identification of unobserved associations between human genes and diseases using a small set of observed associations as well as kernels induced by gene-gene interaction networks and disease ontologies. Our experimental results illustrate the performance of the proposed model on real world datasets. Moreover, we find that the resulting low rank solution improves the computational scalability of training and testing as compared to baseline models.