Genre
Sphere Embedding: An Application to Part-of-Speech Induction
Maron, Yariv, Lamar, Michael, Bienenstock, Elie
Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks.
Probabilistic Deterministic Infinite Automata
Pfau, David, Bartlett, Nicholas, Wood, Frank
We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop and sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models.
Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
Nie, Feiping, Huang, Heng, Cai, Xiao, Ding, Chris H.
Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method with emphasizing joint ℓ2,1-norm minimization on both loss function and regularization. The ℓ2,1-norm based loss function is robust to outliers in data points and the ℓ2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied into both genomic and proteomic biomarkers discovery. Extensive empirical studies were performed on six data sets to demonstrate the effectiveness of our feature selection method.
Unsupervised Kernel Dimension Reduction
Wang, Meihong, Sha, Fei, Jordan, Michael I.
We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
Copula Processes
Wilson, Andrew G., Ghahramani, Zoubin
We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.
Spectral Regularization for Support Estimation
Vito, Ernesto D., Rosasco, Lorenzo, Toigo, Alessandro
In this paper we consider the problem of learning from data the support of a probability distribution when the distribution {\em does not} have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call {\em ``completely regular''}. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
Husmeier, Dirk, Dondelinger, Frank, Lebre, Sophie
Conventional dynamic Bayesian networks (DBNs) are based on the homogeneous Markov assumption, which is too restrictive in many practical applications. Various approaches to relax the homogeneity assumption have therefore been proposed in the last few years. The present paper aims to improve the flexibility of two recent versions of non-homogeneous DBNs, which either (i) suffer from the need for data discretization, or (ii) assume a time-invariant network structure. Allowing the network structure to be fully flexible leads to the risk of overfitting and inflated inference uncertainty though, especially in the highly topical field of systems biology, where independent measurements tend to be sparse. In the present paper we investigate three conceptually different regularization schemes based on inter-segment information sharing. We assess the performance in a comparative evaluation study based on simulated data. We compare the predicted segmentation of gene expression time series obtained during embryogenesis in Drosophila melanogaster with other state-of-the-art techniques. We conclude our evaluation with an application to synthetic biology, where the objective is to predict a known regulatory network of five genes in Saccharomyces cerevisiae.
Spatial and anatomical regularization of SVM for brain image analysis
Cuingnet, Remi, Chupin, Marie, Benali, Habib, Colliot, Olivier
Support vector machines (SVM) are increasingly used in brain image analyses since they allow capturing complex multivariate relationships in the data. Moreover, when the kernel is linear, SVMs can be used to localize spatial patterns of discrimination between two groups of subjects. However, the features' spatial distribution is not taken into account. As a consequence, the optimal margin hyperplane is often scattered and lacks spatial coherence, making its anatomical interpretation difficult. This paper introduces a framework to spatially regularize SVM for brain image analysis. We show that Laplacian regularization provides a flexible framework to integrate various types of constraints and can be applied to both cortical surfaces and 3D brain images. The proposed framework is applied to the classification of MR images based on gray matter concentration maps and cortical thickness measures from 30 patients with Alzheimer's disease and 30 elderly controls. The results demonstrate that the proposed method enables natural spatial and anatomical regularization of the classifier.
Learning Networks of Stochastic Differential Equations
Pereira, José, Ibrahimi, Morteza, Montanari, Andrea
We consider linear models for stochastic dynamics. Any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T. We analyse the l1-regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem.
Variable margin losses for classifier design
Masnadi-shirazi, Hamed, Vasconcelos, Nuno
The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter.