Goto

Collaborating Authors

 Statistical Learning


Inverse Density as an Inverse Problem: The Fredholm Equation Approach

arXiv.org Machine Learning

In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration, in particular, when one needs to average a function with respect to one probability distribution, given a sample from another. It is often referred as {\it importance sampling} in statistical inference and is also closely related to the problem of {\it covariate shift} in transfer learning as well as to various MCMC methods. It may also be useful for separating the underlying geometry of a space, say a manifold, from the density function defined on it. Our approach is based on reformulating the problem of estimating $\frac{q}{p}$ as an inverse problem in terms of an integral operator corresponding to a kernel, and thus reducing it to an integral equation, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel in the case of densities defined on $\R^d$, compact domains in $\R^d$ and smooth $d$-dimensional sub-manifolds of the Euclidean space. We also show experimental results including applications to classification and semi-supervised learning within the covariate shift framework and demonstrate some encouraging experimental comparisons. We also show how the parameters of our algorithms can be chosen in a completely unsupervised manner.


On latent position inference from doubly stochastic messaging activities

arXiv.org Machine Learning

We model messaging activities as a hierarchical doubly stochastic point process with three main levels, and develop an iterative algorithm for inferring actors' relative latent positions from a stream of messaging activity data. Each of the message-exchanging actors is modeled as a process in a latent space. The actors' latent positions are assumed to be influenced by the distribution of a much larger population over the latent space. Each actor's movement in the latent space is modeled as being governed by two parameters that we call confidence and visibility, in addition to dependence on the population distribution. The messaging frequency between a pair of actors is assumed to be inversely proportional to the distance between their latent positions. Our inference algorithm is based on a projection approach to an online filtering problem. The algorithm associates each actor with a probability density-valued process, and each probability density is assumed to be a mixture of basis functions. For efficient numerical experiments, we further develop our algorithm for the case where the basis functions are obtained by translating and scaling a standard Gaussian density.


The K-modes algorithm for clustering

arXiv.org Machine Learning

Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids or mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a K-modes objective function by combining the notions of density and cluster assignment. The algorithm becomes K-means and K-medoids in the limit of very large and very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative of a cluster, even with nonconvex clusters, and appears robust to outliers and misspecification of the scale and number of clusters.


Solving Support Vector Machines in Reproducing Kernel Banach Spaces with Positive Definite Functions

arXiv.org Machine Learning

In this paper we solve support vector machines in reproducing kernel Banach spaces with reproducing kernels defined on nonsymmetric domains instead of the traditional methods in reproducing kernel Hilbert spaces. Using the orthogonality of semi-inner-products, we can obtain the explicit representations of the dual (normalized-duality-mapping) elements of support vector machine solutions. In addition, we can introduce the reproduction property in a generalized native space by Fourier transform techniques such that it becomes a reproducing kernel Banach space, which can be even embedded into Sobolev spaces, and its reproducing kernel is set up by the related positive definite function. The representations of the optimal solutions of support vector machines (regularized empirical risks) in these reproducing kernel Banach spaces are formulated explicitly in terms of positive definite functions, and their finite numbers of coefficients can be computed by fixed point iteration. We also give some typical examples of reproducing kernel Banach spaces induced by Mat\'ern functions (Sobolev splines) so that their support vector machine solutions are well computable as the classical algorithms. Moreover, each of their reproducing bases includes information from multiple training data points. The concept of reproducing kernel Banach spaces offers us a new numerical tool for solving support vector machines.


Cost-Sensitive Tree of Classifiers

arXiv.org Machine Learning

Recently, machine learning algorithms have successfully entered large-scale real-world industrial applications (e.g. search engines and email spam filters). Here, the CPU cost during test time must be budgeted and accounted for. In this paper, we address the challenge of balancing the test-time cost and the classifier accuracy in a principled fashion. The test-time cost of a classifier is often dominated by the computation required for feature extraction-which can vary drastically across eatures. We decrease this extraction time by constructing a tree of classifiers, through which test inputs traverse along individual paths. Each path extracts different features and is optimized for a specific sub-partition of the input space. By only computing features for inputs that benefit from them the most, our cost sensitive tree of classifiers can match the high accuracies of the current state-of-the-art at a small fraction of the computational cost.


Stochastic Variational Inference

arXiv.org Machine Learning

We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.


Learning loopy graphical models with latent variables: Efficient methods and guarantees

arXiv.org Artificial Intelligence

The problem of structure estimation in graphical models with latent variables is considered. We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider models where the underlying Markov graph is locally tree-like, and the model is in the regime of correlation decay. For the special case of the Ising model, the number of samples $n$ required for structural consistency of our method scales as $n=\Omega(\theta_{\min}^{-\delta\eta(\eta+1)-2}\log p)$, where p is the number of variables, $\theta_{\min}$ is the minimum edge potential, $\delta$ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and $\eta$ is a parameter which depends on the bounds on node and edge potentials in the Ising model. Necessary conditions for structural consistency under any algorithm are derived and our method nearly matches the lower bound on sample requirements. Further, the proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph.


Analytic Feature Selection for Support Vector Machines

arXiv.org Machine Learning

Support vector machines (SVMs) rely on the inherent geometry of a data set to classify training data. Because of this, we believe SVMs are an excellent candidate to guide the development of an analytic feature selection algorithm, as opposed to the more commonly used heuristic methods. We propose a filter-based feature selection algorithm based on the inherent geometry of a feature set. Through observation, we identified six geometric properties that differ between optimal and suboptimal feature sets, and have statistically significant correlations to classifier performance. Our algorithm is based on logistic and linear regression models using these six geometric properties as predictor variables. The proposed algorithm achieves excellent results on high dimensional text data sets, with features that can be organized into a handful of feature types; for example, unigrams, bigrams or semantic structural features. We believe this algorithm is a novel and effective approach to solving the feature selection problem for linear SVMs.


Graph Estimation From Multi-attribute Data

arXiv.org Machine Learning

Many real world network problems often concern multivariate nodal attributes such as image, textual, and multi-view feature vectors on nodes, rather than simple univariate nodal attributes. The existing graph estimation methods built on Gaussian graphical models and covariance selection algorithms can not handle such data, neither can the theories developed around such methods be directly applied. In this paper, we propose a new principled framework for estimating graphs from multi-attribute data. Instead of estimating the partial correlation as in current literature, our method estimates the partial canonical correlations that naturally accommodate complex nodal features. Computationally, we provide an efficient algorithm which utilizes the multi-attribute structure. Theoretically, we provide sufficient conditions which guarantee consistent graph recovery. Extensive simulation studies demonstrate performance of our method under various conditions. Furthermore, we provide illustrative applications to uncovering gene regulatory networks from gene and protein profiles, and uncovering brain connectivity graph from functional magnetic resonance imaging data.


Analytic Expressions for Stochastic Distances Between Relaxed Complex Wishart Distributions

arXiv.org Machine Learning

The scaled complex Wishart distribution is a widely used model for multilook full polarimetric SAR data whose adequacy has been attested in the literature. Classification, segmentation, and image analysis techniques which depend on this model have been devised, and many of them employ some type of dissimilarity measure. In this paper we derive analytic expressions for four stochastic distances between relaxed scaled complex Wishart distributions in their most general form and in important particular cases. Using these distances, inequalities are obtained which lead to new ways of deriving the Bartlett and revised Wishart distances. The expressiveness of the four analytic distances is assessed with respect to the variation of parameters. Such distances are then used for deriving new tests statistics, which are proved to have asymptotic chi-square distribution. Adopting the test size as a comparison criterion, a sensitivity study is performed by means of Monte Carlo experiments suggesting that the Bhattacharyya statistic outperforms all the others. The power of the tests is also assessed. Applications to actual data illustrate the discrimination and homogeneity identification capabilities of these distances.