An Infinite Latent Attribute Model for Network Data
Palla, Konstantina, Knowles, David, Ghahramani, Zoubin
Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a "flat" clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.
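To make the two-layer structure concrete, here is a minimal sketch of the kind of link likelihood such a model implies, assuming a logistic link and per-feature subcluster interaction weights; the variable names and the specific parameterisation are ours, not the paper's.

import numpy as np

def link_probability(i, j, Z, C, W, bias=0.0):
    # Z : (N, M) binary matrix; Z[i, m] = 1 if node i possesses latent feature m.
    # C : (N, M) integer matrix; C[i, m] = subcluster of node i within feature m.
    # W : list of M matrices; W[m][a, b] = interaction weight between
    #     subclusters a and b of feature m.
    score = bias
    for m in range(Z.shape[1]):
        if Z[i, m] and Z[j, m]:               # feature m active for both nodes
            score += W[m][C[i, m], C[j, m]]
    return 1.0 / (1.0 + np.exp(-score))       # logistic link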
Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring
Ahn, Sungjin, Korattikara, Anoop, Welling, Max
In this paper we address the following question: Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data items for every sample we generate? An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it will sample from a normal approximation of the posterior, while for slow mixing rates it will mimic the behavior of SGLD with a pre-conditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and as such is an efficient optimizer during burn-in.
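For reference, a minimal sketch of the plain SGLD update that the paper starts from (mini-batch gradient plus injected Gaussian noise); the Fisher-scoring preconditioner that the paper adds is not reproduced here, and the function signatures are illustrative assumptions.

import numpy as np

def sgld_step(theta, minibatch, grad_log_prior, grad_log_lik, N, eps, rng):
    # One stochastic gradient Langevin dynamics update.
    # grad_log_prior(theta)  -> gradient of the log prior
    # grad_log_lik(theta, x) -> gradient of the log likelihood of one data item x
    # N   : total number of data items (the mini-batch gradient is rescaled by N/n)
    # eps : step size; the injected Gaussian noise has variance eps
    n = len(minibatch)
    g = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(theta, x) for x in minibatch)
    noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * g + noise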
Matrix Tile Analysis
Givoni, Inmar, Cheung, Vincent, Frey, Brendan J.
Many tasks require finding groups of elements in a matrix of numbers, symbols or class likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques including PCA, ICA, sparse matrix factorization and plaid analysis. These techniques are not appropriate when addition and multiplication of matrix elements are not sensibly defined. More directly, methods like bi-clustering can be used to classify matrix elements, but these methods make the overly restrictive assumption that the class of each element is a function of a row class and a column class. We introduce a general computational problem, 'matrix tile analysis' (MTA), which consists of decomposing a matrix into a set of non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows and columns. MTA does not require an algebra for combining tiles, but must search over discrete combinations of tile assignments. Exact MTA is a computationally intractable integer programming problem, but we describe an approximate iterative technique and a computationally efficient sum-product relaxation of the integer program. We compare the effectiveness of these methods to PCA and plaid on hundreds of randomly generated tasks. Using double-gene-knockout data, we show that MTA finds groups of interacting yeast genes that have biologically related functions.
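As a toy illustration of the representation only, the sketch below treats a tile as a pair of row and column subsets, checks that tiles do not overlap, and scores a decomposition with a simple within-tile homogeneity criterion of our own; the paper's actual integer-programming objective and its sum-product relaxation are not reproduced.

import numpy as np
from itertools import combinations

def tiles_overlap(tile_a, tile_b):
    # Tiles are (row_set, col_set) pairs; they overlap if they share a cell.
    rows_a, cols_a = tile_a
    rows_b, cols_b = tile_b
    return bool(rows_a & rows_b) and bool(cols_a & cols_b)

def decomposition_score(X, tiles):
    # Toy criterion: reward tiles whose entries are homogeneous (low variance).
    assert not any(tiles_overlap(a, b) for a, b in combinations(tiles, 2))
    score = 0.0
    for rows, cols in tiles:
        block = X[np.ix_(sorted(rows), sorted(cols))]
        score += block.size * (1.0 - block.var())
    return score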
A Scalable Bootstrap for Massive Data
Kleiner, Ariel, Talwalkar, Ameet, Sarkar, Purnamrita, Jordan, Michael I.
The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets---which are increasingly prevalent---the computation of bootstrap-based quantities can be prohibitively demanding. While variants such as subsampling and the $m$ out of $n$ bootstrap can be used in principle to reduce the cost of bootstrap computations, we find that these methods are generally not robust to specification of hyperparameters (such as the number of subsampled data points), and they often require use of more prior information (such as rates of convergence of estimators) than the bootstrap. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate BLB's favorable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing BLB to the bootstrap, the $m$ out of $n$ bootstrap, and subsampling. In addition, we present results from a large-scale distributed implementation of BLB demonstrating its computational superiority on massive data, a method for adaptively selecting BLB's hyperparameters, an empirical study applying BLB to several real datasets, and an extension of BLB to time series data.
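A minimal sketch of the BLB recipe as described: draw small subsamples, emulate full-size bootstrap resamples with multinomial weights within each subsample, and average the per-subsample quality assessments. The estimator is assumed to accept resampling weights, and the hyperparameter choices shown are illustrative.

import numpy as np

def bag_of_little_bootstraps(data, estimator, b, s=10, r=50, rng=None):
    # data      : 1-D array of n observations
    # estimator : callable taking (subsample, weights) with weights summing to n
    # b         : subsample size (e.g. roughly n**0.6); s subsamples, r resamples each
    rng = rng or np.random.default_rng()
    n = len(data)
    per_subsample_se = []
    for _ in range(s):
        sub = data[rng.choice(n, size=b, replace=False)]
        estimates = []
        for _ in range(r):
            counts = rng.multinomial(n, np.full(b, 1.0 / b))   # emulate an n-sized resample
            estimates.append(estimator(sub, counts))
        per_subsample_se.append(np.std(estimates))              # quality assessment: standard error
    return np.mean(per_subsample_se)

# Example: standard error of the mean, with the weighted mean as the estimator.
# se = bag_of_little_bootstraps(x, lambda d, w: np.average(d, weights=w), b=int(len(x)**0.6))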
Greedy Algorithms for Sparse Reinforcement Learning
Painter-Wakefield, Christopher, Parr, Ronald
Feature selection and regularization are becoming increasingly prominent tools in the efforts of the reinforcement learning (RL) community to expand the reach and applicability of RL. One approach to the problem of feature selection is to impose a sparsity-inducing form of regularization on the learning method. Recent work on $L_1$ regularization has adapted techniques from the supervised learning literature for use with RL. Another approach that has received renewed attention in the supervised learning community is that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the $L_1$ regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. This paper considers variants of orthogonal matching pursuit (OMP) applied to reinforcement learning. The resulting algorithms are analyzed and compared experimentally with existing $L_1$ regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant, OMP-BRM, provides promising theoretical guarantees under certain assumptions on the feature dictionary. Another variant, OMP-TD, empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.
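A sketch of the greedy loop suggested by the abstract, in the spirit of OMP-TD: repeatedly add the feature most correlated with the current TD residual, then refit a least-squares temporal-difference solution on the active set. The stopping threshold and the absence of regularisation are our simplifying assumptions.

import numpy as np

def omp_td(Phi, Phi_next, r, gamma, beta, max_features=None):
    # Phi, Phi_next : (n, K) feature matrices for sampled states and their successors
    # r             : (n,) observed rewards; gamma : discount factor
    # beta          : correlation threshold below which no further feature is added
    n, K = Phi.shape
    active, w = [], np.zeros(0)
    while max_features is None or len(active) < max_features:
        v = Phi[:, active] @ w if active else np.zeros(n)
        v_next = Phi_next[:, active] @ w if active else np.zeros(n)
        residual = r + gamma * v_next - v                  # TD (Bellman) residual
        corr = np.abs(Phi.T @ residual) / n
        corr[active] = -np.inf                             # consider only unused features
        k = int(np.argmax(corr))
        if corr[k] < beta:
            break
        active.append(k)
        A = Phi[:, active].T @ (Phi[:, active] - gamma * Phi_next[:, active])
        w = np.linalg.solve(A, Phi[:, active].T @ r)       # LSTD refit on the active set
    return active, w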
Copula Mixture Model for Dependency-seeking Clustering
We introduce a copula mixture model to perform dependency-seeking clustering when co-occurring samples from different data sources are available. The model takes advantage of the great flexibility offered by the copulas framework to extend mixtures of Canonical Correlation Analysis to multivariate data with arbitrary continuous marginal densities. We formulate our model as a non-parametric Bayesian mixture, while providing efficient MCMC inference. Experiments on synthetic and real data demonstrate that the increased flexibility of the copula mixture significantly improves the clustering and the interpretability of the results.
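To illustrate the copula ingredient (dependence modelled separately from arbitrary continuous marginals), here is a sketch of a Gaussian copula log density evaluated on observations already mapped to the uniform scale, e.g. by their empirical marginal CDFs; the paper's nonparametric Bayesian mixture and its MCMC sampler are not reproduced.

import numpy as np
from scipy.stats import norm

def gaussian_copula_logpdf(U, R):
    # U : (n, d) observations on the uniform scale (e.g. empirical marginal CDF values)
    # R : (d, d) copula correlation matrix
    Q = norm.ppf(np.clip(U, 1e-12, 1 - 1e-12))             # map to normal scores
    _, logdet = np.linalg.slogdet(R)
    quad = np.einsum('ni,ij,nj->n', Q, np.linalg.inv(R) - np.eye(R.shape[0]), Q)
    return -0.5 * (logdet + quad)                          # one log density per row of U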
Consistent Multilabel Ranking through Univariate Losses
Dembczynski, Krzysztof, Kotlowski, Wojciech, Huellermeier, Eyke
We consider the problem of rank loss minimization in the setting of multilabel classification, which is usually tackled by means of convex surrogate losses defined on pairs of labels. Very recently, this approach was put into question by a negative result showing that commonly used pairwise surrogate losses, such as exponential and logistic losses, are inconsistent. In this paper, we show a positive result which is arguably surprising in light of the previous one: the simpler univariate variants of exponential and logistic surrogates (i.e., defined on single labels) are consistent for rank loss minimization. Instead of directly proving convergence, we give a much stronger result by deriving regret bounds and convergence rates. The proposed losses suggest efficient and scalable algorithms, which are tested experimentally.
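A minimal sketch of the univariate construction: each label receives its own score, the per-label logistic losses are summed, and the predicted ranking simply sorts labels by score. Any label weighting used in the paper is omitted.

import numpy as np

def univariate_logistic_loss(scores, labels):
    # labels in {-1, +1}, one relevance indicator per label; one real-valued score per label
    return np.sum(np.log1p(np.exp(-labels * scores)))

def rank_labels(scores):
    # Predicted ranking: labels sorted by decreasing score.
    return np.argsort(-scores)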
An Adaptive Algorithm for Finite Stochastic Partial Monitoring
Bartok, Gabor, Zolghadr, Navid, Szepesvari, Csaba
We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both "easy" and "hard" problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy is in an "easy region" of the strategy space then the regret grows as if the problem were easy. As an implication, we show that under some reasonable additional assumptions, the algorithm enjoys an $O(\sqrt{T})$ regret in Dynamic Pricing, proven to be hard by Bartok et al. (2011).
Discriminative Learning via Semidefinite Probabilistic Models
Crammer, Koby, Globerson, Amir
Discriminative linear models are a popular tool in machine learning. These can be generally divided into two types: The first is linear classifiers, such as support vector machines, which are well studied and provide state-of-the-art results. One shortcoming of these models is that their output (known as the 'margin') is not calibrated, and cannot be translated naturally into a distribution over the labels. Thus, it is difficult to incorporate such models as components of larger systems, unlike probability-based approaches. The second type of approach constructs class-conditional distributions using a nonlinearity (e.g. log-linear models), but is occasionally worse in terms of classification error. We propose a supervised learning method which combines the best of both approaches. Specifically, our method provides a distribution over the labels, which is a linear function of the model parameters. As a consequence, differences between probabilities are linear functions, a property which most probabilistic models (e.g. log-linear) do not have. Our model assumes that classes correspond to linear subspaces (rather than to half spaces). Using a relaxed projection operator, we construct a measure which evaluates the degree to which a given vector 'belongs' to a subspace, resulting in a distribution over labels. Interestingly, this view is closely related to similar concepts in quantum detection theory. The resulting models can be trained either to maximize the margin or to optimize average likelihood measures. The corresponding optimization problems are semidefinite programs which can be solved efficiently. We illustrate the performance of our algorithm on real world datasets, and show that it outperforms second-order kernel methods.
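One way to read the construction described here, in quantum-detection terms: labels correspond to positive semidefinite matrices that sum to the identity, and the label distribution is linear in those parameters. The sketch below follows that reading; the exact parameterisation is our assumption, not necessarily the paper's.

import numpy as np

def class_probabilities(x, A):
    # A : list of positive semidefinite matrices, one per label, summing to the identity
    # x : feature vector, normalised to unit length
    x = x / np.linalg.norm(x)
    p = np.array([x @ A_y @ x for A_y in A])   # tr(A_y x x^T), linear in the parameters A_y
    return p / p.sum()                         # renormalise against numerical drift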
Cross-Domain Multitask Learning with Latent Probit Models
Han, Shaobo, Liao, Xuejun, Carin, Lawrence
Learning multiple tasks across heterogeneous domains is a challenging problem since the feature space may not be the same for different tasks. We assume the data in multiple tasks are generated from a latent common domain via sparse domain transforms and propose a latent probit model (LPM) to jointly learn the domain transforms and the shared probit classifier in the common domain. To learn meaningful task relatedness and avoid over-fitting in classification, we introduce sparsity in the domain transform matrices, as well as in the common classifier. We derive theoretical bounds for the estimation error of the classifier in terms of the sparsity of domain transforms. An expectation-maximization algorithm is derived for learning the LPM. The effectiveness of the approach is demonstrated on several real datasets.
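A sketch of how prediction could look under the generative story described: map a task-specific observation back into the latent common domain through its (sparse) domain transform, then apply the shared probit classifier. The ridge-regularised inversion and all symbols here are illustrative assumptions, not the paper's EM procedure.

import numpy as np
from scipy.stats import norm

def predict_latent_probit(x, W_t, beta, ridge=1e-3):
    # x    : observation in task t's feature space
    # W_t  : (p_t, d) sparse domain transform from the latent common domain to task t
    # beta : shared probit classifier weights in the common domain
    d = W_t.shape[1]
    z_hat = np.linalg.solve(W_t.T @ W_t + ridge * np.eye(d), W_t.T @ x)   # back-project to latent domain
    return norm.cdf(beta @ z_hat)              # P(y = 1 | x, task t)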