Plotting

 Country


Collective Graphical Models

Neural Information Processing Systems

There are many settings in which we wish to fit a model of the behavior of individuals but where our data consist only of aggregate information (counts or low-dimensional contingency tables). This paper introduces Collective Graphical Models --a framework for modeling and probabilistic inference that operates directly on the sufficient statistics of the individual model. We derive a highly-efficient Gibbs sampling algorithm for sampling from the posterior distribution of the sufficient statistics conditioned on noisy aggregate observations, prove its correctness, and demonstrate its effectiveness experimentally.


Improved Algorithms for Linear Stochastic Bandits

Neural Information Processing Systems

We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer's UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.


Co-Training for Domain Adaptation

Neural Information Processing Systems

Domain adaptation algorithms seek to generalize a model trained in a source domain to a new target domain. In many practical cases, the source and target distributions can differ substantially, and in some cases crucial target features may not have support in the source domain. In this paper we introduce an algorithm that bridges the gap between source and target domains by slowly adding both the target features and instances in which the current algorithm is the most confident. Our algorithm is a variant of co-training, and we name it CODA (Co-training for domain adaptation). Unlike the original co-training work, we do not assume a particular feature split. Instead, for each iteration of co-training, we add target features and formulate a single optimization problem which simultaneously learns a target predictor, a split of the feature space into views, and a shared subset of source and target features to include in the predictor. CODA significantly out-performs the state-of-the-art on the 12-domain benchmark data set of Blitzer et al.. Indeed, over a wide range (65 of 84 comparisons) of target supervision, ranging from no labeled target data to a relatively large number of target labels, CODA achieves the best performance.


Similarity-based Learning via Data Driven Embeddings

Neural Information Processing Systems

We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by (Balcan-Blum 2006) and (Wang et al 2007). An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees that the goodness criterion best suited to a problem can itself be learned which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin.


Structure Learning for Optimization

Neural Information Processing Systems

We describe a family of global optimization procedures that automatically decompose optimization problems into smaller loosely coupled problems, then combine the solutions of these with message passing algorithms. We show empirically that these methods excel in avoiding local minima and produce better solutions with fewer function evaluations than existing global optimization methods. To develop these methods, we introduce a notion of coupling between variables of optimization that generalizes the notion of coupling that arises from factoring functions into terms that involve small subsets of the variables. It therefore subsumes the notion of independence between random variables in statistics, sparseness of the Hessian in nonlinear optimization, and the generalized distributive law. Despite being more general, this notion of coupling is easier to verify empirically -- making structure estimation easy -- yet it allows us to migrate well-established inference methods on graphical models to the setting of global optimization.


On the accuracy of l1-filtering of signals with block-sparse structure

Neural Information Processing Systems

We discuss new methods for the recovery of signals with block-sparse structure, based on l1-minimization. Our emphasis is on the efficiently computable error bounds for the recovery routines. We optimize these bounds with respect to the method parameters to construct the estimators with improved statistical properties. We justify the proposed approach with an oracle inequality which links the properties of the recovery algorithms and the best estimation performance.


Multiclass Boosting: Theory and Algorithms

Neural Information Processing Systems

The problem of multiclass boosting is considered. A new framework,based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets.


Bayesian Bias Mitigation for Crowdsourcing

Neural Information Processing Systems

Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the {\it effects} of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the {\it sources} of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics.


Phase transition in the family of p-resistances

Neural Information Processing Systems

We study the family of p-resistances on graphs for p ≥ 1. This family generalizes the standard resistance distance. We prove that for any fixed graph, for p=1, the p-resistance coincides with the shortest path distance, for p=2 it coincides with the standard resistance distance, and for p → ∞ it converges to the inverse of the minimal s-t-cut in the graph. Secondly, we consider the special case of random geometric graphs (such as k-nearest neighbor graphs) when the number n of vertices in the graph tends to infinity. We prove that an interesting phase-transition takes place. There exist two critical thresholds p^* and p^** such that if p < p^*, then the p-resistance depends on meaningful global properties of the graph, whereas if p > p^**, it only depends on trivial local quantities and does not convey any useful information. We can explicitly compute the critical values: p^* = 1 + 1/(d-1) and p^** = 1 + 1/(d-2) where d is the dimension of the underlying space (we believe that the fact that there is a small gap between p^* and p^** is an artifact of our proofs. We also relate our findings to Laplacian regularization and suggest to use q-Laplacians as regularizers, where q satisfies 1/p^* + 1/q = 1.


Learning unbelievable probabilities

Neural Information Processing Systems

Loopy belief propagation performs approximate inference on graphical models with loops. One might hope to compensate for the approximation by adjusting model parameters. Learning algorithms for this purpose have been explored previously, and the claim has been made that every set of locally consistent marginals can arise from belief propagation run on a graphical model. On the contrary, here we show that many probability distributions have marginals that cannot be reached by belief propagation using any set of model parameters or any learning algorithm. We call such marginals `unbelievable.' This problem occurs whenever the Hessian of the Bethe free energy is not positive-definite at the target marginals. All learning algorithms for belief propagation necessarily fail in these cases, producing beliefs or sets of beliefs that may even be worse than the pre-learning approximation. We then show that averaging inaccurate beliefs, each obtained from belief propagation using model parameters perturbed about some learned mean values, can achieve the unbelievable marginals.