Goto

Collaborating Authors

 Mathematical & Statistical Methods


Edge-exchangeable graphs and sparsity

arXiv.org Machine Learning

A known failing of many popular random graph models is that the Aldous-Hoover Theorem guarantees these graphs are dense with probability one; that is, the number of edges grows quadratically with the number of nodes. This behavior is considered unrealistic in observed graphs. We define a notion of edge exchangeability for random graphs in contrast to the established notion of infinite exchangeability for random graphs --- which has traditionally relied on exchangeability of nodes (rather than edges) in a graph. We show that, unlike node exchangeability, edge exchangeability encompasses models that are known to provide a projective sequence of random graphs that circumvent the Aldous-Hoover Theorem and exhibit sparsity, i.e., sub-quadratic growth of the number of edges with the number of nodes. We show how edge-exchangeability of graphs relates naturally to existing notions of exchangeability from clustering (a.k.a. partitions) and other familiar combinatorial structures.


Analysis of Crowdsourced Sampling Strategies for HodgeRank with Sparse Random Graphs

arXiv.org Machine Learning

Crowdsourcing enables researchers to conduct social experiments on a heterogenous set of participants and at a lower economic cost than conventional laboratory studies. For example, researchers can harness internet users to conduct user studies on their personal computers. Among various approaches to conduct subjective tests, pairwise comparisons are expected to yield more reliable results. However, in crowdsourced studies, the individuals performing the ratings are diverse compared to more controlled settings, which is difficult to control for using traditional experimental designs; researchers have recently proposed several randomized methods to conduct user studies [1, 2, 3], which accommodate incomplete and imbalanced data. HodgeRank, as an application of combinatorial Hodge theory to the preference or rank aggregation problem from pairwise comparison data, possibly being incomplete and imbalanced, was first introduced by [4], and inspired a series of studies in statistical ranking [5, 6, 7, 8]. Hodge theory has also found applications in game theory [9] and computer vision [10, 11], in addition to traditional applications in fluid mechanics [12] etc. HodgeRank formulates the ranking problem in terms of the discrete Hodge decomposition of the pairwise data and shows that it can be decomposed into three orthogonal components: a gradient flow representing a global rating (optimal in the L


Conditional Risk Minimization for Stochastic Processes

arXiv.org Machine Learning

We study the task of learning from non-i.i.d. data. In particular, we aim at learning predictors that minimize the conditional risk for a stochastic process, i.e. the expected loss of the predictor on the next point conditioned on the set of training samples observed so far. For non-i.i.d. data, the training set contains information about the upcoming samples, so learning with respect to the conditional distribution can be expected to yield better predictors than one obtains from the classical setting of minimizing the marginal risk. Our main contribution is a practical estimator for the conditional risk based on the theory of non-parametric time-series prediction, and a finite sample concentration bound that establishes uniform convergence of the estimator to the true conditional risk under certain regularity assumptions on the process.


Stochastic dual averaging methods using variance reduction techniques for regularized empirical risk minimization problems

arXiv.org Machine Learning

We consider a composite convex minimization problem associated with regularized empirical risk minimization, which often arises in machine learning. We propose two new stochastic gradient methods that are based on stochastic dual averaging method with variance reduction. Our methods generate a sparser solution than the existing methods because we do not need to take the average of the history of the solutions. This is favorable in terms of both interpretability and generalization. Moreover, our methods have theoretical support for both a strongly and a non-strongly convex regularizer and achieve the best known convergence rates among existing nonaccelerated stochastic gradient methods.


Stochastic Process Bandits: Upper Confidence Bounds Algorithms via Generic Chaining

arXiv.org Machine Learning

The paper considers the problem of global optimization in the setup of stochastic process bandits. We introduce an UCB algorithm which builds a cascade of discretization trees based on generic chaining in order to render possible his operability over a continuous domain. The theoretical framework applies to functions under weak probabilistic smoothness assumptions and also extends significantly the spectrum of application of UCB strategies. Moreover generic regret bounds are derived which are then specialized to Gaussian processes indexed on infinite-dimensional spaces as well as to quadratic forms of Gaussian processes. Lower bounds are also proved in the case of Gaussian processes to assess the optimality of the proposed algorithm.


A Variational Analysis of Stochastic Gradient Algorithms

arXiv.org Machine Learning

Stochastic Gradient Descent (SGD) is an important algorithm in machine learning. With constant learning rates, it is a stochastic process that, after an initial phase of convergence, generates samples from a stationary distribution. We show that SGD with constant rates can be effectively used as an approximate posterior inference algorithm for probabilistic modeling. Specifically, we show how to adjust the tuning parameters of SGD such as to match the resulting stationary distribution to the posterior. This analysis rests on interpreting SGD as a continuous-time stochastic process and then minimizing the Kullback-Leibler divergence between its stationary distribution and the target posterior. (This is in the spirit of variational inference.) In more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then use properties of this process to derive the optimal parameters. This theoretical framework also connects SGD to modern scalable inference algorithms; we analyze the recently proposed stochastic gradient Fisher scoring under this perspective. We demonstrate that SGD with properly chosen constant rates gives a new way to optimize hyperparameters in probabilistic models.


Achieving Exact Cluster Recovery Threshold via Semidefinite Programming

arXiv.org Machine Learning

The community detection problem refers to finding the underlying communities within a network using only the knowledge of the network topology [16, 31]. This paper considers the following probabilistic model for generating a network with underlying community structures: Suppose that out of a total of n vertices, rK of them are partitioned into r clusters of size K, and the remaining n rK vertices do not belong to any clusters (called outlier vertices); a random graph G is generated based on the cluster structure, where each pair of vertices is connected independently with probability p if they are in the same cluster or q otherwise. In particular, an outlier vertex is connected to any other vertex with probability q. This random graph ensemble is known as the planted cluster model [10] with parameters n, r, K N and p, q [0, 1] such that n rK. In particular, we call p and q the in-cluster and cross-cluster edge density, respectively. The planted cluster model encompasses several classical planted random graph models including planted clique [5], planted coloring [4], planted dense subgraph [6], planted partition [11], and the stochastic block model [23], which have been widely used for studying the community detection and graph partitioning problem (see, e.g., [26, 12, 30, 9] and the references therein). This paper was accepted to IEEE Transactions on Information Theory on January 3, 2016. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Hong Kong, June, 2015 [20].


COEVOLVE: A Joint Point Process Model for Information Diffusion and Network Co-evolution

Neural Information Processing Systems

Information diffusion in online social networks is affected by the underlying network topology, but it also has the power to change it. Online users are constantly creating new links when exposed to new information sources, and in turn these links are alternating the way information spreads. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics.We propose a temporal point process model, COEVOLVE, for such joint dynamics, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate interleaved diffusion and network events, and generate traces obeying common diffusion and network patterns observed in real-world networks. Furthermore, we also develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.


Active Sampler: Light-weight Accelerator for Complex Data Analytics at Scale

arXiv.org Machine Learning

Recent years have witnessed amazing outcomes from "Big Models" trained by "Big Data". Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration. Typically, the data are either uniformly sampled or sequentially accessed. In this paper, we study how the data access pattern can affect model training. We propose an Active Sampler algorithm, where training data with more "learning value" to the model are sampled more frequently. The goal is to focus training effort on valuable instances near the classification boundaries, rather than evident cases, noisy data or outliers. We show the correctness and optimality of Active Sampler in theory, and then develop a light-weight vectorized implementation. Active Sampler is orthogonal to most approaches optimizing the efficiency of large-scale data analytics, and can be applied to most analytics models trained by stochastic gradient descent (SGD) algorithm. Extensive experimental evaluations demonstrate that Active Sampler can speed up the training procedure of SVM, feature selection and deep learning, for comparable training quality by 1.6-2.2x.


Sharp Finite-Time Iterated-Logarithm Martingale Concentration

arXiv.org Machine Learning

Martingales are indispensable in studying the temporal dynamics of stochastic processes arising in a multitude of fields [10, 14]. Particularly when such processes have complex long-range dependences, it is often of interest to concentrate martingales uniformly over time. On the theoretical side, a fundamental limit to such concentration is expressed by the law of the iterated logarithm (LIL). However, this only concerns asymptotic behavior. In many applications, it is more natural to instead consider concentration that holds uniformly over all finite times. This manuscript presents such bounds for the large classes of martingales which are addressed by Hoeffding [11] and Bernstein [8] inequalities. These new results are optimal within small constants, and can be viewed as finite-time generalizations of the upper half of the LIL.