Statistical Learning
Parallel coordinate descent for the Adaboost problem
We design a randomised parallel version of Adaboost based on previous studies on parallel coordinate descent. The algorithm uses the fact that the logarithm of the exponential loss is a function with coordinate-wise Lipschitz continuous gradient, in order to define the step lengths. We provide the proof of convergence for this randomised Adaboost algorithm and a theoretical parallelisation speedup factor. We finally provide numerical examples on learning problems of various sizes that show that the algorithm is competitive with concurrent approaches, especially for large scale problems.
Learning Hidden Structures with Relational Models by Adequately Involving Rich Information in A Network
Fan, Xuhui, Da Xu, Richard Yi, Cao, Longbing, Song, Yin
Effectively modelling hidden structures in a network is very practical but theoretically challenging. Existing relational models only involve very limited information, namely the binary directional link data, embedded in a network to learn hidden networking structures. There is other rich and meaningful information (e.g., various attributes of entities and more granular information than binary elements such as "like" or "dislike") missed, which play a critical role in forming and understanding relations in a network. In this work, we propose an informative relational model (InfRM) framework to adequately involve rich information and its granularity in a network, including metadata information about each entity and various forms of link data. Firstly, an effective metadata information incorporation method is employed on the prior information from relational models MMSB and LFRM. This is to encourage the entities with similar metadata information to have similar hidden structures. Secondly, we propose various solutions to cater for alternative forms of link data. Substantial efforts have been made towards modelling appropriateness and efficiency, for example, using conjugate priors. We evaluate our framework and its inference algorithms in different datasets, which shows the generality and effectiveness of our models in capturing implicit structures in networks.
High dimensional Sparse Gaussian Graphical Mixture Model
This paper considers the problem of networks reconstruction from heterogeneous data using a Gaussian Graphical Mixture Model (GGMM). It is well known that parameter estimation in this context is challenging due to large numbers of variables coupled with the degeneracy of the likelihood. We propose as a solution a penalized maximum likelihood technique by imposing an $l_{1}$ penalty on the precision matrix. Our approach shrinks the parameters thereby resulting in better identifiability and variable selection. We use the Expectation Maximization (EM) algorithm which involves the graphical LASSO to estimate the mixing coefficients and the precision matrices. We show that under certain regularity conditions the Penalized Maximum Likelihood (PML) estimates are consistent. We demonstrate the performance of the PML estimator through simulations and we show the utility of our method for high dimensional data analysis in a genomic application.
Spectral Clustering with Epidemic Diffusion
Smith, Laura M., Lerman, Kristina, Garcia-Cardona, Cristina, Percus, Allon G., Ghosh, Rumi
Spectral clustering is widely used to partition graphs into distinct modules or communities. Existing methods for spectral clustering use the eigenvalues and eigenvectors of the graph Laplacian, an operator that is closely associated with random walks on graphs. We propose a new spectral partitioning method that exploits the properties of epidemic diffusion. An epidemic is a dynamic process that, unlike the random walk, simultaneously transitions to all the neighbors of a given node. We show that the replicator, an operator describing epidemic diffusion, is equivalent to the symmetric normalized Laplacian of a reweighted graph with edges reweighted by the eigenvector centralities of their incident nodes. Thus, more weight is given to edges connecting more central nodes. We describe a method that partitions the nodes based on the componentwise ratio of the replicator's second eigenvector to the first, and compare its performance to traditional spectral clustering techniques on synthetic graphs with known community structure. We demonstrate that the replicator gives preference to dense, clique-like structures, enabling it to more effectively discover communities that may be obscured by dense intercommunity linking.
Multivariate regression and fit function uncertainty
Kovesarki, Peter, Brock, Ian C.
This article describes a multivariate polynomial regression method where the uncertainty of the input parameters are approximated with Gaussian distributions, derived from the central limit theorem for large weighted sums, directly from the training sample. The estimated uncertainties can be propagated into the optimal fit function, as an alternative to the statistical bootstrap method. This uncertainty can be propagated further into a loss function like quantity, with which it is possible to calculate the expected loss function, and allows to select the optimal polynomial degree with statistical significance. Combined with simple phase space splitting methods, it is possible to model most features of the training data even with low degree polynomials or constants.
Developments in the theory of randomized shortest paths with a comparison of graph node distances
Kivimรคki, Ilkka, Shimbo, Masashi, Saerens, Marco
There have lately been several suggestions for parametrized distances on a graph that generalize the shortest path distance and the commute time or resistance distance. The need for developing such distances has risen from the observation that the above-mentioned common distances in many situations fail to take into account the global structure of the graph. In this article, we develop the theory of one family of graph node distances, known as the randomized shortest path dissimilarity, which has its foundation in statistical physics. We show that the randomized shortest path dissimilarity can be easily computed in closed form for all pairs of nodes of a graph. Moreover, we come up with a new definition of a distance measure that we call the free energy distance. The free energy distance can be seen as an upgrade of the randomized shortest path dissimilarity as it defines a metric, in addition to which it satisfies the graph-geodetic property. The derivation and computation of the free energy distance are also straightforward. We then make a comparison between a set of generalized distances that interpolate between the shortest path distance and the commute time, or resistance distance. This comparison focuses on the applicability of the distances in graph node clustering and classification. The comparison, in general, shows that the parametrized distances perform well in the tasks. In particular, we see that the results obtained with the free energy distance are among the best in all the experiments.
On statistics, computation and scalability
When coupled with the requirement that an answer to an inferential question be delivered within a certain time budget, this question has significant repercussions for the field of statistics. With the goal of identifying "time-data tradeoffs," we investigate some of the statistical consequences of computational perspectives on scability, in particular divide-and-conquer methodology and hierarchies of convex relaxations. The fields of computer science and statistics have undergone mostly separate evolutions during their respective histories. This is changing, due in part to the phenomenon of "Big Data." Indeed, science and technology are currently generating very large datasets and the gatherers of these data have increasingly ambitious inferential goals, trends which point towards a future in which statistics will be forced to deal with problems of scale in order to remain relevant. Currently the field seems little prepared to meet this challenge.
Structured learning of sum-of-submodular higher order energy functions
Fix, Alexander, Joachims, Thorsten, Park, Sam, Zabih, Ramin
Submodular functions can be exactly minimized in polynomial time, and the special case that graph cuts solve with max flow [18] has had significant impact in computer vision [5, 20, 27]. In this paper we address the important class of sum-of-submodular (SoS) functions [2, 17], which can be efficiently minimized via a variant of max flow called submodular flow [6]. SoS functions can naturally express higher order priors involving, e.g., local image patches; however, it is difficult to fully exploit their expressive power because they have so many parameters. Rather than trying to formulate existing higher order priors as an SoS function, we take a discriminative learning approach, effectively searching the space of SoS functions for a higher order prior that performs well on our training set. We adopt a structural SVM approach [14, 33] and formulate the training problem in terms of quadratic programming; as a result we can efficiently search the space of SoS priors via an extended cutting-plane algorithm. We also show how the state-of-the-art max flow method for vision problems [10] can be modified to efficiently solve the submodular flow problem. Experimental comparisons are made against the OpenCV implementation of the GrabCut interactive segmentation technique [27], which uses hand-tuned parameters instead of machine learning. On a standard dataset [11] our method learns higher order priors with hundreds of parameter values, and produces significantly better segmentations. While our focus is on binary labeling problems, we show that our techniques can be naturally generalized to handle more than two labels.
An upper bound on prototype set size for condensed nearest neighbor
The nearest neighbor (NN) rule assigns to an unclassified point the class of a closest point from a set of prototypical points. The NN algorithm stores every training point as a prototypical point and classifies new points according to the NN rule. A nice property is that, for arbitrary class distributions, as the number of training points goes to infinity, the error of the rule produced by the NN algorithm converges to within twice the Bayes error [4]. Unfortunately, storing every training point as a prototypical point can be impractical for huge training sets in terms of both memory complexity and the time complexity of classifying according to the NN rule. As a result, many techniques exist for reducing the size of the set of prototypical points.
Bayesian Inference in Sparse Gaussian Graphical Models
Orchard, Peter, Agakov, Felix, Storkey, Amos
One of the fundamental tasks of science is to find explainable relationships between observed phenomena. One approach to this task that has received attention in recent years is based on probabilistic graphical modelling with sparsity constraints on model structures. In this paper, we describe two new approaches to Bayesian inference of sparse structures of Gaussian graphical models (GGMs). One is based on a simple modification of the cutting-edge block Gibbs sampler for sparse GGMs, which results in significant computational gains in high dimensions. The other method is based on a specific construction of the Hamiltonian Monte Carlo sampler, which results in further significant improvements. We compare our fully Bayesian approaches with the popular regularisation-based graphical LASSO, and demonstrate significant advantages of the Bayesian treatment under the same computing costs. We apply the methods to a broad range of simulated data sets, and a real-life financial data set.