Statistical Learning
Combining Fuzzy Cognitive Maps and Discrete Random Variables
In this paper we propose an extension to the Fuzzy Cognitive Maps (FCMs) that aims at aggregating a number of reasoning tasks into a one parallel run. The described approach consists in replacing real-valued activation levels of concepts (and further influence weights) by random variables. Such extension, followed by the implemented software tool, allows for determining ranges reached by concept activation levels, sensitivity analysis as well as statistical analysis of multiple reasoning results. We replace multiplication and addition operators appearing in the FCM state equation by appropriate convolutions applicable for discrete random variables. To make the model computationally feasible, it is further augmented with aggregation operations for discrete random variables. We discuss four implemented aggregators, as well as we report results of preliminary tests.
Joint limiting laws for high-dimensional independence tests
Testing independence is of significant interest in many important areas of large-scale inference. Using extreme-value form statistics to test against sparse alternatives and using quadratic form statistics to test against dense alternatives are two important testing procedures for high-dimensional independence. However, quadratic form statistics suffer from low power against sparse alternatives, and extreme-value form statistics suffer from low power against dense alternatives with small disturbances and may have size distortions due to its slow convergence. For real-world applications, it is important to derive powerful testing procedures against more general alternatives. Based on intermediate limiting distributions, we derive (model-free) joint limiting laws of extreme-value form and quadratic form statistics, and surprisingly, we prove that they are asymptotically independent. Given such asymptotic independencies, we propose (model-free) testing procedures to boost the power against general alternatives and also retain the correct asymptotic size. Under the high-dimensional setting, we derive the closed-form limiting null distributions, and obtain their explicit rates of uniform convergence. We prove their consistent statistical powers against general alternatives. We demonstrate the performance of our proposed test statistics in simulation studies. Our work provides very helpful insights to high-dimensional independence tests, and fills an important gap.
Matrix Completion Under Monotonic Single Index Models
Ganti, Ravi, Balzano, Laura, Willett, Rebecca
Most recent results in matrix completion assume that the matrix under consideration is low-rank or that the columns are in a union of low-rank subspaces. In real-world settings, however, the linear structure underlying these models is distorted by a (typically unknown) nonlinear transformation. This paper addresses the challenge of matrix completion in the face of such nonlinearities. Given a few observations of a matrix that are obtained by applying a Lipschitz, monotonic function to a low rank matrix, our task is to estimate the remaining unobserved entries. We propose a novel matrix completion method that alternates between low-rank matrix estimation and monotonic function estimation to estimate the missing matrix elements. Mean squared error bounds provide insight into how well the matrix can be estimated based on the size, rank of the matrix and properties of the nonlinear transformation. Empirical results on synthetic and real-world datasets demonstrate the competitiveness of the proposed approach.
Learning with a Wasserstein Loss
Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya-Polo, Mauricio, Poggio, Tomaso
Tomaso Poggio Center for Brains, Minds and Machines Massachusetts Institute of Technology tp@ai.mit.edu Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from probability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn't use the metric.
Conditional probability generation methods for high reliability effects-based decision making
Garn, Wolfgang, Louvieris, Panos
Decision making is often based on Bayesian networks. The building blocks for Bayesian networks are its conditional probability tables (CPTs). These tables are obtained by parameter estimation methods, or they are elicited from subject matter experts (SME). Some of these knowledge representations are insufficient approximations. Using knowledge fusion of cause and effect observations lead to better predictive decisions. We propose three new methods to generate CPTs, which even work when only soft evidence is provided. The first two are novel ways of mapping conditional expectations to the probability space. The third is a column extraction method, which obtains CPTs from nonlinear functions such as the multinomial logistic regression. Case studies on military effects and burnt forest desertification have demonstrated that so derived CPTs have highly reliable predictive power, including superiority over the CPTs obtained from SMEs. In this context, new quality measures for determining the goodness of a CPT and for comparing CPTs with each other have been introduced. The predictive power and enhanced reliability of decision making based on the novel CPT generation methods presented in this paper have been confirmed and validated within the context of the case studies.
New Perspectives on $k$-Support and Cluster Norms
McDonald, Andrew M., Pontil, Massimiliano, Stamos, Dimitris
We study a regularizer which is defined as a parameterized infimum of quadratics, and which we call the box-norm. We show that the k-support norm, a regularizer proposed by Argyriou et al. (2012) for sparse vector prediction problems, belongs to this family, and the box-norm can be generated as a perturbation of the former. We derive an improved algorithm to compute the proximity operator of the squared box-norm, and we provide a method to compute the norm. We extend the norms to matrices, introducing the spectral k-support norm and spectral box-norm. We note that the spectral box-norm is essentially equivalent to the cluster norm, a multitask learning regularizer introduced by Jacob et al. (2009a), and which in turn can be interpreted as a perturbation of the spectral k-support norm. Centering the norm is important for multitask learning and we also provide a method to use centered versions of the norms as regularizers. Numerical experiments indicate that the spectral k-support and box-norms and their centered variants provide state of the art performance in matrix completion and multitask learning problems respectively.
Statistical and Computational Guarantees for the Baum-Welch Algorithm
Yang, Fanny, Balakrishnan, Sivaraman, Wainwright, Martin J.
The Hidden Markov Model (HMM) is one of the mainstays of statistical modeling of discrete time series, with applications including speech recognition, computational biology, computer vision and econometrics. Estimating an HMM from its observation process is often addressed via the Baum-Welch algorithm, which is known to be susceptible to local optima. In this paper, we first give a general characterization of the basin of attraction associated with any global optimum of the population likelihood. By exploiting this characterization, we provide non-asymptotic finite sample guarantees on the Baum-Welch updates, guaranteeing geometric convergence to a small ball of radius on the order of the minimax rate around a global optimum. As a concrete example, we prove a linear rate of convergence for a hidden Markov mixture of two isotropic Gaussians given a suitable mean separation and an initialization within a ball of large radius around (one of) the true parameters. To our knowledge, these are the first rigorous local convergence guarantees to global optima for the Baum-Welch algorithm in a setting where the likelihood function is nonconvex. We complement our theoretical results with thorough numerical simulations studying the convergence of the Baum-Welch algorithm and illustrating the accuracy of our predictions.
Large-Scale Optimization Algorithms for Sparse Conditional Gaussian Graphical Models
McCarter, Calvin, Kim, Seyoung
This paper addresses the problem of scalable optimization for L1-regularized conditional Gaussian graphical models. Conditional Gaussian graphical models generalize the well-known Gaussian graphical models to conditional distributions to model the output network influenced by conditioning input variables. While highly scalable optimization methods exist for sparse Gaussian graphical model estimation, state-of-the-art methods for conditional Gaussian graphical models are not efficient enough and more importantly, fail due to memory constraints for very large problems. In this paper, we propose a new optimization procedure based on a Newton method that efficiently iterates over two sub-problems, leading to drastic improvement in computation time compared to the previous methods. We then extend our method to scale to large problems under memory constraints, using block coordinate descent to limit memory usage while achieving fast convergence. Using synthetic and genomic data, we show that our methods can solve one million dimensional problems to high accuracy in a little over a day on a single machine.
K2-ABC: Approximate Bayesian Computation with Kernel Embeddings
Park, Mijung, Jitkrittum, Wittawat, Sejdinovic, Dino
Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) to construct a dissimilarity measure between the observed and simulated data. The embedding of an empirical distribution of the data into a reproducing kernel Hilbert space plays a role of the summary statistic and is sufficient whenever the corresponding kernels are characteristic. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm. M Park and W Jitkrittum contributed equally.
Statistical Learning under Nonstationary Mixing Processes
Hanneke, Steve, Jaakkola, Tommi, Yang, Liu
We study a special case of the problem of statistical learning without the i.i.d. assumption. Specifically, we suppose a learning method is presented with a sequence of data points, and required to make a prediction (e.g., a classification) for each one, and can then observe the loss incurred by this prediction. We go beyond traditional analyses, which have focused on stationary mixing processes or nonstationary product processes, by combining these two relaxations to allow nonstationary mixing processes. We are particularly interested in the case of $\beta$-mixing processes, with the sum of changes in marginal distributions growing sublinearly in the number of samples. Under these conditions, we propose a learning method, and establish that for bounded VC subgraph classes, the cumulative excess risk grows sublinearly in the number of predictions, at a quantified rate.