Statistical Learning
Lossy Compression via Sparse Linear Regression: Computationally Efficient Encoding and Decoding
Venkataramanan, Ramji, Sarkar, Tuhin, Tatikonda, Sekhar
We propose computationally efficient encoders and decoders for lossy compression using a Sparse Regression Code. The codebook is defined by a design matrix and codewords are structured linear combinations of columns of this matrix. The proposed encoding algorithm sequentially chooses columns of the design matrix to successively approximate the source sequence. It is shown to achieve the optimal distortion-rate function for i.i.d Gaussian sources under the squared-error distortion criterion. For a given rate, the parameters of the design matrix can be varied to trade off distortion performance with encoding complexity. An example of such a trade-off as a function of the block length n is the following. With computational resource (space or time) per source sample of O((n/\log n)^2), for a fixed distortion-level above the Gaussian distortion-rate function, the probability of excess distortion decays exponentially in n. The Sparse Regression Code is robust in the following sense: for any ergodic source, the proposed encoder achieves the optimal distortion-rate function of an i.i.d Gaussian source with the same variance. Simulations show that the encoder has good empirical performance, especially at low and moderate rates.
Data generator based on RBF network
There are plenty of problems where the data available is scarce and expensive. We propose a generator of semi-artificial data with similar properties to the original data which enables development and testing of different data mining algorithms and optimization of their parameters. The generated data allow a large scale experimentation and simulations without danger of overfitting. The proposed generator is based on RBF networks which learn sets of Gaussian kernels. Learned Gaussian kernels can be used in a generative mode to generate the data from the same distributions. To asses quality of the generated data we developed several workflows and used them to evaluate the statistical properties of the generated data, structural similarity, and predictive similarity using supervised and unsupervised learning techniques. To determine usability of the proposed generator we conducted a large scale evaluation using 51 UCI data sets. The results show a considerable similarity between the original and generated data and indicate that the method can be useful in several development and simulation scenarios.
Systematic Ensemble Learning for Regression
Aldave, Roberto, Dussault, Jean-Pierre
The motivation of this work is to improve the performance of standard stacking approaches or ensembles, which are composed of simple, heterogeneous base models, through the integration of the generation and selection stages for regression problems. We propose two extensions to the standard stacking approach. In the first extension we combine a set of standard stacking approaches into an ensemble of ensembles using a two-step ensemble learning in the regression setting. The second extension consists of two parts. In the initial part a diversity mechanism is injected into the original training data set, systematically generating different training subsets or partitions, and corresponding ensembles of ensembles. In the final part after measuring the quality of the different partitions or ensembles, a max-min rule-based selection algorithm is used to select the most appropriate ensemble/partition on which to make the final prediction. We show, based on experiments over a broad range of data sets, that the second extension performs better than the best of the standard stacking approaches, and is as good as the oracle of databases, which has the best base model selected by cross-validation for each data set. In addition to that, the second extension performs better than two state-of-the-art ensemble methods for regression, and it is as good as a third state-of-the-art ensemble method.
Accelerating MCMC via Parallel Predictive Prefetching
Angelino, Elaine, Kohler, Eddie, Waterland, Amos, Seltzer, Margo, Adams, Ryan P.
We present a general framework for accelerating a large class of widely used Markov chain Monte Carlo (MCMC) algorithms. Our approach exploits fast, iterative approximations to the target density to speculatively evaluate many potential future steps of the chain in parallel. The approach can accelerate computation of the target distribution of a Bayesian inference problem, without compromising exactness, by exploiting subsets of data. It takes advantage of whatever parallel resources are available, but produces results exactly equivalent to standard serial execution. In the initial burn-in phase of chain evaluation, it achieves speedup over serial evaluation that is close to linear in the number of available cores.
Information-Theoretic Multi-view Domain Adaptation: A Theoretical and Empirical Study
Multi-view learning aims to improve classification performance by leveraging the consistency among different views of data. The incorporation of multiple views was paid little attention in the studies of domain adaptation, where the view consistency based on source data is largely violated in the target domain due to the distribution gap between different domain data. In this paper, we leverage multiple views for cross-domain document classification. The central idea is to strengthen the views' consistency on target data by identifying the associations of domain-specific features from different domains. We present an Information-theoretic Multi-view Adaptation Model (IMAM) using a multi-way clustering scheme, where word and link clusters can draw together seemingly unrelated features across domains, which boosts the consistency between document clusterings that are based on the respective word and link views. Moreover, we demonstrate that IMAM can always find the document clustering with the minimal disagreement rate to the overlap of view-based clusterings. We provide both theoretical and empirical justifications of the proposed method. Our experiments show that IMAM significantly outperforms traditional multi-view algorithm co-training, the co-training-based adaptation algorithm CODA, the single-view transfer model CoCC and the large-margin-based multi-view transfer model MVTL-LM.
Beyond L2-Loss Functions for Learning Sparse Models
Ramamurthy, Karthikeyan Natesan, Aravkin, Aleksandr Y., Thiagarajan, Jayaraman J.
Incorporating sparsity priors in learning tasks can give rise to simple, and interpretable models for complex high dimensional data. Sparse models have found widespread use in structure discovery, recovering data from corruptions, and a variety of large scale unsupervised and supervised learning problems. Assuming the availability of sufficient data, these methods infer dictionaries for sparse representations by optimizing for high-fidelity reconstruction. In most scenarios, the reconstruction quality is measured using the squared Euclidean distance, and efficient algorithms have been developed for both batch and online learning cases. However, new application domains motivate looking beyond conventional loss functions. For example, robust loss functions such as $\ell_1$ and Huber are useful in learning outlier-resilient models, and the quantile loss is beneficial in discovering structures that are the representative of a particular quantile. These new applications motivate our work in generalizing sparse learning to a broad class of convex loss functions. In particular, we consider the class of piecewise linear quadratic (PLQ) cost functions that includes Huber, as well as $\ell_1$, quantile, Vapnik, hinge loss, and smoothed variants of these penalties. We propose an algorithm to learn dictionaries and obtain sparse codes when the data reconstruction fidelity is measured using any smooth PLQ cost function. We provide convergence guarantees for the proposed algorithm, and demonstrate the convergence behavior using empirical experiments. Furthermore, we present three case studies that require the use of PLQ cost functions: (i) robust image modeling, (ii) tag refinement for image annotation and retrieval and (iii) computing empirical confidence limits for subspace clustering.
Hierarchical Block Structures and High-resolution Model Selection in Large Networks
Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs not only for the very popular approach of modularity optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation, and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable, and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.
Non-uniform Feature Sampling for Decision Tree Ensembles
Kyrillidis, Anastasios, Zouzias, Anastasios
We study the effectiveness of non-uniform randomized feature selection in decision tree classification. We experimentally evaluate two feature selection methodologies, based on information extracted from the provided dataset: $(i)$ \emph{leverage scores-based} and $(ii)$ \emph{norm-based} feature selection. Experimental evaluation of the proposed feature selection techniques indicate that such approaches might be more effective compared to naive uniform feature selection and moreover having comparable performance to the random forest algorithm [3]
Disease Prediction based on Functional Connectomes using a Scalable and Spatially-Informed Support Vector Machine
Watanabe, Takanori, Kessler, Daniel, Scott, Clayton, Angstadt, Michael, Sripada, Chandra
Substantial evidence indicates that major psychiatric disorders are associated with distributed neural dysconnectivity, leading to strong interest in using neuroimaging methods to accurately predict disorder status. In this work, we are specifically interested in a multivariate approach that uses features derived from whole-brain resting state functional connectomes. However, functional connectomes reside in a high dimensional space, which complicates model interpretation and introduces numerous statistical and computational challenges. Traditional feature selection techniques are used to reduce data dimensionality, but are blind to the spatial structure of the connectomes. We propose a regularization framework where the 6-D structure of the functional connectome is explicitly taken into account via the fused Lasso or the GraphNet regularizer. Our method only restricts the loss function to be convex and margin-based, allowing non-differentiable loss functions such as the hinge-loss to be used. Using the fused Lasso or GraphNet regularizer with the hinge-loss leads to a structured sparse support vector machine (SVM) with embedded feature selection. We introduce a novel efficient optimization algorithm based on the augmented Lagrangian and the classical alternating direction method, which can solve both fused Lasso and GraphNet regularized SVM with very little modification. We also demonstrate that the inner subproblems of the algorithm can be solved efficiently in analytic form by coupling the variable splitting strategy with a data augmentation scheme. Experiments on simulated data and resting state scans from a large schizophrenia dataset show that our proposed approach can identify predictive regions that are spatially contiguous in the 6-D "connectome space," offering an additional layer of interpretability that could provide new insights about various disease processes.
Random design analysis of ridge regression
Hsu, Daniel, Kakade, Sham M., Zhang, Tong
This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the ``out-of-sample'' prediction error, as opposed to the ``in-sample'' (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.