Goto

Collaborating Authors

 Statistical Learning


Learning Stable Multilevel Dictionaries for Sparse Representations

arXiv.org Machine Learning

Sparse representations using learned dictionaries are being increasingly used with success in several data processing and machine learning applications. The availability of abundant training data necessitates the development of efficient, robust and provably good dictionary learning algorithms. Algorithmic stability and generalization are desirable characteristics for dictionary learning algorithms that aim to build global dictionaries which can efficiently model any test data similar to the training samples. In this paper, we propose an algorithm to learn dictionaries for sparse representations from large scale data, and prove that the proposed learning algorithm is stable and generalizable asymptotically. The algorithm employs a 1-D subspace clustering procedure, the K-hyperline clustering, in order to learn a hierarchical dictionary with multiple levels. We also propose an information-theoretic scheme to estimate the number of atoms needed in each level of learning and develop an ensemble approach to learn robust dictionaries. Using the proposed dictionaries, the sparse code for novel test data can be computed using a low-complexity pursuit procedure. We demonstrate the stability and generalization characteristics of the proposed algorithm using simulations. We also evaluate the utility of the multilevel dictionaries in compressed recovery and subspace learning applications.


Random Forests on Distance Matrices for Imaging Genetics Studies

arXiv.org Machine Learning

The clinical pathology of neurological diseases and the imaging of the human brain are two areas of research that have largely developed along independent lines. It is only in the past few years that the usefulness of noninvasive imaging measurements of the human brain to the diagnosis and early prediction of neurological diseases been widely recognised (Albert et al., 2011; Sperling et al., 2011; Gray et al., 2013). In Alzheimer's Disease (AD), for instance, clinical guidance on the diagnosis of this most common of neurological degenerative disorders has recently been updated to incorporate neuroimaging markers alongside standard cognitive and behavioural tests (Albert et al., 2011; Sperling et al., 2011). The key to the improved characterisation of AD lies in the quantitative nature of the imaging measurements compared to the relatively subjective and imprecise nature of traditional clinical assessments. Imaging biomarkers of cerebral atrophy and of loss of connectivity between key regions in the brain are believed to be reliable indicators of AD and are particularly useful at early disease stages when standard cognitive assessments can be inconclusive. The utility of imaging phenotypes extends beyond diagnosis and prediction to the search for the underlying genetic factors behind neurological disorders (Stein et al., 2010). This comparatively more recent use of neuroimaging measurements in place of case-control labels in genetic association studies defines the emerging field of imaging genetics. The central premise here is that, should they exist, genetic associations to intermediate brain structure and brain function phenotypes are stronger than those with the categorical clinical disease statuses further down the etiological chain (Glahn et al., 2007). Again, the example of AD serves as a good illustration.


Smooth minimization of nonsmooth functions with parallel coordinate descent methods

arXiv.org Machine Learning

We study the performance of a family of randomized parallel coordinate descent methods for minimizing the sum of a nonsmooth and separable convex functions. The problem class includes as a special case L1-regularized L1 regression and the minimization of the exponential loss ("AdaBoost problem"). We assume the input data defining the loss function is contained in a sparse $m\times n$ matrix $A$ with at most $\omega$ nonzeros in each row. Our methods need $O(n \beta/\tau)$ iterations to find an approximate solution with high probability, where $\tau$ is the number of processors and $\beta = 1 + (\omega-1)(\tau-1)/(n-1)$ for the fastest variant. The notation hides dependence on quantities such as the required accuracy and confidence levels and the distance of the starting iterate from an optimal point. Since $\beta/\tau$ is a decreasing function of $\tau$, the method needs fewer iterations when more processors are used. Certain variants of our algorithms perform on average only $O(\nnz(A)/n)$ arithmetic operations during a single iteration per processor and, because $\beta$ decreases when $\omega$ does, fewer iterations are needed for sparser problems.


Scalable Spectral Algorithms for Community Detection in Directed Networks

arXiv.org Machine Learning

Many real world problems can be effectively modeled as pairwise relationship in networks where nodes represent entities of interest and links mimic the interactions or relationships between them. The study of networks, recently referred to as network science, can provide insight into their structures and properties. One particularly interesting problem in network studies is searching for important sub-networks which are called communities, modules or groups. A community in a network is typically characterized by a group of nodes that have more links connected within the community than connected to other nodes (Fortunato, 2010). In many practical applications, the networks in study are directed in nature, such as the World Wide Web, tweeter's follower-followee network, and citation networks. Compared with in-depth studies of community structures in undirected networks (Danon et al., 2005; Fortunato, 2010; Coscia, Giannotti and Pedreschi, 2011), community detection in directed networks has not been as fruitful.


Efficient Sampling from Time-Varying Log-Concave Distributions

arXiv.org Machine Learning

We propose a computationally efficient random walk on a convex body which rapidly mixes and closely tracks a time-varying log-concave distribution. We develop general theoretical guarantees on the required number of steps; this number can be calculated on the fly according to the distance from and the shape of the next distribution. We then illustrate the technique on several examples. Within the context of exponential families, the proposed method produces samples from a posterior distribution which is updated as data arrive in a streaming fashion. The sampling technique can be used to track time-varying truncated distributions, as well as to obtain samples from a changing mixture model, fitted in a streaming fashion to data. In the setting of linear optimization, the proposed method has oracle complexity with best known dependence on the dimension for certain geometries. In the context of online learning and repeated games, the algorithm is an efficient method for implementing no-regret mixture forecasting strategies. Remarkably, in some of these examples, only one step of the random walk is needed to track the next distribution.


Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming

arXiv.org Machine Learning

In this paper, we introduce a new stochastic approximation (SA) type algorithm, namely the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming (SP) problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which consists of applying a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and show that such modification allows to improve significantly the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.


Latent Fisher Discriminant Analysis

arXiv.org Machine Learning

Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. Previous studies have also extended the binary-class case into multi-classes. However, many applications, such as object detection and keyframe extraction cannot provide consistent instance-label pairs, while LDA requires labels on instance level for training. Thus it cannot be directly applied for semi-supervised classification problem. In this paper, we overcome this limitation and propose a latent variable Fisher discriminant analysis model. We relax the instance-level labeling into bag-level, is a kind of semi-supervised (video-level labels of event type are required for semantic frame extraction) and incorporates a data-driven prior over the latent variables. Hence, our method combines the latent variable inference and dimension reduction in an unified bayesian framework. We test our method on MUSK and Corel data sets and yield competitive results compared to the baseline approach. We also demonstrate its capacity on the challenging TRECVID MED11 dataset for semantic keyframe extraction and conduct a human-factors ranking-based experimental evaluation, which clearly demonstrates our proposed method consistently extracts more semantically meaningful keyframes than challenging baselines.


mTim: Rapid and accurate transcript reconstruction from RNA-Seq data

arXiv.org Machine Learning

High-throughput sequencing technology applied to cellular mRNA (RNA-Seq) has revolutionized transcriptome studies [19, 17, 35, among many others]. In contrast to microarray platforms, which it has replaced in many applications, RNA-Seq can not only be used to accurately quantify known transcripts, but also to reveal the precise structure of transcripts at single-nucleotide resolution. RNA-Seq based transcript reconstruction has therefore become a valuable tool for the completion of genome annotations [22, 11, for instance] and further enabled subsequent analyses of differentially expressed genes [2], transcript isoforms [6, 4] and exons [3], all of which generally rely on correctly inferred transcript inventories. De novo transcript reconstruction is thus a pivotal step in the analysis of RNA-Seq data. There are two conceptually different strategies to approach this problem: one can either assemble transcripts directly from RNA-Seq reads using methodology that originated from genome assembly approaches [13, 23, 25].


A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

arXiv.org Machine Learning

The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.


HOL(y)Hammer: Online ATP Service for HOL Light

arXiv.org Artificial Intelligence

HOL(y)Hammer is an online AI/ATP service for formal (computer-understandable) mathematics encoded in the HOL Light system. The service allows its users to upload and automatically process an arbitrary formal development (project) based on HOL Light, and to attack arbitrary conjectures that use the concepts defined in some of the uploaded projects. For that, the service uses several automated reasoning systems combined with several premise selection methods trained on all the project proofs. The projects that are readily available on the server for such query answering include the recent versions of the Flyspeck, Multivariate Analysis and Complex Analysis libraries. The service runs on a 48-CPU server, currently employing in parallel for each task 7 AI/ATP combinations and 4 decision procedures that contribute to its overall performance. The system is also available for local installation by interested users, who can customize it for their own proof development. An Emacs interface allowing parallel asynchronous queries to the service is also provided. The overall structure of the service is outlined, problems that arise and their solutions are discussed, and an initial account of using the system is given.