Goto

Collaborating Authors

 Accuracy


Framework for Multi-task Multiple Kernel Learning and Applications in Genome Analysis

arXiv.org Machine Learning

We present a general regularization-based framework for Multi-task learning (MTL), in which the similarity between tasks can be learned or refined using $\ell_p$-norm Multiple Kernel learning (MKL). Based on this very general formulation (including a general loss function), we derive the corresponding dual formulation using Fenchel duality applied to Hermitian matrices. We show that numerous established MTL methods can be derived as special cases from both, the primal and dual of our formulation. Furthermore, we derive a modern dual-coordinate descend optimization strategy for the hinge-loss variant of our formulation and provide convergence bounds for our algorithm. As a special case, we implement in C++ a fast LibLinear-style solver for $\ell_p$-norm MKL. In the experimental section, we analyze various aspects of our algorithm such as predictive performance and ability to reconstruct task relationships on biologically inspired synthetic data, where we have full control over the underlying ground truth. We also experiment on a new dataset from the domain of computational biology that we collected for the purpose of this paper. It concerns the prediction of transcription start sites (TSS) over nine organisms, which is a crucial task in gene finding. Our solvers including all discussed special cases are made available as open-source software as part of the SHOGUN machine learning toolbox (available at \url{http://shogun.ml}).


Integrative analysis of gene expression and phenotype data

arXiv.org Machine Learning

The linking genotype to phenotype is the fundamental aim of modern genetics. We focus on study of links between gene expression data and phenotype data through integrative analysis. We propose three approaches. 1) The inherent complexity of phenotypes makes high-throughput phenotype profiling a very difficult and laborious process. We propose a method of automated multi-dimensional profiling which uses gene expression similarity. Large-scale analysis show that our method can provide robust profiling that reveals different phenotypic aspects of samples. This profiling technique is also capable of interpolation and extrapolation beyond the phenotype information given in training data. It can be used in many applications, including facilitating experimental design and detecting confounding factors. 2) Phenotype association analysis problems are complicated by small sample size and high dimensionality. Consequently, phenotype-associated gene subsets obtained from training data are very sensitive to selection of training samples, and the constructed sample phenotype classifiers tend to have poor generalization properties. To eliminate these obstacles, we propose a novel approach that generates sequences of increasingly discriminative gene cluster combinations. Our experiments on both simulated and real datasets show robust and accurate classification performance. 3) Many complex phenotypes, such as cancer, are the product of not only gene expression, but also gene interaction. We propose an integrative approach to find gene network modules that activate under different phenotype conditions. Using our method, we discovered cancer subtype-specific network modules, as well as the ways in which these modules coordinate. In particular, we detected a breast-cancer specific tumor suppressor network module with a hub gene, PDGFRL, which may play an important role in this module.


An Efficient Post-Selection Inference on High-Order Interaction Models

arXiv.org Machine Learning

Finding statistically significant high-order interaction features in predictive modeling is important but challenging task. The difficulty lies in the fact that, for a recent applications with high-dimensional covariates, the number of possible high-order interaction features would be extremely large. Identifying statistically significant features from such a huge pool of candidates would be highly challenging both in computational and statistical senses. To work with this problem, we consider a two stage algorithm where we first select a set of high-order interaction features by marginal screening, and then make statistical inferences on the regression model fitted only with the selected features. Such statistical inferences are called post-selection inference (PSI), and receiving an increasing attention in the literature. One of the seminal recent advancements in PSI literature is the works by Lee et al. where the authors presented an algorithmic framework for computing exact sampling distributions in PSI. A main challenge when applying their approach to our high-order interaction models is to cope with the fact that PSI in general depends not only on the selected features but also on the unselected features, making it hard to apply to our extremely high-dimensional high-order interaction models. The goal of this paper is to overcome this difficulty by introducing a novel efficient method for PSI. Our key idea is to exploit the underlying tree structure among high-order interaction features, and to develop a pruning method of the tree which enables us to quickly identify a group of unselected features that are guaranteed to have no influence on PSI. The experimental results indicate that the proposed method allows us to reliably identify statistically significant high-order interaction features with reasonable computational cost.


Fairness-Aware Learning with Restriction of Universal Dependency using f-Divergences

arXiv.org Machine Learning

Fairness-aware learning is a novel framework for classification tasks. Like regular empirical risk minimization (ERM), it aims to learn a classifier with a low error rate, and at the same time, for the predictions of the classifier to be independent of sensitive features, such as gender, religion, race, and ethnicity. Existing methods can achieve low dependencies on given samples, but this is not guaranteed on unseen samples. The existing fairness-aware learning algorithms employ different dependency measures, and each algorithm is specifically designed for a particular one. Such diversity makes it difficult to theoretically analyze and compare them. In this paper, we propose a general framework for fairness-aware learning that uses f-divergences and that covers most of the dependency measures employed in the existing methods. We introduce a way to estimate the f-divergences that allows us to give a unified analysis for the upper bound of the estimation error; this bound is tighter than that of the existing convergence rate analysis of the divergence estimation. With our divergence estimate, we propose a fairness-aware learning algorithm, and perform a theoretical analysis of its generalization error. Our analysis reveals that, under mild assumptions and even with enforcement of fairness, the generalization error of our method is $O(\sqrt{1/n})$, which is the same as that of the regular ERM. In addition, and more importantly, we show that, for any f-divergence, the upper bound of the estimation error of the divergence is $O(\sqrt{1/n})$. This indicates that our fairness-aware learning algorithm guarantees low dependencies on unseen samples for any dependency measure represented by an f-divergence.


Role of normalization in spectral clustering for stochastic blockmodels

arXiv.org Machine Learning

Spectral clustering is a technique that clusters elements using the top few eigenvectors of their (possibly normalized) similarity matrix. The quality of spectral clustering is closely tied to the convergence properties of these principal eigenvectors. This rate of convergence has been shown to be identical for both the normalized and unnormalized variants in recent random matrix theory literature. However, normalization for spectral clustering is commonly believed to be beneficial [Stat. Comput. 17 (2007) 395-416]. Indeed, our experiments show that normalization improves prediction accuracy. In this paper, for the popular stochastic blockmodel, we theoretically show that normalization shrinks the spread of points in a class by a constant fraction under a broad parameter regime. As a byproduct of our work, we also obtain sharp deviation bounds of empirical principal eigenvalues of graphs generated from a stochastic blockmodel.


Optimal model-free prediction from multivariate time series

arXiv.org Machine Learning

Forecasting a time series from multivariate predictors constitutes a challenging problem, especially using model-free approaches. Most techniques, such as nearest-neighbor prediction, quickly suffer from the curse of dimensionality and overfitting for more than a few predictors which has limited their application mostly to the univariate case. Therefore, selection strategies are needed that harness the available information as efficiently as possible. Since often the right combination of predictors matters, ideally all subsets of possible predictors should be tested for their predictive power, but the exponentially growing number of combinations makes such an approach computationally prohibitive. Here a prediction scheme that overcomes this strong limitation is introduced utilizing a causal pre-selection step which drastically reduces the number of possible predictors to the most predictive set of causal drivers making a globally optimal search scheme tractable. The information-theoretic optimality is derived and practical selection criteria are discussed. As demonstrated for multivariate nonlinear stochastic delay processes, the optimal scheme can even be less computationally expensive than commonly used sub-optimal schemes like forward selection. The method suggests a general framework to apply the optimal model-free approach to select variables and subsequently fit a model to further improve a prediction or learn statistical dependencies. The performance of this framework is illustrated on a climatological index of El Ni\~no Southern Oscillation.


A tree augmented naive Bayesian network experiment for breast cancer prediction

arXiv.org Machine Learning

In order to investigate the breast cancer prediction problem on the aging population with the grades of DCIS, we conduct a tree augmented naive Bayesian network experiment trained and tested on a large clinical dataset including consecutive diagnostic mammography examinations, consequent biopsy outcomes and related cancer registry records in the population of women across all ages. Our tasks are to classify the conventional "Benign vs. Malignant" and the new "Benign/LG vs. IntG/HG/Invasive" based on mammography examination features and patient demographic information, specifically to predict the probability of malignancy, for the biopsy threshold setting and the biopsy decision making. The aggregated results of our tenfold cross validation method recommend a biopsy threshold higher than 2% for the aging population. The Receiver Operating Characteristic curves and the Precision-Recall curves by aggregating the tenfold cross validation results are interesting.


LOCO: Distributing Ridge Regression with Random Projections

arXiv.org Machine Learning

We propose Loco, an algorithm for large-scale ridge regression which distributes the features across workers on a cluster. Important dependencies between variables are preserved using structured random projections which are cheap to compute and must only be communicated once. We show that Loco obtains a solution which is close to the exact ridge regression solution in the fixed design setting. We verify this experimentally in a simulation study as well as an application to climate prediction. Furthermore, we show that Loco achieves significant speedups compared with a state-of-the-art distributed algorithm on a large-scale regression problem.


Randomized Structural Sparsity via Constrained Block Subsampling for Improved Sensitivity of Discriminative Voxel Identification

arXiv.org Machine Learning

In this paper, we consider voxel selection for functional Magnetic Resonance Imaging (fMRI) brain data with the aim of finding a more complete set of probably correlated discriminative voxels, thus improving interpretation of the discovered potential biomarkers. The main difficulty in doing this is an extremely high dimensional voxel space and few training samples, resulting in unreliable feature selection. In order to deal with the difficulty, stability selection has received a great deal of attention lately, especially due to its finite sample control of false discoveries and transparent principle for choosing a proper amount of regularization. However, it fails to make explicit use of the correlation property or structural information of these discriminative features and leads to large false negative rates. In other words, many relevant but probably correlated discriminative voxels are missed. Thus, we propose a new variant on stability selection "randomized structural sparsity", which incorporates the idea of structural sparsity. Numerical experiments demonstrate that our method can be superior in controlling for false negatives while also keeping the control of false positives inherited from stability selection.


High-dimensional Ordinary Least-squares Projection for Screening Variables

arXiv.org Machine Learning

Variable selection is a challenging issue in statistical applications when the number of predictors $p$ far exceeds the number of observations $n$. In this ultra-high dimensional setting, the sure independence screening (SIS) procedure was introduced to significantly reduce the dimensionality by preserving the true model with overwhelming probability, before a refined second stage analysis. However, the aforementioned sure screening property strongly relies on the assumption that the important variables in the model have large marginal correlations with the response, which rarely holds in reality. To overcome this, we propose a novel and simple screening technique called the high-dimensional ordinary least-squares projection (HOLP). We show that HOLP possesses the sure screening property and gives consistent variable selection without the strong correlation assumption, and has a low computational complexity. A ridge type HOLP procedure is also discussed. Simulation study shows that HOLP performs competitively compared to many other marginal correlation based methods. An application to a mammalian eye disease data illustrates the attractiveness of HOLP.