Goto

Collaborating Authors

 Statistical Learning


A Flexible Iterative Framework for Consensus Clustering

arXiv.org Machine Learning

A novel framework for consensus clustering is presented which has the ability to determine both the number of clusters and a final solution using multiple algorithms. A consensus similarity matrix is formed from an ensemble using multiple algorithms and several values for k. A variety of dimension reduction techniques and clustering algorithms are considered for analysis. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages algorithms to agree upon a common solution. We utilize the theory of nearly uncoupled Markov chains to determine the number, k , of clusters in a dataset by considering a random walk on the graph defined by the consensus matrix. The eigenvalues of the associated transition probability matrix are used to determine the number of clusters. This method succeeds at determining the number of clusters in many datasets where previous methods fail. On every considered dataset, our consensus method provides a final result with accuracy well above the average of the individual algorithms.


Determining the Number of Clusters via Iterative Consensus Clustering

arXiv.org Machine Learning

We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values for k. A random walk is induced on the graph defined by the consensus matrix and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages a block-diagonal form. It is shown that the resulting consensus matrix is generally superior to existing similarity matrices for this type of spectral analysis.


Confidence Sets Based on Thresholding Estimators in High-Dimensional Gaussian Regression Models

arXiv.org Machine Learning

We study confidence intervals based on hard-thresholding, soft-thresholding, and adaptive soft-thresholding in a linear regression model where the number of regressors $k$ may depend on and diverge with sample size $n$. In addition to the case of known error variance, we define and study versions of the estimators when the error variance is unknown. In the known variance case, we provide an exact analysis of the coverage properties of such intervals in finite samples. We show that these intervals are always larger than the standard interval based on the least-squares estimator. Asymptotically, the intervals based on the thresholding estimators are larger even by an order of magnitude when the estimators are tuned to perform consistent variable selection. For the unknown-variance case, we provide non-trivial lower bounds for the coverage probabilities in finite samples and conduct an asymptotic analysis where the results from the known-variance case can be shown to carry over asymptotically if the number of degrees of freedom $n-k$ tends to infinity fast enough in relation to the thresholding parameter.


Multithreshold Entropy Linear Classifier

arXiv.org Machine Learning

Linear classifiers separate the data with a hyperplane. In this paper we focus on the novel method of construction of multithreshold linear classifier, which separates the data with multiple parallel hyperplanes. Proposed model is based on the information theory concepts -- namely Renyi's quadratic entropy and Cauchy-Schwarz divergence. We begin with some general properties, including data scale invariance. Then we prove that our method is a multithreshold large margin classifier, which shows the analogy to the SVM, while in the same time works with much broader class of hypotheses. What is also interesting, proposed method is aimed at the maximization of the balanced quality measure (such as Matthew's Correlation Coefficient) as opposed to very common maximization of the accuracy. This feature comes directly from the optimization problem statement and is further confirmed by the experiments on the UCI datasets. It appears, that our Multithreshold Entropy Linear Classifier (MELC) obtaines similar or higher scores than the ones given by SVM on both synthetic and real data. We show how proposed approach can be benefitial for the cheminformatics in the task of ligands activity prediction, where despite better classification results, MELC gives some additional insight into the data structure (classes of underrepresented chemical compunds).


Functional Principal Component Analysis and Randomized Sparse Clustering Algorithm for Medical Image Analysis

arXiv.org Artificial Intelligence

Due to advances in sensors, growing large and complex medical image data have the ability to visualize the pathological change in the cellular or even the molecular level or anatomical changes in tissues and organs. As a consequence, the medical images have the potential to enhance diagnosis of disease, prediction of clinical outcomes, characterization of disease progression, management of health care and development of treatments, but also pose great methodological and computational challenges for representation and selection of features in image cluster analysis. To address these challenges, we first extend one dimensional functional principal component analysis to the two dimensional functional principle component analyses (2DFPCA) to fully capture space variation of image signals. Image signals contain a large number of redundant and irrelevant features which provide no additional or no useful information for cluster analysis. Widely used methods for removing redundant and irrelevant features are sparse clustering algorithms using a lasso-type penalty to select the features. However, the accuracy of clustering using a lasso-type penalty depends on how to select penalty parameters and a threshold for selecting features. In practice, they are difficult to determine. Recently, randomized algorithms have received a great deal of attention in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image cluster analysis. The proposed method is applied to ovarian and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms the current sparse clustering algorithms in image cluster analysis.


Thurstonian Boltzmann Machines: Learning from Multiple Inequalities

arXiv.org Machine Learning

Restricted Boltzmann machines (RBMs) have proved to be a versatile tool for a wide variety of machine learning tasks and as a building block for deep architectures [12, 24, 28]. The original proposals mainly handle binary visible and hidden units. Whilst binary hidden units are broadly applicable as feature detectors, non-binary visible data requires different designs. Recent extensions to other data types result in type-dependent models: the Gaussian for continuous inputs [12], Beta for bounded continuous inputs [16], Poisson for count data [9], multinomial for unordered categories [25], and ordinal models for ordered categories [37, 35]. The Boltzmann distribution permits several types to be jointly modelled, thus making the RBM a good tool for multimodal and complex social survey analysis. The work of [20, 29, 40] combines continuous (e.g., visual and audio) and discrete modalities (e.g., words). The work of [34] extends the idea further to incorporate ordinal and rank data. However, there are conceptual drawbacks: First, conditioned on the hidden layer, they are still separate type-specific models; second, handling ordered categories and ranks is not natural; and third, specifying direct correlation between these types remains difficult. The main thesis of this paper is that many data types can be captured in one unified model.


Learning From Ordered Sets and Applications in Collaborative Ranking

arXiv.org Machine Learning

Ranking over sets arise when users choose between groups of items. For example, a group may be of those movies deemed $5$ stars to them, or a customized tour package. It turns out, to model this data type properly, we need to investigate the general combinatorics problem of partitioning a set and ordering the subsets. Here we construct a probabilistic log-linear model over a set of ordered subsets. Inference in this combinatorial space is highly challenging: The space size approaches $(N!/2)6.93145^{N+1}$ as $N$ approaches infinity. We propose a \texttt{split-and-merge} Metropolis-Hastings procedure that can explore the state-space efficiently. For discovering hidden aspects in the data, we enrich the model with latent binary variables so that the posteriors can be efficiently evaluated. Finally, we evaluate the proposed model on large-scale collaborative filtering tasks and demonstrate that it is competitive against state-of-the-art methods.


Fast Bayesian Feature Selection for High Dimensional Linear Regression in Genomics via the Ising Approximation

arXiv.org Machine Learning

Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach -- the Bayesian Ising Approximation (BIA) -- to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high dimensional regression by analyzing a gene expression dataset with nearly 30,000 features.


Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases

arXiv.org Machine Learning

Following the publication of an attack on genome-wide association studies (GWAS) data proposed by Homer et al., considerable attention has been given to developing methods for releasing GWAS data in a privacy-preserving way. Here, we develop an end-to-end differentially private method for solving regression problems with convex penalty functions and selecting the penalty parameters by cross-validation. In particular, we focus on penalized logistic regression with elastic-net regularization, a method widely used to in GWAS analyses to identify disease-causing genes. We show how a differentially private procedure for penalized logistic regression with elastic-net regularization can be applied to the analysis of GWAS data and evaluate our method's performance.


Targeting Optimal Active Learning via Example Quality

arXiv.org Machine Learning

In many classification problems unlabelled data is abundant and a subset can be chosen for labelling. This defines the context of active learning (AL), where methods systematically select that subset, to improve a classifier by retraining. Given a classification problem, and a classifier trained on a small number of labelled examples, consider the selection of a single further example. This example will be labelled by the oracle and then used to retrain the classifier. This example selection raises a central question: given a fully specified stochastic description of the classification problem, which example is the optimal selection? If optimality is defined in terms of loss, this definition directly produces expected loss reduction (ELR), a central quantity whose maximum yields the optimal example selection. This work presents a new theoretical approach to AL, example quality, which defines optimal AL behaviour in terms of ELR. Once optimal AL behaviour is defined mathematically, reasoning about this abstraction provides insights into AL. In a theoretical context the optimal selection is compared to existing AL methods, showing that heuristics can make sub-optimal selections. Algorithms are constructed to estimate example quality directly. A large-scale experimental study shows these algorithms to be competitive with standard AL methods.