AITopics | Statistical Learning

Collaborating Authors

Statistical Learning

News Overviews Instructional Materials AI-Alerts Classics

A Flexible Iterative Framework for Consensus Clustering

arXiv.org Machine LearningAug-5-2014

A novel framework for consensus clustering is presented which has the ability to determine both the number of clusters and a final solution using multiple algorithms. A consensus similarity matrix is formed from an ensemble using multiple algorithms and several values for k. A variety of dimension reduction techniques and clustering algorithms are considered for analysis. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages algorithms to agree upon a common solution. We utilize the theory of nearly uncoupled Markov chains to determine the number, k , of clusters in a dataset by considering a random walk on the graph defined by the consensus matrix. The eigenvalues of the associated transition probability matrix are used to determine the number of clusters. This method succeeds at determining the number of clusters in many datasets where previous methods fail. On every considered dataset, our consensus method provides a final result with accuracy well above the average of the individual algorithms.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1408.0972

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Determining the Number of Clusters via Iterative Consensus Clustering

Race, Shaina, Meyer, Carl, Valakuzhy, Kevin

arXiv.org Machine LearningAug-5-2014

We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values for k. A random walk is induced on the graph defined by the consensus matrix and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages a block-diagonal form. It is shown that the resulting consensus matrix is generally superior to existing similarity matrices for this type of spectral analysis.

artificial intelligence, machine learning, matrix, (15 more...)

arXiv.org Machine Learning

doi: 10.1137/1.9781611972832.11

1408.0967

Country: North America > United States > North Carolina (0.15)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Confidence Sets Based on Thresholding Estimators in High-Dimensional Gaussian Regression Models

Schneider, Ulrike

arXiv.org Machine LearningAug-5-2014

We study confidence intervals based on hard-thresholding, soft-thresholding, and adaptive soft-thresholding in a linear regression model where the number of regressors $k$ may depend on and diverge with sample size $n$. In addition to the case of known error variance, we define and study versions of the estimators when the error variance is unknown. In the known variance case, we provide an exact analysis of the coverage properties of such intervals in finite samples. We show that these intervals are always larger than the standard interval based on the least-squares estimator. Asymptotically, the intervals based on the thresholding estimators are larger even by an order of magnitude when the estimators are tuned to perform consistent variable selection. For the unknown-variance case, we provide non-trivial lower bounds for the coverage probabilities in finite samples and conduct an asymptotic analysis where the results from the known-variance case can be shown to carry over asymptotically if the number of degrees of freedom $n-k$ tends to infinity fast enough in relation to the thresholding parameter.

artificial intelligence, estimator, machine learning, (16 more...)

arXiv.org Machine Learning

1308.3201

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.90)

Add feedback

Multithreshold Entropy Linear Classifier

Czarnecki, Wojciech Marian, Tabor, Jacek

arXiv.org Machine LearningAug-4-2014

Linear classifiers separate the data with a hyperplane. In this paper we focus on the novel method of construction of multithreshold linear classifier, which separates the data with multiple parallel hyperplanes. Proposed model is based on the information theory concepts -- namely Renyi's quadratic entropy and Cauchy-Schwarz divergence. We begin with some general properties, including data scale invariance. Then we prove that our method is a multithreshold large margin classifier, which shows the analogy to the SVM, while in the same time works with much broader class of hypotheses. What is also interesting, proposed method is aimed at the maximization of the balanced quality measure (such as Matthew's Correlation Coefficient) as opposed to very common maximization of the accuracy. This feature comes directly from the optimization problem statement and is further confirmed by the experiments on the UCI datasets. It appears, that our Multithreshold Entropy Linear Classifier (MELC) obtaines similar or higher scores than the ones given by SVM on both synthetic and real data. We show how proposed approach can be benefitial for the cheminformatics in the task of ligands activity prediction, where despite better classification results, MELC gives some additional insight into the data structure (classes of underrepresented chemical compunds).

artificial intelligence, classifier, machine learning, (17 more...)

arXiv.org Machine Learning

doi: 10.1016/j.eswa.2015.03.007

1408.1054

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)

Add feedback

Functional Principal Component Analysis and Randomized Sparse Clustering Algorithm for Medical Image Analysis

Lin, Nan, Jiang, Junhai, Guo, Shicheng, Xiong, Momiao

arXiv.org Artificial IntelligenceAug-1-2014

Due to advances in sensors, growing large and complex medical image data have the ability to visualize the pathological change in the cellular or even the molecular level or anatomical changes in tissues and organs. As a consequence, the medical images have the potential to enhance diagnosis of disease, prediction of clinical outcomes, characterization of disease progression, management of health care and development of treatments, but also pose great methodological and computational challenges for representation and selection of features in image cluster analysis. To address these challenges, we first extend one dimensional functional principal component analysis to the two dimensional functional principle component analyses (2DFPCA) to fully capture space variation of image signals. Image signals contain a large number of redundant and irrelevant features which provide no additional or no useful information for cluster analysis. Widely used methods for removing redundant and irrelevant features are sparse clustering algorithms using a lasso-type penalty to select the features. However, the accuracy of clustering using a lasso-type penalty depends on how to select penalty parameters and a threshold for selecting features. In practice, they are difficult to determine. Recently, randomized algorithms have received a great deal of attention in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image cluster analysis. The proposed method is applied to ovarian and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms the current sparse clustering algorithms in image cluster analysis.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1371/journal.pone.0132945

1408.0204

Country: North America > United States > Texas (0.29)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.90)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Thurstonian Boltzmann Machines: Learning from Multiple Inequalities

Tran, Truyen, Phung, Dinh, Venkatesh, Svetha

arXiv.org Machine LearningJul-31-2014

Restricted Boltzmann machines (RBMs) have proved to be a versatile tool for a wide variety of machine learning tasks and as a building block for deep architectures [12, 24, 28]. The original proposals mainly handle binary visible and hidden units. Whilst binary hidden units are broadly applicable as feature detectors, non-binary visible data requires different designs. Recent extensions to other data types result in type-dependent models: the Gaussian for continuous inputs [12], Beta for bounded continuous inputs [16], Poisson for count data [9], multinomial for unordered categories [25], and ordinal models for ordered categories [37, 35]. The Boltzmann distribution permits several types to be jointly modelled, thus making the RBM a good tool for multimodal and complex social survey analysis. The work of [20, 29, 40] combines continuous (e.g., visual and audio) and discrete modalities (e.g., words). The work of [34] extends the idea further to incorporate ordinal and rank data. However, there are conceptual drawbacks: First, conditioned on the hidden layer, they are still separate type-specific models; second, handling ordered categories and ranks is not natural; and third, specifying direct correlation between these types remains difficult. The main thesis of this paper is that many data types can be captured in one unified model.

artificial intelligence, constraint, machine learning, (17 more...)

arXiv.org Machine Learning

1408.0055

Country:

Asia (1.00)
Europe (0.68)
North America > United States (0.67)
Africa (0.67)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.72)

Add feedback

Learning From Ordered Sets and Applications in Collaborative Ranking

Tran, Truyen, Phung, Dinh, Venkatesh, Svetha

arXiv.org Machine LearningJul-31-2014

Ranking over sets arise when users choose between groups of items. For example, a group may be of those movies deemed $5$ stars to them, or a customized tour package. It turns out, to model this data type properly, we need to investigate the general combinatorics problem of partitioning a set and ordering the subsets. Here we construct a probabilistic log-linear model over a set of ordered subsets. Inference in this combinatorial space is highly challenging: The space size approaches $(N!/2)6.93145^{N+1}$ as $N$ approaches infinity. We propose a \texttt{split-and-merge} Metropolis-Hastings procedure that can explore the state-space efficiently. For discovering hidden aspects in the data, we enrich the model with latent binary variables so that the posteriors can be efficiently evaluated. Finally, we evaluate the proposed model on large-scale collaborative filtering tasks and demonstrate that it is competitive against state-of-the-art methods.

artificial intelligence, machine learning, subset, (16 more...)

arXiv.org Machine Learning

1408.0043

Country: North America > United States (0.68)

Genre: Research Report (0.84)

Industry:

Leisure & Entertainment (0.69)
Media > Film (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.35)

Add feedback

Fast Bayesian Feature Selection for High Dimensional Linear Regression in Genomics via the Ising Approximation

Fisher, Charles K., Mehta, Pankaj

arXiv.org Machine LearningJul-30-2014

Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach -- the Bayesian Ising Approximation (BIA) -- to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high dimensional regression by analyzing a gene expression dataset with nearly 30,000 features.

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

1407.8187

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)

Add feedback

Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases

Yu, Fei, Rybar, Michal, Uhler, Caroline, Fienberg, Stephen E.

arXiv.org Machine LearningJul-30-2014

Following the publication of an attack on genome-wide association studies (GWAS) data proposed by Homer et al., considerable attention has been given to developing methods for releasing GWAS data in a privacy-preserving way. Here, we develop an end-to-end differentially private method for solving regression problems with convex penalty functions and selecting the penalty parameters by cross-validation. In particular, we focus on penalized logistic regression with elastic-net regularization, a method widely used to in GWAS analyses to identify disease-causing genes. We show how a differentially private procedure for penalized logistic regression with elastic-net regularization can be applied to the analysis of GWAS data and evaluate our method's performance.

artificial intelligence, logistic regression, machine learning, (16 more...)

arXiv.org Machine Learning

1407.8067

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
Information Technology > Security & Privacy (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

Targeting Optimal Active Learning via Example Quality

Evans, Lewis P. G., Adams, Niall M., Anagnostopoulos, Christoforos

arXiv.org Machine LearningJul-30-2014

In many classification problems unlabelled data is abundant and a subset can be chosen for labelling. This defines the context of active learning (AL), where methods systematically select that subset, to improve a classifier by retraining. Given a classification problem, and a classifier trained on a small number of labelled examples, consider the selection of a single further example. This example will be labelled by the oracle and then used to retrain the classifier. This example selection raises a central question: given a fully specified stochastic description of the classification problem, which example is the optimal selection? If optimality is defined in terms of loss, this definition directly produces expected loss reduction (ELR), a central quantity whose maximum yields the optimal example selection. This work presents a new theoretical approach to AL, example quality, which defines optimal AL behaviour in terms of ELR. Once optimal AL behaviour is defined mathematically, reasoning about this abstraction provides insights into AL. In a theoretical context the optimal selection is compared to existing AL methods, showing that heuristics can make sub-optimal selections. Algorithms are constructed to estimate example quality directly. A large-scale experimental study shows these algorithms to be competitive with standard AL methods.

artificial intelligence, classifier, machine learning, (14 more...)

arXiv.org Machine Learning

1407.8042

Country:

Europe > United Kingdom (0.28)
North America > United States (0.28)

Genre:

Research Report > New Finding (0.89)
Research Report > Experimental Study (0.89)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback