CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering

arXiv.org Machine Learning

Feature selection is an important and challenging task in high dimensional clustering. For example, in genomics, there may only be a small number of genes that are differentially expressed, which are informative to the overall clustering structure. Existing feature selection methods, such as Sparse K-means, rarely tackle the problem of accounting features that can only separate a subset of clusters. In genomics, it is highly likely that a gene can only define one subtype against all the other subtypes or distinguish a pair of subtypes but not others. In this paper, we propose a K-means based clustering algorithm that discovers informative features as well as which cluster pairs are separable by each selected features. The method is essentially an EM algorithm, in which we introduce lasso-type constraints on each cluster pair in the M step, and make the E step possible by maximizing the raw cross-cluster distance instead of minimizing the intra-cluster distance. The results were demonstrated on simulated data and a leukemia gene expression dataset.

Exploring the Characterization and Classification of EEG Signals for a Computer-Aided Epilepsy Diagnosis System


Epilepsy occurs when localized electrical activity of neurons suffer from an imbalance. One of the most adequate methods for diagnosing and monitoring is via the analysis of electroencephalographic (EEG) signals. Despite there is a wide range of alternatives to characterize and classify EEG signals for epilepsy analysis purposes, many key aspects related to accuracy and physiological interpretation are still considered as open issues. In this paper, this work performs an exploratory study in order to identify the most adequate frequently-used methods for characterizing and classifying epileptic seizures. In this regard, a comparative study is carried out on several subsets of features using four representative classifiers: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM).

Design of one-year mortality forecast at hospital admission based: a machine learning approach

arXiv.org Machine Learning

Background: Palliative care is referred to a set of programs for patients that suffer life-limiting illnesses. These programs aim to guarantee a minimum level of quality of life (QoL) for the last stage of life. They are currently based on clinical evaluation of risk of one-year mortality. Objectives: The main objective of this work is to develop and validate machine-learning based models to predict the exitus of a patient within the next year using data gathered at hospital admission. Methods: Five machine learning techniques were applied in our study to develop machine-learning predictive models: Support Vector Machines, K-neighbors Classifier, Gradient Boosting Classifier, Random Forest and Multilayer Perceptron. All models were trained and evaluated using the retrospective dataset. The evaluation was performed with five metrics computed by a resampling strategy: Accuracy, the area under the ROC curve, Specificity, Sensitivity, and the Balanced Error Rate. Results: All models for forecasting one-year mortality achieved an AUC ROC from 0.858 to 0.911. Specifically, Gradient Boosting Classifier was the best model, producing an AUC ROC of 0.911 (CI 95%, 0.911 to 0.912), a sensitivity of 0.858 (CI 95%, 0.856 to 0.86) and a specificity of 0.807 (CI 95%, 0.806 to 0808) and a BER of 0.168 (CI 95%, 0.167 to 0.169). Conclusions: The analysis of common information at hospital admission combined with machine learning techniques produced models with competitive discriminative power. Our models reach the best results reported in state of the art. These results demonstrate that they can be used as an accurate data-driven palliative care criteria inclusion.

Sparse Instrumental Variables (SPIV) for Genome-Wide Studies

Neural Information Processing Systems

This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as ``instruments'' to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with $\sim 10^5$ genetic instruments and $\sim 47 \times 10^3$ gene transcripts.

Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks

arXiv.org Machine Learning

Recently, researchers have started applying convolutional neural networks (CNNs) with one-dimensional convolutions to clinical tasks involving time-series data. This is due, in part, to their computational efficiency, relative to recurrent neural networks and their ability to efficiently exploit certain temporal invariances, (e.g., phase invariance). However, it is well-established that clinical data may exhibit many other types of invariances (e.g., scaling). While preprocessing techniques, (e.g., dynamic time warping) may successfully transform and align inputs, their use often requires one to identify the types of invariances in advance. In contrast, we propose the use of Sequence Transformer Networks, an end-to-end trainable architecture that learns to identify and account for invariances in clinical time-series data. Applied to the task of predicting in-hospital mortality, our proposed approach achieves an improvement in the area under the receiver operating characteristic curve (AUROC) relative to a baseline CNN (AUROC=0.851 vs. AUROC=0.838). Our results suggest that a variety of valuable invariances can be learned directly from the data.