CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering

arXiv.org Machine Learning

Feature selection is an important and challenging task in high-dimensional clustering. For example, in genomics, there may be only a small number of differentially expressed genes that are informative about the overall clustering structure. Existing feature selection methods, such as Sparse K-means, rarely account for features that can separate only a subset of clusters. In genomics, it is highly likely that a gene can only define one subtype against all the other subtypes or distinguish a pair of subtypes but not others. In this paper, we propose a K-means based clustering algorithm that discovers informative features as well as which cluster pairs are separable by each selected feature. The method is essentially an EM algorithm, in which we introduce lasso-type constraints on each cluster pair in the M step, and make the E step possible by maximizing the raw cross-cluster distance instead of minimizing the intra-cluster distance. The results were demonstrated on simulated data and a leukemia gene expression dataset.
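The idea of a feature that separates only some cluster pairs can be illustrated with a small sketch. This is not the authors' EM algorithm: it assumes cluster labels are already known, scores each feature by the squared gap between the two centroids of a cluster pair, and applies soft-thresholding (the shrinkage operator that lasso-type penalties give rise to). The helper names, the toy data, and the threshold `delta` are all hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pairwise_feature_separation(X, labels):
    """Map each cluster pair (k, l) to the squared centroid gap of every feature."""
    clusters = {}
    for row, lab in zip(X, labels):
        clusters.setdefault(lab, []).append(row)
    cents = {lab: centroid(rows) for lab, rows in clusters.items()}
    return {
        (k, l): [(a - b) ** 2 for a, b in zip(cents[k], cents[l])]
        for k, l in combinations(sorted(cents), 2)
    }


def soft_threshold(values, delta):
    """Shrink every score by delta and clip at zero (lasso-style shrinkage)."""
    return [max(v - delta, 0.0) for v in values]


# Toy data (assumed): feature 0 isolates cluster 2, feature 1 isolates
# cluster 0, and feature 2 carries no cluster information.
X = [
    [0.0, 0.0, 5.0], [0.2, 0.1, 5.1],   # cluster 0
    [0.1, 3.0, 5.0], [0.0, 3.1, 4.9],   # cluster 1
    [4.0, 3.0, 5.0], [4.1, 2.9, 5.1],   # cluster 2
]
labels = [0, 0, 1, 1, 2, 2]

delta = 0.5  # hypothetical sparsity threshold
selected = {}
for pair, gaps in pairwise_feature_separation(X, labels).items():
    weights = soft_threshold(gaps, delta)
    selected[pair] = [j for j, w in enumerate(weights) if w > 0]
    print(pair, "separated by features", selected[pair])
```

In this toy example each cluster pair keeps a different set of features, which is the kind of cluster-specific structure that a single global feature weight per gene cannot express.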

Exploring the Characterization and Classification of EEG Signals for a Computer-Aided Epilepsy Diagnosis System


Epilepsy occurs when the localized electrical activity of neurons suffers from an imbalance. One of the most adequate methods for diagnosing and monitoring it is the analysis of electroencephalographic (EEG) signals. Although there is a wide range of alternatives for characterizing and classifying EEG signals for epilepsy analysis, many key aspects related to accuracy and physiological interpretation are still considered open issues. This work performs an exploratory study to identify the most adequate frequently used methods for characterizing and classifying epileptic seizures. To this end, a comparative study is carried out on several subsets of features using four representative classifiers: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM).
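A comparison of this kind can be sketched with scikit-learn. The abstract does not specify the features, data, or evaluation protocol, so everything below is an assumption: the feature matrix is synthetic (a stand-in for features extracted from EEG segments), the two feature subsets are arbitrary, and 5-fold cross-validated accuracy is used as the score.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an EEG feature matrix (rows = segments, columns = features).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_redundant=0, random_state=0
)

# The four classifier families named in the abstract, with default settings.
classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}


def compare(X, y, feature_idx, cv=5):
    """Mean cross-validated accuracy of each classifier on one feature subset."""
    scores = {}
    for name, clf in classifiers.items():
        model = make_pipeline(StandardScaler(), clf)
        scores[name] = cross_val_score(model, X[:, feature_idx], y, cv=cv).mean()
    return scores


# Hypothetical feature subsets; a real study would group features by type.
feature_subsets = {"first_six": list(range(6)), "all": list(range(12))}
results = {name: compare(X, y, idx) for name, idx in feature_subsets.items()}

for subset, scores in results.items():
    print(subset, {name: round(float(s), 3) for name, s in scores.items()})
```

Scaling is placed inside the pipeline so that it is refit on each training fold, which keeps information from the held-out fold out of the preprocessing step.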

Design of one-year mortality forecast at hospital admission based: a machine learning approach

arXiv.org Machine Learning

Background: Palliative care refers to a set of programs for patients who suffer from life-limiting illnesses. These programs aim to guarantee a minimum level of quality of life (QoL) for the last stage of life. They are currently based on clinical evaluation of the risk of one-year mortality. Objectives: The main objective of this work is to develop and validate machine-learning based models to predict the exitus of a patient within the next year using data gathered at hospital admission. Methods: Five machine learning techniques were applied in our study to develop predictive models: Support Vector Machines, K-neighbors Classifier, Gradient Boosting Classifier, Random Forest and Multilayer Perceptron. All models were trained and evaluated using the retrospective dataset. The evaluation was performed with five metrics computed by a resampling strategy: Accuracy, the area under the ROC curve, Specificity, Sensitivity, and the Balanced Error Rate. Results: All models for forecasting one-year mortality achieved an AUC ROC from 0.858 to 0.911. Specifically, Gradient Boosting Classifier was the best model, producing an AUC ROC of 0.911 (CI 95%, 0.911 to 0.912), a sensitivity of 0.858 (CI 95%, 0.856 to 0.86), a specificity of 0.807 (CI 95%, 0.806 to 0.808) and a BER of 0.168 (CI 95%, 0.167 to 0.169). Conclusions: The analysis of common information at hospital admission combined with machine learning techniques produced models with competitive discriminative power. Our models reach the best results reported in the state of the art. These results demonstrate that they can be used as an accurate data-driven inclusion criterion for palliative care.
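The reported metrics can be cross-checked against each other. Assuming the standard definition of the Balanced Error Rate as the mean of the false negative rate and the false positive rate, BER = 1 - (sensitivity + specificity) / 2. The abstract does not state the definition it uses, so this is an assumption, and the function name below is illustrative.

```python
def balanced_error_rate(sensitivity: float, specificity: float) -> float:
    """Mean of the false negative rate and the false positive rate."""
    false_negative_rate = 1.0 - sensitivity
    false_positive_rate = 1.0 - specificity
    return (false_negative_rate + false_positive_rate) / 2.0


# Point estimates reported for the Gradient Boosting Classifier.
ber = balanced_error_rate(sensitivity=0.858, specificity=0.807)
print(f"BER from rounded inputs: {ber:.4f}")
```

The result is approximately 0.1675, which agrees with the reported BER of 0.168 up to the rounding of the sensitivity and specificity.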

Sparse Instrumental Variables (SPIV) for Genome-Wide Studies

Neural Information Processing Systems

This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as "instruments" to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for effective screening of potentially interesting genotype-phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early-stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied to examine the effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with $\sim 10^5$ genetic instruments and $\sim 47 \times 10^3$ gene transcripts.
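For background, the classical instrumental-variable idea that motivates the framework can be sketched on simulated data. This is the textbook single-instrument ratio estimator, not the SPIV method, and it relies on exactly the restrictive assumptions the abstract says SPIV relaxes (a valid instrument that affects the outcome only through the biomarker). All variable names, effect sizes, and the simulation itself are assumptions made for illustration.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


rng = random.Random(0)
n = 20000
true_effect = 2.0  # assumed causal effect of biomarker x on outcome y

z = [rng.gauss(0, 1) for _ in range(n)]  # instrument (e.g. a genotype score)
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]  # biomarker
y = [true_effect * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Ordinary regression of y on x is biased because u drives both x and y.
ols_estimate = cov(x, y) / cov(x, x)

# The ratio estimator uses only the variation in x that is induced by z.
iv_estimate = cov(z, y) / cov(z, x)

print(f"true effect: {true_effect}")
print(f"OLS estimate (confounded): {ols_estimate:.2f}")
print(f"IV estimate: {iv_estimate:.2f}")
```

In this simulation the regression slope is pulled away from the true effect by the confounder, while the instrument-based ratio stays close to it, which is why genotypes are attractive as instruments when confounders cannot be measured.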


AAAI Conferences

Patients with diabetes must continually monitor their blood glucose levels and adjust insulin doses, striving to keep blood glucose levels as close to normal as possible. Blood glucose levels that deviate from the normal range can lead to serious short-term and long-term complications. An automatic prediction model that warned people of imminent changes in their blood glucose levels would enable them to take preventive action. In this paper, we describe a solution that uses a generic physiological model of blood glucose dynamics to generate informative features for a Support Vector Regression model that is trained on patient-specific data. The new model outperforms diabetes experts at predicting blood glucose levels and could be used to anticipate almost a quarter of hypoglycemic events 30 minutes in advance. Although the corresponding precision is currently just 42%, most false alarms are in near-hypoglycemic regions, and therefore patients responding to these hypoglycemia alerts would not be harmed by intervention.