Clustering
Artificial Intelligence for Diabetes Case Management: The Intersection of Physical and Mental Health
The work - and the opinions expressed herein - are those of the author, and do not necessarily reflect the views of CVS Health, Indiana University, or their affiliates, nor those of any of the author's previous affiliations or collaborators. Abstract Objective: Diabetes is a major public health problem in the United States, affecting roughly 30 million people. Diabetes complications, along with the mental health comorbidities that often cooccur with them, are major drivers of high healthcare costs, poor outcomes, and reduced treatment adherence in diabetes. Here, we evaluate in a large statewide population whether we can use artificial intelligence (AI) techniques to identify clusters of patient trajectories within the broader diabetes population in order to create cost-effective, narrowly-focused case management intervention strategies to reduce development of complications. Methods: This approach combined data from: 1) claims, 2) case management notes, and 3) social determinants of health from 300,000 real patients between 2014 and 2016. We categorized complications as five types: Cardiovascular, Neuropathy, Opthalmic, Renal, and Other. Modeling was performed combining a variety of machine learning algorithms, including supervised classification, unsupervised clustering, natural language processing of unstructured care notes, and feature engineering. Results: The results showed that we can predict development of diabetes complications roughly 83.5% of the time using claims data or social determinants of health data. They also showed we can reveal meaningful clusters in the patient population related to complications and mental health that can be used to design a cost-effective screening program, reducing the number of patients to be screened down by 85%. Conclusion: This study outlines creation of an AI framework to develop protocols to better address mental health comorbidities that lead to complications development in the diabetes population.
Anomaly Detection with K-Means Clustering
To explore anomaly detection, we'll be using an EKG data set from PhysioNet, which is essentially the squishy version of the data we'll be getting from servers. Since this data has a very regular waveform, it provides a good vehicle for us to explore the algorithms without getting bogged down in the complications that come with real-world data. The data is supplied in the a02.dat file. A Python module ekg_data.py is provided to read the data. Let's get started by importing the EKG data module and examining what the data looks like:
Building a language evolution tree based on word vector combination model
Gao, Zhu, Jiang, Yanhui, Gao, Junhui
In this paper, we try to explore the evolution of language through case calculations. First, we chose the novels of eleven British writers from 1400 to 2005 and found the corresponding works; Then, we use the natural language processing tool to construct the corresponding eleven corpora, and calculate the respective word vectors of 100 high-frequency words in eleven corpora; Next, for each corpus, we concatenate the 100 word vectors from beginning to end into one; Finally, we use the similarity comparison and hierarchical clustering method to generate the relationship tree between the combined eleven word vectors. This tree represents the relationship between eleven corpora. We found that in the tree generated by clustering, the distance between the corpus and the year corresponding to the corpus are basically the same. This means that we have discovered a specific language evolution tree. To verify the stability and versatility of this method, we add three other themes: Dickens's eight works, the 19th century poets' works, and art criticism of recent 60 years. For these four themes, we tested different parameters such as the time span of the corpus, the time interval between the corpora, the dimension of the word vector, and the number of high-frequency public words. The results show that this is fairly stable and versatile.
Learning Finer-class Networks for Universal Representations
Girard, Julien, Tamaazousti, Youssef, Borgne, Hervé Le, Hudelot, Céline
Many real-world visual recognition use-cases can not directly benefit from state-of-the-art CNN-based approaches because of the lack of many annotated data. The usual approach to deal with this is to transfer a representation pre-learned on a large annotated source-task onto a target-task of interest. This raises the question of how well the original representation is "universal", that is to say directly adapted to many different target-tasks. To improve such universality, the state-of-the-art consists in training networks on a diversified source problem, that is modified either by adding generic or specific categories to the initial set of categories. In this vein, we proposed a method that exploits finer-classes than the most specific ones existing, for which no annotation is available. We rely on unsupervised learning and a bottom-up split and merge strategy. We show that our method learns more universal representations than state-of-the-art, leading to significantly better results on 10 target-tasks from multiple domains, using several network architectures, either alone or combined with networks learned at a coarser semantic level.
Real-time Clustering Algorithm Based on Predefined Level-of-Similarity
Lamsal, Rabindra, Katiyar, Shubham
This paper proposes a centroid-based clustering algorithm which is capable of clustering data-points with n-features in real-time, without having to specify the number of clusters to be formed. We present the core logic behind the algorithm, a similarity measure, which collectively decides whether to assign an incoming data-point to a preexisting cluster, or create a new cluster & assign the data-point to it. The implementation of the proposed algorithm clearly demonstrates how efficiently a data-point with high dimensionality of features is assigned to an appropriate cluster with minimal operations. The proposed algorithm is very application specific and is applicable when the need is perform clustering analysis of real-time data-points, where the similarity measure between an incoming datapoint and the cluster to which the data-point is to be associated with, is greater than predefined Level-of-Similarity. Keywords: Unsupervised learning, Clustering, Centroid, Real-time, Level-of-Similarity
Disambiguating Music Artists at Scale with Audio Metric Learning
Royo-Letelier, Jimena, Hennequin, Romain, Tran, Viet-Anh, Moussallam, Manuel
ABSTRACT We address the problem of disambiguating large scale catalogs through the definition of an unknown artist clustering task. We explore the use of metric learning techniques to learn artist embeddings directly from audio, and using a dedicated homonym artists dataset, we compare our method with a recent approach that learn similar embeddings using artist classifiers. While both systems have the ability to disambiguate unknown artists relying exclusively on audio, we show that our system is more suitable in the case when enough audio data is available for each artist in the train dataset. We also propose a new negative sampling method for metric learning that takes advantage of side information such as music genre during the learning phase and shows promising results for the artist clustering task. 1. INTRODUCTION 1.1 Motivation With contemporary online music catalogs typically proposing dozens of millions of recordings, a major problem is the lack of an universal and reliable mean to identify music artists. Contrarily to albums' and tracks' ISRC As a direct consequence, the name of an artist remains its defacto identifier in practice although it results in common ambiguity issues. For example, name artist collisions (e.g. Bill Evans is the name of a jazz pianist but also the name of a jazz saxophonist and the name of a blackgrass banjo player) or artist aliases (e.g. Youssou N'Dour vs. Youssou Ndour, Simon & Garfunkel vs Paul Simon and Art Garfunkel, Cat Stevens vs Yusuf Islam) are usual.
Interpreting Layered Neural Networks via Hierarchical Modular Representation
Interpreting the prediction mechanism of complex models is currently one of the most important tasks in the machine learning field, especially with layered neural networks, which have achieved high predictive performance with various practical data sets. To reveal the global structure of a trained neural network in an interpretable way, a series of clustering methods have been proposed, which decompose the units into clusters according to the similarity of their inference roles. The main problems in these studies were that (1) we have no prior knowledge about the optimal resolution for the decomposition, or the appropriate number of clusters, and (2) there was no method with which to acquire knowledge about whether the outputs of each cluster have a positive or negative correlation with the input and output dimension values. In this paper, to solve these problems, we propose a method for obtaining a hierarchical modular representation of a layered neural network. The application of a hierarchical clustering method to a trained network reveals a tree-structured relationship among hidden layer units, based on their feature vectors defined by their correlation with the input and output dimension values.
Hierarchical community detection by recursive bi-partitioning
Li, Tianxi, Bhattacharyya, Sharmodeep, Sarkar, Purnamrita, Bickel, Peter J., Levina, Elizaveta
The problem of community detection in networks is usually formulated as finding a single partition of the network into some "correct" number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive bi-partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. Such an algorithm is model-free, extremely fast, and requires no tuning other than selecting a stopping rule. We show that there are regimes where it outperforms $K$-way spectral clustering, and propose a natural model for analyzing the algorithm's theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We also apply the algorithm to a dataset of statistics papers to construct a hierarchical tree of statistical research communities.
Accelerated Training of Large-Scale Gaussian Mixtures by a Merger of Sublinear Approaches
Hirschberger, Florian, Forster, Dennis, Lücke, Jörg
We combine two recent lines of research on sublinear clustering to significantly increase the efficiency of training Gaussian mixture models (GMMs) on large scale problems. First, we use a novel truncated variational EM approach for GMMs with isotropic Gaussians in order to increase clustering efficiency for large $C$ (many clusters). Second, we use recent coreset approaches to increase clustering efficiency for large $N$ (many data points). In order to derive a novel accelerated algorithm, we first show analytically how variational EM and coreset objectives can be merged to give rise to a new, combined clustering objective. Each iteration of the novel algorithm derived from this merged objective is then shown to have a run-time cost of $\mathcal{O}(N' G^2 D)$ per iteration, where $N'
Vector Quantized Spectral Clustering applied to Soybean Whole Genome Sequences
Shastri, Aditya A., Ahuja, Kapil, Ratnaparkhe, Milind B., Shah, Aditya, Gagrani, Aishwary, Lal, Anant
We develop a Vector Quantized Spectral Clustering (VQSC) algorithm that is a combination of Spectral Clustering (SC) and Vector Quantization (VQ) sampling for grouping Soybean genomes. The inspiration here is to use SC for its accuracy and VQ to make the algorithm computationally cheap (the complexity of SC is cubic in-terms of the input size). Although the combination of SC and VQ is not new, the novelty of our work is in developing the crucial similarity matrix in SC as well as use of k-medoids in VQ, both adapted for the Soybean genome data. We compare our approach with commonly used techniques like UPGMA (Un-weighted Pair Graph Method with Arithmetic Mean) and NJ (Neighbour Joining). Experimental results show that our approach outperforms both these techniques significantly in terms of cluster quality (up to 25% better cluster quality) and time complexity (order of magnitude faster).