AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Diabetes subtypes classification for personalized health care: A review - Artificial Intelligence Review

#artificialintelligenceAug-5-2022, 12:15:19 GMT

Healthcare is evolving from standard to personalized, driven by the patients' needs. Personalized healthcare is a medical model based on genetics, genomics, and other biological information that helps to predict risk for disease. To date, machine learning and data mining are the fastest-growing healthcare field used to classify patient cohorts from a large dataset and its application for diabetes subtyping will be a breakthrough. In this review paper, we have identified, analyzed, and summarized how previous studies distinguished diabetes into subtypes besides implementing the methods for diabetes subtyping using data mining and various clustering algorithms. We have discovered that many studies have suggested diabetes can be differentiated into subtypes clinically based on the risk complications, genetically defined, using clinical features, and for treatment selection.

artificial intelligence review, diabetes, diabetes subtype classification, (3 more...)

#artificialintelligence

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.75)

Add feedback

Chronological Self-Training for Real-Time Speaker Diarization

Padfield, Dirk, Liebling, Daniel J.

arXiv.org Artificial IntelligenceAug-5-2022

Diarization partitions an audio stream into segments based on the voices of the speakers. Real-time diarization systems that include an enrollment step should limit enrollment training samples to reduce user interaction time. Although training on a small number of samples yields poor performance, we show that the accuracy can be improved dramatically using a chronological self-training approach. We studied the tradeoff between training time and classification performance and found that 1 second is sufficient to reach over 95% accuracy. We evaluated on 700 audio conversation files of about 10 minutes each from 6 different languages and demonstrated average diarization error rates as low as 10%.

accuracy, algorithm, diarization, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2021-822

2208.03393

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.50)

Industry:

Media (0.48)
Leisure & Entertainment (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Add feedback

Improving Fuzzy-Logic based Map-Matching Method with Trajectory Stay-Point Detection

Jafarlou, Minoo, E., Omid Mahdi Ebadati, Naderi, Hassan

arXiv.org Artificial IntelligenceAug-4-2022

The requirement to trace and process moving objects in the contemporary era gradually increases since numerous applications quickly demand precise moving object locations. The Map-matching method is employed as a preprocessing technique, which matches a moving object point on a corresponding road. However, most of the GPS trajectory datasets include stay-points irregularity, which makes map-matching algorithms mismatch trajectories to irrelevant streets. Therefore, determining the stay-point region in GPS trajectory datasets results in better accurate matching and more rapid approaches. In this work, we cluster stay-points in a trajectory dataset with DBSCAN and eliminate redundant data to improve the efficiency of the map-matching algorithm by lowering processing time. We reckoned our proposed method's performance and exactness with a ground truth dataset compared to a fuzzy-logic based map-matching algorithm. Fortunately, our approach yields 27.39% data size reduction and 8.9% processing time reduction with the same accurate results as the previous fuzzy-logic based map-matching approach.

algorithm, map-matching algorithm, trajectory, (16 more...)

arXiv.org Artificial Intelligence

2208.02881

Country: Europe > United Kingdom > England (0.04)

Genre:

Workflow (0.46)
Research Report (0.40)

Industry:

Transportation > Infrastructure & Services (0.49)
Transportation > Ground > Road (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

An Introduction to Graph Partitioning Algorithms and Community Detection

#artificialintelligenceAug-3-2022, 05:50:28 GMT

Graph partitioning has been a long-lasting problem and has a wide range of applications. This post shares the methodology for graph partitioning with both theoretical explanations and practical implementations of some popular graph partitioning algorithms with python codes. "Clustering" can be confusing under different contexts. In this article, clustering means node clustering, i.e. partitioning the graphs into clusters (or communities). We use graph partitioning, (node) clustering, and community detection interchangeably. In other words, we do not consider overlapping communities anywhere in this article.

algorithm, graph, node, (12 more...)

#artificialintelligence

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

Are Cluster Validity Measures (In)valid?

Gagolewski, Marek, Bartoszuk, Maciej, Cena, Anna

arXiv.org Artificial IntelligenceAug-2-2022

Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regards to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.

artificial intelligence, evolutionary algorithm, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.ins.2021.10.004

2208.01261

Country:

Europe > Poland > Masovia Province > Warsaw (0.04)
Oceania > Australia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Instructional Material (0.87)
Research Report > New Finding (0.67)
Research Report > Experimental Study (0.67)

Industry: Education (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Add feedback

Enabling scalable clinical interpretation of ML-based phenotypes using real world data

Parsons, Owen, Barlow, Nathan E, Baxter, Janie, Paraschin, Karen, Derix, Andrea, Hein, Peter, Dürichen, Robert

arXiv.org Artificial IntelligenceAug-2-2022

The availability of large and deep electronic healthcare records (EHR) datasets has the potential to enable a better understanding of real-world patient journeys, and to identify novel subgroups of patients. ML-based aggregation of EHR data is mostly tool-driven, i.e., building on available or newly developed methods. However, these methods, their input requirements, and, importantly, resulting output are frequently difficult to interpret, especially without in-depth data science or statistical training. This endangers the final step of analysis where an actionable and clinically meaningful interpretation is needed.This study investigates approaches to perform patient stratification analysis at scale using large EHR datasets and multiple clustering methods for clinical research. We have developed several tools to facilitate the clinical evaluation and interpretation of unsupervised patient stratification results, namely pattern screening, meta clustering, surrogate modeling, and curation. These tools can be used at different stages within the analysis. As compared to a standard analysis approach, we demonstrate the ability to condense results and optimize analysis time. In the case of meta clustering, we demonstrate that the number of patient clusters can be reduced from 72 to 3 in one example. In another stratification result, by using surrogate models, we could quickly identify that heart failure patients were stratified if blood sodium measurements were available. As this is a routine measurement performed for all patients with heart failure, this indicated a data bias. By using further cohort and feature curation, these patients and other irrelevant features could be removed to increase the clinical meaningfulness. These examples show the effectiveness of the proposed methods and we hope to encourage further research in this field.

artificial intelligence, machine learning, surrogate model, (19 more...)

arXiv.org Artificial Intelligence

2208.01607

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Maximal Independent Vertex Set applied to Graph Pooling

Stanovic, Stevan, Gaüzère, Benoit, Brun, Luc

arXiv.org Artificial IntelligenceAug-2-2022

Convolutional neural networks (CNN) have enabled major advances in image classification through convolution and pooling. In particular, image pooling transforms a connected discrete grid into a reduced grid with the same connectivity and allows reduction functions to take into account all the pixels of an image. However, a pooling satisfying such properties does not exist for graphs. Indeed, some methods are based on a vertex selection step which induces an important loss of information. Other methods learn a fuzzy clustering of vertex sets which induces almost complete reduced graphs. We propose to overcome both problems using a new pooling method, named MIVSPool. This method is based on a selection of vertices called surviving vertices using a Maximal Independent Vertex Set (MIVS) and an assignment of the remaining vertices to the survivors. Consequently, our method does not discard any vertex information nor artificially increase the density of the graph. Experimental results show an increase in accuracy for graph classification on various standard datasets.

graph, maximal independent vertex set, vertex, (15 more...)

arXiv.org Artificial Intelligence

2208.01648

Country: Europe > France > Normandy > Seine-Maritime > Rouen (0.05)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

Silva, Marília Costa Rosendo, Siqueira, Felipe Alves, Tarrega, João Pedro Mantovani, Beinotti, João Vitor Pataca, Nunes, Augusto Sousa, Gardini, Miguel de Mattos, da Silva, Vinícius Adolfo Pereira, da Silva, Nádia Félix Felipe, de Carvalho, André Carlos Ponce de Leon Ferreira

arXiv.org Artificial IntelligenceAug-2-2022

Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.

algorithm, computational linguistic, reproducibility and distortion issue, (11 more...)

arXiv.org Artificial Intelligence

2208.01712

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(36 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.47)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(3 more...)

Add feedback

A Tighter Analysis of Spectral Clustering, and Beyond

Macgregor, Peter, Sun, He

arXiv.org Artificial IntelligenceAug-2-2022

This work studies the classical spectral clustering algorithm which embeds the vertices of some graph $G=(V_G, E_G)$ into $\mathbb{R}^k$ using $k$ eigenvectors of some matrix of $G$, and applies $k$-means to partition $V_G$ into $k$ clusters. Our first result is a tighter analysis on the performance of spectral clustering, and explains why it works under some much weaker condition than the ones studied in the literature. For the second result, we show that, by applying fewer than $k$ eigenvectors to construct the embedding, spectral clustering is able to produce better output for many practical instances; this result is the first of its kind in spectral clustering. Besides its conceptual and theoretical significance, the practical impact of our work is demonstrated by the empirical analysis on both synthetic and real-world datasets, in which spectral clustering produces comparable or better results with fewer than $k$ eigenvectors.

eigenvector, graph, spectral, (13 more...)

arXiv.org Artificial Intelligence

2208.01724

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Cluster Weighted Model Based on TSNE algorithm for High-Dimensional Data

Olobatuyi, Kehinde

arXiv.org Artificial IntelligenceAug-2-2022

Similar to many Machine Learning models, both accuracy and speed of the Cluster weighted models (CWMs) can be hampered by high-dimensional data, leading to previous works on a parsimonious technique to reduce the effect of "Curse of dimensionality" on mixture models. In this work, we review the background study of the cluster weighted models (CWMs). We further show that parsimonious technique is not sufficient for mixture models to thrive in the presence of huge high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of location parameters using the default values in the "FlexCWM" R package. We introduce a dimensionality reduction technique called T-distributed stochastic neighbor embedding (TSNE) to enhance the parsimonious CWMs in high-dimensional space. Originally, CWMs are suited for regression but for classification purposes, all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via expectation maximization algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.

cwm, high-dimensional data, parsimonious model, (15 more...)

arXiv.org Artificial Intelligence

2208.01579

Country:

Oceania > Australia > Tasmania > Hobart (0.04)
North America > United States > New York (0.04)
Indian Ocean > Bass Strait (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback