AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Noisy $\ell^{0}$-Sparse Subspace Clustering on Dimensionality Reduced Data

Yang, Yingzhen, Li, Ping

arXiv.org Machine Learning Jun-22-2022

Sparse subspace clustering methods with sparsity induced by the $\ell^{0}$-norm, such as $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC) (Yang et al., 2016), have been shown to be more effective than their $\ell^{1}$ counterparts such as Sparse Subspace Clustering (SSC) (Elhamifar and Vidal, 2013). However, the theoretical analysis of $\ell^{0}$-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and may only lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy $\ell^{0}$-SSC achieves the subspace detection property (SDP), a key element with which data from different subspaces are separated, under deterministic and semi-random models. Our results provide, for the first time, a theoretical guarantee on the correctness of noisy $\ell^{0}$-SSC in terms of SDP on noisy data, which reveals the advantage of noisy $\ell^{0}$-SSC in terms of a much less restrictive condition on subspace affinity. In order to improve the efficiency of noisy $\ell^{0}$-SSC, we propose Noisy-DR-$\ell^{0}$-SSC, which provably recovers the subspaces on dimensionality-reduced data. Noisy-DR-$\ell^{0}$-SSC first projects the data onto a lower-dimensional space by random projection, then performs noisy $\ell^{0}$-SSC on the projected data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-$\ell^{0}$-SSC.
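
The first stage described above, projecting the data to a lower dimension before clustering, can be illustrated with a minimal sketch. The abstract does not say which kind of random projection is used, so the Gaussian matrix, the `1/sqrt(target_dim)` scaling, and the function name `random_projection` are assumptions made for illustration; the noisy $\ell^{0}$-SSC step itself is not shown.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Project each row of `points` (n x d) onto `target_dim` dimensions.

    Assumption: a dense Gaussian projection matrix, scaled so that squared
    lengths are preserved in expectation. The paper may use another scheme.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    # One random d-dimensional direction per output coordinate.
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]


# Hypothetical data: 6 points in 50 dimensions, reduced to 10 dimensions.
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(6)]
projected = random_projection(data, target_dim=10)
print(len(projected), len(projected[0]))
```

Any subspace clustering routine would then run on `projected` instead of `data`, which is where the efficiency gain claimed in the abstract would come from.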

artificial intelligence, data mining, machine learning, (17 more...)

2206.11079

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
(22 more...)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Topological data analysis of truncated contagion maps

Klimm, Florian

arXiv.org Artificial Intelligence Jun-20-2022

The investigation of dynamical processes on networks has been one focus for the study of contagion processes. It has been demonstrated that contagions can be used to obtain information about the embedding of nodes in a Euclidean space. Specifically, one can use the activation times of threshold contagions to construct contagion maps as a manifold-learning approach. One drawback of contagion maps is their high computational cost. Here, we demonstrate that a truncation of the threshold contagions may considerably speed up the construction of contagion maps. Finally, we show that contagion maps may be used to find an insightful low-dimensional embedding for single-cell RNA-sequencing data in the form of cell-similarity networks and so reveal biological manifolds. Overall, our work makes the use of contagion maps as manifold-learning approaches on empirical network data more viable.
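
To make the idea of activation times and truncation concrete, here is a minimal sketch. It assumes a Watts-style threshold rule (a node activates once the fraction of its active neighbours exceeds `threshold`) and uses a hypothetical `max_steps` argument to stand in for truncation; the paper's exact seeding and truncation scheme may differ.

```python
def activation_times(adjacency, seeds, threshold, max_steps=None):
    """Run a synchronous threshold contagion on an undirected network.

    adjacency: dict mapping each node to a list of its neighbours.
    Returns a dict node -> activation step (None if the node never activates
    within `max_steps` steps).
    """
    times = {node: None for node in adjacency}
    for s in seeds:
        times[s] = 0
    step = 0
    while max_steps is None or step < max_steps:
        step += 1
        active = {n for n, t in times.items() if t is not None}
        # Nodes whose active-neighbour fraction exceeds the threshold.
        newly = [
            n for n, nbrs in adjacency.items()
            if times[n] is None and nbrs
            and sum(1 for m in nbrs if m in active) / len(nbrs) > threshold
        ]
        if not newly:
            break
        for n in newly:
            times[n] = step
    return times


# Hypothetical example: a ring of 8 nodes, contagion seeded at nodes 0 and 1.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
full = activation_times(ring, seeds=[0, 1], threshold=0.4)
truncated = activation_times(ring, seeds=[0, 1], threshold=0.4, max_steps=2)
print(full)
print(truncated)
```

In the truncated run, nodes far from the seeds are left without an activation time, which is the price paid for stopping the contagion early.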

contagion map, full contagion map, truncated contagion map, (13 more...)

doi: 10.1063/5.0090114

2203.0172

Country:

North America > United States (0.14)
Europe > Germany > Berlin (0.04)
Europe > United Kingdom (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.94)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Distribution Agnostic Symbolic Representations for Time Series Dimensionality Reduction and Online Anomaly Detection

Bountrogiannis, Konstantinos, Tzagkarakis, George, Tsakalides, Panagiotis

arXiv.org Artificial Intelligence Jun-6-2022

Due to the importance of the lower bounding distances and the attractiveness of symbolic representations, the family of symbolic aggregate approximations (SAX) has been used extensively for encoding time series data. However, typical SAX-based methods rely on two restrictive assumptions: a Gaussian distribution and equiprobable symbols. This paper proposes two novel data-driven SAX-based symbolic representations, distinguished by their discretization steps. The first representation, oriented for general data compaction and indexing scenarios, is based on the combination of kernel density estimation and Lloyd-Max quantization to minimize the information loss and mean squared error in the discretization step. The second method, oriented for high-level mining tasks, employs the Mean-Shift clustering method and is shown to enhance anomaly detection in the lower-dimensional space. Besides, we verify on a theoretical basis a previously observed phenomenon of the intrinsic process that results in a lower-than-expected variance of the intermediate piecewise aggregate approximation. This phenomenon causes an additional information loss but can be avoided with a simple modification. The proposed representations possess all the attractive properties of the conventional SAX method. Furthermore, experimental evaluation on real-world datasets demonstrates their superiority compared to the traditional SAX and an alternative data-driven SAX variant.
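
For context, the two assumptions of conventional SAX can be seen in a minimal sketch of the baseline method (not the data-driven variants the paper proposes). The function name `sax`, the four-letter alphabet, and the requirement that the series length be divisible by the number of segments are simplifications made here for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Conventional SAX: z-normalise, average per segment (PAA), discretise."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]
    seg_len = len(z) // n_segments  # assumes len(series) % n_segments == 0
    # Piecewise aggregate approximation: one mean per segment.
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]
    k = len(alphabet)
    # Breakpoints are standard-normal quantiles, so each symbol is
    # equiprobable *if* the values really are Gaussian.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


print(sax([1, 1, 2, 2, 9, 9, 10, 10], n_segments=4))  # prints "aadd"
```

The breakpoints depend only on the alphabet size, never on the data, which is the restriction a data-driven discretization step is meant to lift.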

dataset, representation, symbolic representation, (17 more...)

doi: 10.1109/TKDE.2022.3174630

2105.09592

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.41)

Interpretable Models Capable of Handling Systematic Missingness in Imbalanced Classes and Heterogeneous Datasets

Ghosh, Sreejita, Baranowski, Elizabeth S., Biehl, Michael, Arlt, Wiebke, Tino, Peter, Bunte, Kerstin

arXiv.org Artificial Intelligence Jun-4-2022

Application of interpretable machine learning techniques on medical datasets facilitates early and fast diagnoses, along with getting deeper insight into the data. Furthermore, the transparency of these models increases trust among application domain experts. Medical datasets face common issues such as heterogeneous measurements, imbalanced classes with limited sample size, and missing data, which hinder the straightforward application of machine learning techniques. In this paper we present a family of prototype-based (PB) interpretable models which are capable of handling these issues. The models introduced in this contribution show comparable or superior performance to alternative techniques applicable in such situations. However, unlike ensemble-based models, which have to compromise on easy interpretation, the PB models here do not. Moreover, we propose a strategy of harnessing the power of ensembles while maintaining the intrinsic interpretability of the PB models, by averaging the model parameter manifolds. All the models were evaluated on a synthetic (publicly available) dataset in addition to detailed analyses of two real-world medical datasets (one publicly available). Results indicated that the models and strategies we introduced addressed the challenges of real-world medical data, while remaining computationally inexpensive and transparent, as well as similar or superior in performance compared to their alternatives.

artificial intelligence, dataset, machine learning, (19 more...)

doi: 10.1016/j.neucom.2025.129405

2206.02056

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Netherlands (0.04)
North America > United States > New York (0.04)
(6 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.93)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)

All About K-Means Clustering

#artificialintelligence May-31-2022, 00:28:28 GMT

Originally published on Towards AI. "Clustering is an unsupervised machine learning technique which finds certain patterns/structures in the unlabeled data to segregate them into different groups, according to their properties."

algorithm, centroid, k-means clustering, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.97)

Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, Markidis, Stefano

arXiv.org Artificial Intelligence May-27-2022

This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.

artificial intelligence, iteration, machine learning, (18 more...)

doi: 10.1007/978-3-031-30442-2_12

2205.13963

Country:

North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Adding Explainability to Clustering - Analytics Vidhya

#artificialintelligence May-26-2022, 11:15:57 GMT

Clustering is an unsupervised technique used to determine the intrinsic groups present in unlabelled data. For instance, a B2C business might be interested in finding segments in its customer base. Clustering is therefore commonly used for use cases such as customer segmentation, market segmentation, pattern recognition, and search result clustering. Standard clustering techniques include K-means, DBSCAN, and hierarchical clustering, among others. Clusters created using techniques like K-means are often not easy to interpret because it is difficult to determine why a particular row of data is assigned to a particular cluster.

algorithm, clustering, decision tree, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)

Best Papers to Read on the Mean Shift Algorithm

#artificialintelligence May-25-2022, 01:25:04 GMT

Abstract: Two important nonparametric approaches to clustering emerged in the 1970s: clustering by level sets or cluster tree as proposed by Hartigan, and clustering by gradient lines or gradient flow as proposed by Fukunaga and Hostetler. In a recent paper, we argue the thesis that these two approaches are fundamentally the same by showing that the gradient flow provides a way to move along the cluster tree. In making a stronger case, we are confronted with the fact that the cluster tree does not define a partition of the entire support of the underlying density, while the gradient flow does.

Abstract: Mean shift is a simple iterative procedure that gradually shifts data points towards the mode, which denotes the highest density of data points in the region. Mean shift algorithms have been effectively used for data denoising, mode seeking, and finding the number of clusters in a dataset in an automated fashion.
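
The shifting procedure described in the second abstract can be sketched in a few lines. This is a minimal one-dimensional illustration, assuming a Gaussian kernel and a fixed number of iterations; the function name `mean_shift_1d`, the bandwidth, and the sample data are hypothetical choices, not taken from either paper.

```python
import math


def mean_shift_1d(points, bandwidth, n_iter=50):
    """Move a copy of every point uphill towards a mode of the density."""
    modes = list(points)
    for _ in range(n_iter):
        shifted = []
        for x in modes:
            # Gaussian kernel weights of all data points, seen from x.
            weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                       for p in points]
            # The new position is the weighted mean of the data.
            shifted.append(sum(w * p for w, p in zip(weights, points))
                           / sum(weights))
        modes = shifted
    return modes


data = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]
modes = mean_shift_1d(data, bandwidth=0.5)
# Points that end up at the same mode belong to the same cluster, so the
# number of distinct modes is the number of clusters found.
print(sorted({round(m, 1) for m in modes}))
```

The number of clusters is not an input: it falls out of how many distinct modes the points converge to, and it depends on the chosen bandwidth.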

algorithm, cell segmentation, construction method, (9 more...)

Genre: Research Report (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.51)

K-Medoid Clustering (PAM) Algorithm in Python

#artificialintelligence May-22-2022, 16:15:22 GMT

Clustering of large-scale data is key to implementing segmentation-based algorithms. Segmentation can include identifying customer groups to facilitate targeted marketing, identifying prescriber groups to allow health care players to reach out to them with the right messaging, and identifying patterns or abnormal values in the data. K-Means is the most popular clustering algorithm adopted across different problem areas, mostly owing to its computational efficiency and ease of understanding. K-Means relies on identifying cluster centers from the data. It alternates between assigning points to these cluster centers using the Euclidean distance metric and recomputing the cluster centers until a convergence criterion is met.
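
The alternation described above can be sketched as follows. This is a minimal illustration rather than the article's own code: the function name `k_means`, the random initialisation from the data, the stopping rule (centers unchanged), and the sample points are all assumptions.

```python
import math
import random


def k_means(points, k, n_iter=100, seed=0):
    """Alternate assignment and center updates until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialise from the data
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center (Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]  # keep the old center if cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence criterion
            break
        centers = new_centers
    return centers, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(points, k=2)
print(centers)
```

Because the centers are means, they need not coincide with any data point. K-medoids, the subject of the article, differs in restricting each center to be an actual data point.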

algorithm, cluster center, dissimilarity, (10 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.39)

Unsupervised Learning Algorithms in One Picture - DataScienceCentral.com

#artificialintelligence May-10-2022, 02:56:05 GMT

Unsupervised learning algorithms are "unsupervised" because you let them run without direct supervision. You feed the data into the algorithm, and the algorithm figures out the patterns. The following picture shows the differences between three of the most popular unsupervised learning algorithms: Principal Component Analysis, k-Means clustering and Hierarchical clustering. The three are closely related, because data clustering is a type of data reduction; PCA can be viewed as a continuous counterpart of K-Means (see Ding & He, 2004).

creativecommon, datasciencecentral, unsupervised learning algorithm, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)