AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

More Discriminative Sentence Embeddings via Semantic Graph Smoothing

Fettal, Chakib, Labiod, Lazhar, Nadif, Mohamed

arXiv.org Artificial Intelligence, Feb-20-2024

Text categorization, also known as document categorization, is a natural language processing (NLP) task that involves arranging texts into coherent groups based on their content. It has many applications such as spam detection (Jindal and Liu, 2007), sentiment analysis (Melville et al., 2009), content recommendation (Pazzani and Billsus, 2007), etc. There are two main approaches to text categorization: classification (supervised learning) and clustering (unsupervised learning).

Simplified versions of this deep architecture have been proposed wherein the learning of large sets of weights has been deemed unnecessary. Their representation learning scheme works similar to Laplacian smoothing and, by extension, graph filtering. We can give as examples of these simplified techniques the simple graph convolution (SGC) (Wu et al., 2019), and the simple spectral graph convolution (S²GC) (Zhu and Koniusz, 2020). Some researchers used GCNs for the ...
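To make the smoothing idea concrete, the following is a minimal sketch of SGC-style feature propagation, assuming the usual symmetric normalization S = D^{-1/2}(A + I)D^{-1/2} applied k times to the node features. The function name `sgc_smooth` and the toy graph are illustrative assumptions, not taken from the paper, and this is not the authors' sentence-embedding method.

```python
import math

def sgc_smooth(adj, feats, k=2):
    """Apply k steps of S = D^-1/2 (A + I) D^-1/2 to the feature rows."""
    n = len(adj)
    # Add self-loops, then compute degrees of the augmented graph.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetrically normalized propagation matrix.
    s = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = [list(row) for row in feats]
    for _ in range(k):
        out = [[sum(s[i][m] * out[m][c] for m in range(n)) for c in range(len(out[0]))]
               for i in range(n)]
    return out

# Hypothetical toy graph: nodes 0 and 1 are connected, node 2 is isolated.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
smoothed = sgc_smooth(adj, feats, k=1)
```

After one step the two connected nodes share the average of their features, while the isolated node is unchanged; no weights are learned at any point.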

classification, graph, representation, (14 more...)

arXiv.org Artificial Intelligence

2402.1289

Country: Europe > France > Île-de-France > Paris > Paris (0.05)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

CCFC++: Enhancing Federated Clustering through Feature Decorrelation

Yan, Jie, Liu, Jing, Ning, Yi-Zi, Zhang, Zhong-Yuan

arXiv.org Artificial Intelligence, Feb-20-2024

Federated clustering has seen notable advancements through its marriage with contrastive learning, exemplified by Cluster-Contrastive Federated Clustering (CCFC). However, CCFC suffers from heterogeneous data across clients, leading to poor and non-robust performance. Our study conducts both empirical and theoretical analyses to understand the impact of heterogeneous data on CCFC. Findings indicate that increased data heterogeneity exacerbates dimensional collapse in CCFC, evidenced by increased correlations across multiple dimensions of the learned representations. To address this, we introduce a decorrelation regularizer to CCFC. Benefiting from the regularizer, the improved method effectively mitigates the detrimental effects of data heterogeneity and achieves superior performance, as evidenced by a marked increase in NMI scores, with the gain reaching as high as 0.32 in the most pronounced case.
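The abstract does not state the form of the regularizer. As a rough illustration of what penalizing correlated dimensions can look like, the sketch below assumes a penalty equal to the mean squared off-diagonal entry of the correlation matrix of the representations. The function name `decorrelation_penalty` and the toy inputs are hypothetical, and the actual CCFC++ regularizer may differ.

```python
import math

def decorrelation_penalty(reps):
    """Mean squared off-diagonal correlation between representation dimensions.

    Assumed form for illustration only; requires at least two dimensions.
    """
    n, d = len(reps), len(reps[0])
    means = [sum(row[j] for row in reps) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in reps]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    stds = [math.sqrt(sum(row[j] ** 2 for row in centered) / n) or 1.0 for j in range(d)]
    total = 0.0
    for j in range(d):
        for l in range(d):
            if j != l:
                cov = sum(row[j] * row[l] for row in centered) / n
                total += (cov / (stds[j] * stds[l])) ** 2
    return total / (d * (d - 1))

# Collapsed representations: the second dimension is a multiple of the first.
collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
# Spread representations: the two dimensions are uncorrelated.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
penalty_collapsed = decorrelation_penalty(collapsed)
penalty_spread = decorrelation_penalty(spread)
```

Under this assumed form, fully collapsed dimensions receive the maximum penalty of 1 and uncorrelated dimensions receive 0, so adding the term to a training loss would push representations away from dimensional collapse.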

data heterogeneity, representation, scenario, (12 more...)

arXiv.org Artificial Intelligence

2402.12852

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Evaluation of Country Dietary Habits Using Machine Learning Techniques in Relation to Deaths from COVID-19

García-Ordás, María Teresa, Arias, Natalia, Benavides, Carmen, García-Olalla, Oscar, Benítez-Andrades, José Alberto

arXiv.org Artificial Intelligence, Feb-19-2024

COVID-19 has affected almost every country in the world. The large number of infected people and the different mortality rates between countries have given rise to many hypotheses about the key factors that make the virus so lethal in some places. In this study, the eating habits of 170 countries were evaluated in order to find correlations between these habits and mortality rates caused by COVID-19, using machine learning techniques that group the countries according to the distribution of fat, energy, and protein across 23 different types of food, as well as the amount ingested in kilograms. Results show that obesity and high consumption of fats appear in countries with the highest death rates, whereas countries with a lower rate have a higher level of cereal consumption accompanied by a lower total average intake of kilocalories.

covid-19, death, information, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.3390/healthcare8040371

2402.12558

Country:

Asia > China > Hubei Province > Wuhan (0.05)
Europe > Spain > Castile and León > León Province > León (0.05)
Oceania > New Zealand (0.04)
(30 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Kernel KMeans clustering splits for end-to-end unsupervised decision trees

Ohl, Louis, Mattei, Pierre-Alexandre, Leclercq, Mickaël, Droit, Arnaud, Precioso, Frédéric

arXiv.org Machine Learning, Feb-19-2024

Trees are convenient models for obtaining explainable predictions on relatively small datasets. Although there are many proposals for the end-to-end construction of such trees in supervised learning, learning a tree end-to-end for clustering without labels remains an open challenge. As most works focus on interpreting with trees the result of another clustering algorithm, we present here a novel end-to-end trained unsupervised binary tree for clustering: Kauri. This method performs a greedy maximisation of the kernel KMeans objective without requiring the definition of centroids. We compare this model on multiple datasets with recent unsupervised trees and show that Kauri performs identically when using a linear kernel. For other kernels, Kauri often outperforms the concatenation of kernel KMeans and a CART decision tree.
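The phrase "without requiring the definition of centroids" rests on a standard identity: minimizing the kernel KMeans distortion is equivalent to maximizing the sum, over clusters, of the within-cluster kernel sums divided by cluster size. The sketch below evaluates only that objective for a given labelling. It is not the Kauri tree-building procedure, and the function name and toy data are illustrative assumptions.

```python
def kernel_kmeans_objective(kernel, labels):
    """Sum over clusters of (1/|C|) * sum_{i,j in C} K[i][j]; larger is better."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for members in clusters.values():
        total += sum(kernel[i][j] for i in members for j in members) / len(members)
    return total

# Hypothetical 1-D data with two obvious groups, and a linear kernel K = x * y.
points = [0.0, 0.1, 5.0, 5.1]
kernel = [[x * y for y in points] for x in points]

good = kernel_kmeans_objective(kernel, [0, 0, 1, 1])  # groups nearby points
bad = kernel_kmeans_objective(kernel, [0, 1, 0, 1])   # mixes the two groups
```

Only the kernel matrix and the labels are needed, so a greedy procedure can compare candidate splits by this score without ever forming a centroid in feature space.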

algorithm, dataset, kernel, (17 more...)

arXiv.org Machine Learning

2402.12232

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > El Salvador (0.04)
North America > Canada > Quebec (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Cluster Metric Sensitivity to Irrelevant Features

McCrory, Miles, Thomas, Spencer A.

arXiv.org Machine Learning, Feb-19-2024

Clustering algorithms are used extensively in data analysis for data exploration and discovery. Technological advancements lead to continual growth of data in terms of volume, dimensionality and complexity. This provides great opportunities in data analytics as the data can be interrogated for many different purposes. This, however, leads to challenges, such as identification of relevant features for a given task. In supervised tasks, one can utilise a number of methods to optimise the input features for the task objective (e.g. classification accuracy). In unsupervised problems, such tools are not readily available, in part due to an inability to quantify feature relevance in unlabeled tasks. In this paper, we investigate the sensitivity of clustering performance to noisy uncorrelated variables iteratively added to baseline datasets with well defined clusters. We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways. We observe a resilience to very high proportions of irrelevant features for adjusted Rand index (ARI) and normalised mutual information (NMI) when the irrelevant features are Gaussian distributed. For uniformly distributed irrelevant features, we notice the resilience of ARI and NMI is dependent on the dimensionality of the data and exhibits tipping points between high scores and near zero. Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to irrelevant added features, exhibiting large changes in score for comparably low proportions of irrelevant features regardless of underlying distribution or data scaling. As such, the Silhouette Coefficient and the Davies-Bouldin score are good candidates for optimising feature selection in unsupervised clustering tasks.
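For readers unfamiliar with the label-based metrics compared here, the following is a minimal sketch of the adjusted Rand index computed from pair counts. It follows the standard definition rather than any code from the paper, assumes at least two samples, and the function name is an illustrative choice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(true_labels)
    # Pairs of samples grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Same partition under renamed labels, then a partition that cuts across it.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI needs ground-truth labels, whereas the Silhouette Coefficient and the Davies-Bouldin score use only the data and the predicted clusters, which is why the latter two are the ones proposed for unsupervised feature selection.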

dataset, dimensionality, random variable, (14 more...)

arXiv.org Machine Learning

2402.12008

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > South Africa (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Empirical Density Estimation based on Spline Quasi-Interpolation with applications to Copulas clustering modeling

Tamborrino, Cristiano, Falini, Antonella, Mazzia, Francesca

arXiv.org Machine Learning, Feb-18-2024

Density estimation is a fundamental technique employed in various fields to model and understand the underlying distribution of data. The primary objective of density estimation is to estimate the probability density function of a random variable. This process is particularly valuable when dealing with univariate or multivariate data and is essential for tasks such as clustering, anomaly detection, and generative modeling. In this paper we propose a monovariate approximation of the density using spline quasi-interpolation and apply it in the context of clustering modeling. The clustering technique used is based on the construction of suitable multivariate distributions, which rely on the estimation of the monovariate empirical densities (marginals). Such an approximation is achieved by using the proposed spline quasi-interpolation, while the joint distribution that models the sought clustering partition is constructed with the use of copula functions. In particular, since copulas can capture the dependence between the features of the data independently from the marginal distributions, a finite mixture copula model is proposed. The presented algorithm is validated on artificial and real datasets.

algorithm, copula, estimation, (14 more...)

arXiv.org Machine Learning

2402.11552

Country:

North America > United States > Wisconsin (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Empirical and Experimental Insights into Data Mining Techniques for Crime Prediction: A Comprehensive Survey

Taha, Kamal

arXiv.org Artificial Intelligence, Feb-17-2024

This survey paper presents a comprehensive analysis of crime prediction methodologies, exploring the various techniques and technologies utilized in this area. The paper covers the statistical methods, machine learning algorithms, and deep learning techniques employed to analyze crime data, while also examining their effectiveness and limitations. We propose a methodological taxonomy that classifies crime prediction algorithms into specific techniques. This taxonomy is structured into four tiers: methodology category, methodology sub-category, methodology techniques, and methodology sub-techniques. Empirical and experimental evaluations are provided to rank the different techniques. The empirical evaluation assesses the crime prediction techniques based on four criteria, while the experimental evaluation ranks the algorithms that employ the same sub-technique, the different sub-techniques that employ the same technique, the different techniques that employ the same methodology sub-category, the different methodology sub-categories within the same category, and the different methodology categories. The combination of methodological taxonomy, empirical evaluations, and experimental comparisons allows for a nuanced and comprehensive understanding of crime prediction algorithms, aiding researchers in making informed decisions. Finally, the paper provides a glimpse into the future of crime prediction techniques, highlighting potential advancements and opportunities for further research in this field.

crime prediction, effectiveness, prediction accuracy, (16 more...)

arXiv.org Artificial Intelligence

2403.0078

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
North America > United States > Maryland > Baltimore (0.14)
(50 more...)

Genre:

Overview (1.00)
Instructional Material (1.00)
Research Report > New Finding (0.93)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
(9 more...)

Unsupervised Feature Selection for the k-means Clustering Problem

Neural Information Processing Systems, Feb-16-2024, 11:50:53 GMT

We present a novel feature selection algorithm for the $k$-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter $\epsilon \in (0,1)$, selects and appropriately rescales in an unsupervised manner $\Theta(k \log(k / \epsilon) / \epsilon^2)$ features from a dataset of arbitrary dimensions. We prove that, if we run any $\gamma$-approximate $k$-means algorithm ($\gamma \geq 1$) on the features selected using our method, we can find a $(1 + (1 + \epsilon)\gamma)$-approximate partition with high probability.

algorithm, clustering problem, unsupervised feature selection

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Random Projections for k-means Clustering

Neural Information Processing Systems, Feb-16-2024, 10:06:55 GMT

We prove that any set of $n$ points in $d$ dimensions (rows in a matrix $A \in \mathbb{R}^{n \times d}$) can be projected into $t = \Omega(k / \epsilon^2)$ dimensions, for any $\epsilon \in (0, 1/3)$, in $O(n d \lceil \epsilon^{-2} k / \log(d) \rceil)$ time, such that with constant probability the optimal $k$-partition of the point set is preserved within a factor of $2 + \epsilon$. The projection is done by post-multiplying $A$ with a $d \times t$ random matrix $R$ having entries $+1/\sqrt{t}$ or $-1/\sqrt{t}$ with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
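The projection step described above is simple enough to sketch directly. The code below is a plain-Python illustration of post-multiplying the data by a random sign matrix scaled by 1/sqrt(t). The function name, the seed parameter, and the toy matrix are assumptions for illustration, and the naive matrix product does not attempt to reach the running time stated in the abstract.

```python
import math
import random

def random_sign_projection(rows, t, seed=0):
    """Project each d-dimensional row to t dimensions via a random +-1/sqrt(t) matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    scale = 1.0 / math.sqrt(t)
    # R is d x t with entries +scale or -scale, each with probability 1/2.
    r = [[scale if rng.random() < 0.5 else -scale for _ in range(t)] for _ in range(d)]
    return [[sum(row[i] * r[i][j] for i in range(d)) for j in range(t)] for row in rows]

# Hypothetical data: three points in four dimensions, projected to two.
a = [[1.0, 0.0, 2.0, -1.0], [0.5, 3.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
projected = random_sign_projection(a, t=2)

# A unit basis vector keeps unit norm: each output entry is +-1/sqrt(t).
unit = random_sign_projection([[1.0, 0.0, 0.0, 0.0]], t=4)[0]
```

In practice `t` would be chosen in line with the stated bound on the target dimension, and a k-means algorithm would then be run on the projected rows instead of the original ones.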

clustering, probability, random projection, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Does Twinning Vehicular Networks Enhance Their Performance in Dense Areas?

Al-Shareeda, Sarah, Oktug, Sema F., Yaslan, Yusuf, Yurdakul, Gokhan, Canberk, Berk

arXiv.org Artificial Intelligence, Feb-16-2024

This paper investigates the potential of Digital Twins (DTs) to enhance network performance in densely populated urban areas, specifically focusing on vehicular networks. The study comprises two phases. In Phase I, we utilize traffic data and AI clustering to identify critical locations, particularly in crowded urban areas with high accident rates. In Phase II, we evaluate the advantages of twinning vehicular networks through three deployment scenarios: edge-based twin, cloud-based twin, and hybrid-based twin. Our analysis demonstrates that twinning significantly reduces network delays, with virtual twins outperforming physical networks. Virtual twins maintain low delays even with increased vehicle density, such as 15.05 seconds for 300 vehicles. Moreover, they exhibit faster computational speeds, with cloud-based twins being 1.7 times faster than edge twins in certain scenarios. These findings provide insights for efficient vehicular communication and underscore the potential of virtual twins in enhancing vehicular networks in crowded areas while emphasizing the importance of considering real-world factors when making deployment decisions.

artificial intelligence, cloud computing, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2402.10701

Country:

Asia > Middle East > UAE (0.46)
Asia > Middle East > Republic of Türkiye (0.29)
North America > United States (0.14)
Europe > Slovakia (0.14)

Genre: Research Report (1.00)

Industry:

Transportation (1.00)
Information Technology > Services (0.69)
Energy > Oil & Gas (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.93)
(2 more...)
