 Clustering


More Discriminative Sentence Embeddings via Semantic Graph Smoothing

arXiv.org Artificial Intelligence

Text categorization, also known as document categorization, is a natural language processing (NLP) task that involves arranging texts into coherent groups based on their content. It has many applications such as spam detection (Jindal and Liu, 2007), sentiment analysis (Melville et al., 2009), content recommendation (Pazzani and Billsus, 2007), etc. There are two main approaches to text categorization: classification (supervised learning) and clustering (unsupervised learning).

Simplified versions of this deep architecture have been proposed wherein the learning of large sets of weights has been deemed unnecessary. Their representation learning scheme works similar to Laplacian smoothing and, by extension, graph filtering. We can give as examples of these simplified techniques the simple graph convolution (SGC) (Wu et al., 2019), and the simple spectral graph convolution (S²GC) (Zhu and Koniusz, 2020). Some researchers used GCNs for the
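
The weight-free smoothing attributed to SGC in this excerpt can be sketched as repeated multiplication of the feature matrix by a normalized adjacency matrix with self-loops. This is a minimal illustration only: the function names, the dense list-of-lists representation, and the two-node toy graph are assumptions made here, and the paper's own semantic-graph construction is not shown in the excerpt.

```python
import math


def normalized_adjacency(adj):
    """Return S = D^{-1/2} (A + I) D^{-1/2} for a dense 0/1 adjacency matrix.

    Assumes adj is symmetric with a zero diagonal (self-loops are added here).
    """
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def smooth_features(adj, features, k):
    """Apply the fixed filter k times (S^k X); no weights are learned."""
    s = normalized_adjacency(adj)
    out = features
    for _ in range(k):
        out = matmul(s, out)
    return out


# Hypothetical toy graph: two connected nodes with one feature each.
adj = [[0, 1], [1, 0]]
features = [[1.0], [3.0]]
smoothed = smooth_features(adj, features, k=1)
```

On this toy graph a single smoothing step averages the two node features, which is the low-pass behaviour that the Laplacian-smoothing analogy refers to.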


CCFC++: Enhancing Federated Clustering through Feature Decorrelation

arXiv.org Artificial Intelligence

Federated clustering has seen notable advancements through its marriage with contrastive learning, exemplified by Cluster-Contrastive Federated Clustering (CCFC). However, CCFC suffers from heterogeneous data across clients, leading to poor and non-robust performance. Our study conducts both empirical and theoretical analyses to understand the impact of heterogeneous data on CCFC. Findings indicate that increased data heterogeneity exacerbates dimensional collapse in CCFC, evidenced by increased correlations across multiple dimensions of the learned representations. To address this, we introduce a decorrelation regularizer to CCFC. Benefiting from the regularizer, the improved method effectively mitigates the detrimental effects of data heterogeneity, and achieves superior performance, as evidenced by a marked increase in NMI scores, with the gain reaching as high as 0.32 in the most pronounced case.
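
The abstract does not state the form of the decorrelation regularizer. One common choice, assumed here purely for illustration and not claimed to be the authors' formulation, penalizes the mean squared entry of the correlation matrix of a batch of representations. All names and the toy batches below are hypothetical.

```python
import math


def correlation_matrix(z):
    """Pearson correlation matrix of the columns of z (n samples x d dims).

    Assumes every column has non-zero variance.
    """
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    stds = [math.sqrt(sum(row[j] ** 2 for row in centered) / n)
            for j in range(d)]
    return [[sum(row[a] * row[b] for row in centered) / (n * stds[a] * stds[b])
             for b in range(d)]
            for a in range(d)]


def decorrelation_penalty(z):
    """Assumed penalty: (1 / d^2) * squared Frobenius norm of the correlation matrix."""
    k = correlation_matrix(z)
    d = len(k)
    return sum(v * v for row in k for v in row) / (d * d)


# Hypothetical batches: one with perfectly correlated dimensions
# (a collapsed representation), one with uncorrelated dimensions.
collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

penalty_collapsed = decorrelation_penalty(collapsed)
penalty_spread = decorrelation_penalty(spread)
```

Under this assumed form the penalty is largest when all dimensions move together and smallest (1/d) when they are uncorrelated, so adding it to a training loss discourages the correlated dimensions that the abstract associates with dimensional collapse.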


Evaluation of Country Dietary Habits Using Machine Learning Techniques in Relation to Deaths from COVID-19

arXiv.org Artificial Intelligence

COVID-19 has affected almost every country in the world. The large number of infected people and the different mortality rates between countries have given rise to many hypotheses about the key factors that make the virus so lethal in some places. In this study, the eating habits of 170 countries were evaluated in order to find correlations between these habits and mortality rates caused by COVID-19, using machine learning techniques that group the countries together according to the different distribution of fat, energy, and protein across 23 different types of food, as well as the amount ingested in kilograms. Results show that obesity and high consumption of fats appear in countries with the highest death rates, whereas countries with a lower rate have a higher level of cereal consumption accompanied by a lower total average intake of kilocalories.


Kernel KMeans clustering splits for end-to-end unsupervised decision trees

arXiv.org Machine Learning

Trees are convenient models for obtaining explainable predictions on relatively small datasets. Although there are many proposals for the end-to-end construction of such trees in supervised learning, learning a tree end-to-end for clustering without labels remains an open challenge. As most works focus on interpreting with trees the result of another clustering algorithm, we present here a novel end-to-end trained unsupervised binary tree for clustering: Kauri. This method performs a greedy maximisation of the kernel KMeans objective without requiring the definition of centroids. We compare this model on multiple datasets with recent unsupervised trees and show that Kauri performs identically when using a linear kernel. For other kernels, Kauri often outperforms the concatenation of kernel KMeans and a CART decision tree.
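
The remark that the kernel KMeans objective can be handled without defining centroids rests on a standard identity: the within-cluster scatter can be written using only kernel entries. The sketch below checks that identity for a linear kernel on a hypothetical toy dataset; it is not Kauri's tree-building procedure, and all function names are assumptions.

```python
def linear_kernel(X):
    """Gram matrix K[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]


def group_by_label(labels):
    """Map each label to the list of sample indices carrying it."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    return clusters


def kernel_kmeans_objective(K, labels):
    """Sum over clusters of (1/|C|) * sum_{i,j in C} K[i][j]; larger is better."""
    return sum(
        sum(K[i][j] for i in members for j in members) / len(members)
        for members in group_by_label(labels).values()
    )


def within_cluster_ss(X, labels):
    """Classical k-means cost computed with explicit centroids."""
    dim = len(X[0])
    total = 0.0
    for members in group_by_label(labels).values():
        centroid = [sum(X[i][k] for i in members) / len(members)
                    for k in range(dim)]
        total += sum((X[i][k] - centroid[k]) ** 2
                     for i in members for k in range(dim))
    return total


# Hypothetical toy data with two obvious groups.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = [0, 0, 0, 1, 1]

K = linear_kernel(X)
trace = sum(K[i][i] for i in range(len(X)))
cost_from_kernel = trace - kernel_kmeans_objective(K, labels)
cost_from_centroids = within_cluster_ss(X, labels)
```

Because the trace of the kernel matrix does not depend on the partition, maximising the kernel objective is equivalent to minimising the within-cluster scatter, and the same expression applies to any other positive semi-definite kernel where centroids live in an implicit feature space.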


Cluster Metric Sensitivity to Irrelevant Features

arXiv.org Machine Learning

Clustering algorithms are used extensively in data analysis for data exploration and discovery. Technological advancements lead to continual growth of data in terms of volume, dimensionality and complexity. This provides great opportunities in data analytics as the data can be interrogated for many different purposes. This, however, leads to challenges, such as identification of relevant features for a given task. In supervised tasks, one can utilise a number of methods to optimise the input features for the task objective (e.g. classification accuracy). In unsupervised problems, such tools are not readily available, in part due to an inability to quantify feature relevance in unlabeled tasks. In this paper, we investigate the sensitivity of clustering performance to noisy uncorrelated variables iteratively added to baseline datasets with well defined clusters. We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways. We observe a resilience to very high proportions of irrelevant features for adjusted Rand index (ARI) and normalised mutual information (NMI) when the irrelevant features are Gaussian distributed. For uniformly distributed irrelevant features, we notice the resilience of ARI and NMI is dependent on the dimensionality of the data and exhibits tipping points between high scores and near zero. Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to irrelevant added features, exhibiting large changes in score for comparably low proportions of irrelevant features regardless of underlying distribution or data scaling. As such, the Silhouette Coefficient and the Davies-Bouldin score are good candidates for optimising feature selection in unsupervised clustering tasks.
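
The reported sensitivity of the Silhouette Coefficient can be illustrated with a small sketch. To keep it short, it holds the labels fixed at their known values instead of re-running $k$-means after each added feature, which is a simplification of the paper's protocol. The dataset, the noise scale, and the function names are all assumptions.

```python
import math
import random


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def add_irrelevant_features(points, count, rng, sigma=3.0):
    """Append `count` Gaussian noise columns that carry no cluster information."""
    return [p + [rng.gauss(0.0, sigma) for _ in range(count)] for p in points]


rng = random.Random(0)
labels = [0] * 30 + [1] * 30
# Two well-separated 2-D clusters centred at (0, 0) and (10, 10).
clean = [[rng.gauss(10.0 * lab, 1.0), rng.gauss(10.0 * lab, 1.0)] for lab in labels]
noisy = add_irrelevant_features(clean, 20, rng)

score_clean = silhouette(clean, labels)
score_noisy = silhouette(noisy, labels)
```

Even though the partition is unchanged, the noise columns inflate within-cluster and between-cluster distances alike, so the score on the noisy data falls well below the score on the clean data.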


Empirical Density Estimation based on Spline Quasi-Interpolation with applications to Copulas clustering modeling

arXiv.org Machine Learning

Density estimation is a fundamental technique employed in various fields to model and to understand the underlying distribution of data. The primary objective of density estimation is to estimate the probability density function of a random variable. This process is particularly valuable when dealing with univariate or multivariate data and is essential for tasks such as clustering, anomaly detection, and generative modeling. In this paper we propose a monovariate approximation of the density using spline quasi-interpolation and apply it in the context of clustering modeling. The clustering technique used is based on the construction of suitable multivariate distributions which rely on the estimation of the monovariate empirical densities (marginals). Such an approximation is achieved by using the proposed spline quasi-interpolation, while the joint distributions used to model the sought clustering partition are constructed with the use of copula functions. In particular, since copulas can capture the dependence between the features of the data independently from the marginal distributions, a finite mixture copula model is proposed. The presented algorithm is validated on artificial and real datasets.


Empirical and Experimental Insights into Data Mining Techniques for Crime Prediction: A Comprehensive Survey

arXiv.org Artificial Intelligence

This survey paper presents a comprehensive analysis of crime prediction methodologies, exploring the various techniques and technologies utilized in this area. The paper covers the statistical methods, machine learning algorithms, and deep learning techniques employed to analyze crime data, while also examining their effectiveness and limitations. We propose a methodological taxonomy that classifies crime prediction algorithms into specific techniques. This taxonomy is structured into four tiers: methodology category, methodology sub-category, methodology techniques, and methodology sub-techniques. Empirical and experimental evaluations are provided to rank the different techniques. The empirical evaluation assesses the crime prediction techniques based on four criteria, while the experimental evaluation ranks the algorithms that employ the same sub-technique, the different sub-techniques that employ the same technique, the different techniques that employ the same methodology sub-category, the different methodology sub-categories within the same category, and the different methodology categories. The combination of methodological taxonomy, empirical evaluations, and experimental comparisons allows for a nuanced and comprehensive understanding of crime prediction algorithms, aiding researchers in making informed decisions. Finally, the paper provides a glimpse into the future of crime prediction techniques, highlighting potential advancements and opportunities for further research in this field.


Unsupervised Feature Selection for the k-means Clustering Problem

Neural Information Processing Systems

We present a novel feature selection algorithm for the $k$-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter $\epsilon \in (0,1)$, selects and appropriately rescales in an unsupervised manner $\Theta(k \log(k / \epsilon) / \epsilon^2)$ features from a dataset of arbitrary dimensions. We prove that, if we run any $\gamma$-approximate $k$-means algorithm ($\gamma \geq 1$) on the features selected using our method, we can find a $(1 + (1 + \epsilon)\gamma)$-approximate partition with high probability.


Random Projections for k-means Clustering

Neural Information Processing Systems

We prove that any set of $n$ points in $d$ dimensions (rows in a matrix $A \in \mathbb{R}^{n \times d}$) can be projected into $t = \Omega(k / \epsilon^2)$ dimensions, for any $\epsilon \in (0, 1/3)$, in $O(n d \lceil \epsilon^{-2} k / \log(d) \rceil)$ time, such that with constant probability the optimal $k$-partition of the point set is preserved within a factor of $2 + \epsilon$. The projection is done by post-multiplying $A$ with a $d \times t$ random matrix $R$ having entries $+1/\sqrt{t}$ or $-1/\sqrt{t}$ with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
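
The projection step described above is simple enough to sketch directly. The example below builds a random sign matrix with entries of magnitude $1/\sqrt{t}$ and post-multiplies a data matrix by it. The sizes, the Gaussian toy data, and the function names are assumptions, and the plain matrix product used here does not reproduce the faster running time quoted in the abstract.

```python
import math
import random


def random_sign_matrix(d, t, rng):
    """d x t matrix whose entries are +1/sqrt(t) or -1/sqrt(t) with equal probability."""
    scale = 1.0 / math.sqrt(t)
    return [[scale if rng.random() < 0.5 else -scale for _ in range(t)]
            for _ in range(d)]


def project(A, R):
    """Post-multiply A (n x d) by R (d x t), giving an n x t matrix."""
    columns = list(zip(*R))
    return [[sum(a * r for a, r in zip(row, col)) for col in columns]
            for row in A]


rng = random.Random(0)
n, d, t = 20, 1000, 200  # hypothetical sizes chosen for a quick run

A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
R = random_sign_matrix(d, t, rng)
B = project(A, R)

# Sanity check: squared row norms are preserved in expectation,
# so each ratio should be close to 1.
ratios = [sum(b * b for b in rb) / sum(a * a for a in ra)
          for ra, rb in zip(A, B)]
```

The norm ratios are only a quick sanity check. The guarantee in the abstract concerns the cost of the optimal $k$-partition, which would be assessed by running a $k$-means solver on `B` and evaluating the resulting partition on `A`.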


Does Twinning Vehicular Networks Enhance Their Performance in Dense Areas?

arXiv.org Artificial Intelligence

This paper investigates the potential of Digital Twins (DTs) to enhance network performance in densely populated urban areas, specifically focusing on vehicular networks. The study comprises two phases. In Phase I, we utilize traffic data and AI clustering to identify critical locations, particularly in crowded urban areas with high accident rates. In Phase II, we evaluate the advantages of twinning vehicular networks through three deployment scenarios: edge-based twin, cloud-based twin, and hybrid-based twin. Our analysis demonstrates that twinning significantly reduces network delays, with virtual twins outperforming physical networks. Virtual twins maintain low delays even with increased vehicle density, such as 15.05 seconds for 300 vehicles. Moreover, they exhibit faster computational speeds, with cloud-based twins being 1.7 times faster than edge twins in certain scenarios. These findings provide insights for efficient vehicular communication and underscore the potential of virtual twins in enhancing vehicular networks in crowded areas while emphasizing the importance of considering real-world factors when making deployment decisions.