AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Influence of various text embeddings on clustering performance in NLP

Saha, Rohan

arXiv.org Artificial IntelligenceMay-4-2023

With the advent of e-commerce platforms, reviews are crucial for customers to assess the credibility of a product. The star ratings do not always match the review text written by the customer. For example, a three star rating (out of five) may be incongruous with the review text, which may be more suitable for a five star review. A clustering approach can be used to relabel the correct star ratings by grouping the text reviews into individual groups. In this work, we explore the task of choosing different text embeddings to represent these reviews and also explore the impact the embedding choice has on the performance of various classes of clustering algorithms. We use contextual (BERT) and non-contextual (Word2Vec) text embeddings to represent the text and measure their impact of three classes on clustering algorithms - partitioning based (KMeans), single linkage agglomerative hierarchical, and density based (DBSCAN and HDBSCAN), each with various experimental settings. We use the silhouette score, adjusted rand index score, and cluster purity score metrics to evaluate the performance of the algorithms and discuss the impact of different embeddings on the clustering performance. Our results indicate that the type of embedding chosen drastically affects the performance of the algorithm, the performance varies greatly across different types of clustering algorithms, no embedding type is better than the other, and DBSCAN outperforms KMeans and single linkage agglomerative clustering but also labels more data points as outliers. We provide a thorough comparison of the performances of different algorithms and provide numerous ideas to foster further research in the domain of text clustering.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2305.03144

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Natural language processing on customer note data

Hilditch, Andrew, Webb, David, Baca, Jozef, Armitage, Tom, Shardlow, Matthew, Appleby, Peter

arXiv.org Artificial IntelligenceMay-3-2023

Automatic analysis of customer data for businesses is an area that is of interest to companies. Business to business data is studied rarely in academia due to the sensitive nature of such information. Applying natural language processing can speed up the analysis of prohibitively large sets of data. This paper addresses this subject and applies sentiment analysis, topic modelling and keyword extraction to a B2B data set. We show that accurate sentiment can be extracted from the notes automatically and the notes can be sorted by relevance into different topics. We see that without clear separation topics can lack relevance to a business context.

large language model, machine learning, sentiment, (22 more...)

arXiv.org Artificial Intelligence

2305.02029

Country:

Europe > United Kingdom (0.14)
Asia > Middle East > Jordan (0.04)
Asia > Indonesia (0.04)

Genre:

Workflow (0.93)
Overview (0.67)
Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
(2 more...)

Add feedback

ImGCL: Revisiting Graph Contrastive Learning on Imbalanced Node Classification

Zeng, Liang, Li, Lanqing, Gao, Ziqi, Zhao, Peilin, Li, Jian

arXiv.org Artificial IntelligenceMay-3-2023

Graph contrastive learning (GCL) has attracted a surge of attention due to its superior performance for learning node/graph representations without labels. However, in practice, the underlying class distribution of unlabeled nodes for the given graph is usually imbalanced. This highly imbalanced class distribution inevitably deteriorates the quality of learned node representations in GCL. Indeed, we empirically find that most state-of-the-art GCL methods cannot obtain discriminative representations and exhibit poor performance on imbalanced node classification. Motivated by this observation, we propose a principled GCL framework on Imbalanced node classification (ImGCL), which automatically and adaptively balances the representations learned from GCL without labels. Specifically, we first introduce the online clustering based progressively balanced sampling (PBS) method with theoretical rationale, which balances the training sets based on pseudo-labels obtained from learned representations in GCL. We then develop the node centrality based PBS method to better preserve the intrinsic structure of graphs, by upweighting the important nodes of the given graph. Extensive experiments on multiple imbalanced graph datasets and imbalanced settings demonstrate the effectiveness of our proposed framework, which significantly improves the performance of the recent state-of-the-art GCL methods. Further experimental ablations and analyses show that the ImGCL framework consistently improves the representation quality of nodes in under-represented (tail) classes.

artificial intelligence, machine learning, representation, (16 more...)

arXiv.org Artificial Intelligence

2205.11332

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

FastAMI -- a Monte Carlo Approach to the Adjustment for Chance in Clustering Comparison Metrics

Klede, Kai, Schwinn, Leo, Zanca, Dario, Eskofier, Björn

arXiv.org Artificial IntelligenceMay-3-2023

Clustering is at the very core of machine learning, and its applications proliferate with the increasing availability of data. However, as datasets grow, comparing clusterings with an adjustment for chance becomes computationally difficult, preventing unbiased ground-truth comparisons and solution selection. We propose FastAMI, a Monte Carlo-based method to efficiently approximate the Adjusted Mutual Information (AMI) and extend it to the Standardized Mutual Information (SMI). The approach is compared with the exact calculation and a recently developed variant of the AMI based on pairwise permutations, using both synthetic and real data. In contrast to the exact calculation our method is fast enough to enable these adjusted information-theoretic comparisons for large datasets while maintaining considerably more accurate results than the pairwise approach.

artificial intelligence, machine learning, mutual information, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1609/aaai.v37i7.26003

2305.03022

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.30)

Add feedback

Jacobian-Scaled K-means Clustering for Physics-Informed Segmentation of Reacting Flows

Barwey, Shivam, Raman, Venkat

arXiv.org Artificial IntelligenceMay-2-2023

This work introduces Jacobian-scaled K-means (JSK-means) clustering, which is a physics-informed clustering strategy centered on the K-means framework. The method allows for the injection of underlying physical knowledge into the clustering procedure through a distance function modification: instead of leveraging conventional Euclidean distance vectors, the JSK-means procedure operates on distance vectors scaled by matrices obtained from dynamical system Jacobians evaluated at the cluster centroids. The goal of this work is to show how the JSK-means algorithm -- without modifying the input dataset -- produces clusters that capture regions of dynamical similarity, in that the clusters are redistributed towards high-sensitivity regions in phase space and are described by similarity in the source terms of samples instead of the samples themselves. The algorithm is demonstrated on a complex reacting flow simulation dataset (a channel detonation configuration), where the dynamics in the thermochemical composition space are known through the highly nonlinear and stiff Arrhenius-based chemical source terms. Interpretations of cluster partitions in both physical space and composition space reveal how JSK-means shifts clusters produced by standard K-means towards regions of high chemical sensitivity (e.g., towards regions of peak heat release rate near the detonation reaction zone). The findings presented here illustrate the benefits of utilizing Jacobian-scaled distances in clustering techniques, and the JSK-means method in particular displays promising potential for improving former partition-based modeling strategies in reacting flow (and other multi-physics) applications.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2305.01539

Country: North America > United States (0.67)

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Beyond Homophily: Reconstructing Structure for Graph-agnostic Clustering

Pan, Erlin, Kang, Zhao

arXiv.org Artificial IntelligenceMay-2-2023

However, they are designed on of connected nodes belong to different classes (Pei et al., the homophilic assumption of graph and clustering 2020; Xie et al., 2023). Traditional GNNs learn representations on heterophilic graph is overlooked. Due to via message passing mechanism under the assumption the lack of labels, it is impossible to first identify of homophily (Fang et al., 2022). Facing heterophilic graphs, a graph as homophilic or heterophilic before a previous approaches mainly suffer two limitations. On the suitable GNN model can be found. Hence, clustering one hand, the local neighbors in a graph are nodes that are on real-world graph with various levels of proximally located, while nodes that are semantically similar homophily poses a new challenge to the graph might be far apart on heterophilic graph (Zhu et al., research community. To fill this gap, we propose 2020). Thus, existing techniques fail to capture long-range a novel graph clustering method, which contains information from distant nodes. On the other hand, they three key components: graph reconstruction, don't distinguish similar and dissimilar neighbors, which a mixed filter, and dual graph clustering carry different amounts of information.

artificial intelligence, graph, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2305.02931

Country:

North America > United States > Texas (0.06)
North America > United States > Wisconsin (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Add feedback

A Parameter-free Adaptive Resonance Theory-based Topological Clustering Algorithm Capable of Continual Learning

Masuyama, Naoki, Takebayashi, Takanori, Nojima, Yusuke, Loo, Chu Kiong, Ishibuchi, Hisao, Wermter, Stefan

arXiv.org Artificial IntelligenceMay-2-2023

In general, a similarity threshold (i.e., a vigilance parameter) for a node learning process in Adaptive Resonance Theory (ART)-based algorithms has a significant impact on clustering performance. In addition, an edge deletion threshold in a topological clustering algorithm plays an important role in adaptively generating well-separated clusters during a self-organizing process. In this paper, we propose a new parameter-free ART-based topological clustering algorithm capable of continual learning by introducing parameter estimation methods. Experimental results with synthetic and real-world datasets show that the proposed algorithm has superior clustering performance to the state-of-the-art clustering algorithms without any parameter pre-specifications.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2305.01507

Country:

Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Germany > Hamburg (0.04)
(5 more...)

Genre: Research Report (0.81)

Industry: Education > Educational Setting (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

CKmeans and FCKmeans : Two deterministic initialization procedures for Kmeans algorithm using a modified crowding distance

Layeb, Abdesslem

arXiv.org Artificial IntelligenceMay-1-2023

This paper presents two novel deterministic initialization procedures for K-means clustering based on a modified crowding distance. The procedures, named CKmeans and FCKmeans, use more crowded points as initial centroids. Experimental studies on multiple datasets demonstrate that the proposed approach outperforms Kmeans and Kmeans++ in terms of clustering accuracy. The effectiveness of CKmeans and FCKmeans is attributed to their ability to select better initial centroids based on the modified crowding distance. Overall, the proposed approach provides a promising alternative for improving K-means clustering.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2304.09989

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Improving ClusterGAN Using Self-Augmented Information Maximization of Disentangling Latent Spaces

Dam, Tanmoy, Anavatti, Sreenatha G., Abbass, Hussein A.

arXiv.org Artificial IntelligenceMay-1-2023

Since their introduction in the last few years, conditional generative models have seen remarkable achievements. However, they often need the use of large amounts of labelled information. By using unsupervised conditional generation in conjunction with a clustering inference network, ClusterGAN has recently been able to achieve impressive clustering results. Since the real conditional distribution of data is ignored, the clustering inference network can only achieve inferior clustering performance by considering only uniform prior based generative samples. However, the true distribution is not necessarily balanced. Consequently, ClusterGAN fails to produce all modes, which results in sub-optimal clustering inference network performance. So, it is important to learn the prior, which tries to match the real distribution in an unsupervised way. In this paper, we propose self-augmentation information maximization improved ClusterGAN (SIMI-ClusterGAN) to learn the distinctive priors from the data directly. The proposed SIMI-ClusterGAN consists of four deep neural networks: self-augmentation prior network, generator, discriminator and clustering inference network. The proposed method has been validated using seven benchmark data sets and has shown improved performance over state-of-the art methods. To demonstrate the superiority of SIMI-ClusterGAN performance on imbalanced dataset, we have discussed two imbalanced conditions on MNIST datasets with one-class imbalance and three classes imbalanced cases. The results highlight the advantages of SIMI-ClusterGAN.

artificial intelligence, latent space, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2107.12706

Country:

North America > United States (0.28)
Oceania > Australia (0.28)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Time series clustering based on prediction accuracy of global forecasting models

Oriona, Ángel López, Manso, Pablo Montero, Fernández, José Antonio Vilar

arXiv.org Artificial IntelligenceApr-30-2023

In this paper, a novel method to perform model-based clustering of time series is proposed. The procedure relies on two iterative steps: (i) K global forecasting models are fitted via pooling by considering the series pertaining to each cluster and (ii) each series is assigned to the group associated with the model producing the best forecasts according to a particular criterion. Unlike most techniques proposed in the literature, the method considers the predictive accuracy as the main element for constructing the clustering partition, which contains groups jointly minimizing the overall forecasting error. Thus, the approach leads to a new clustering paradigm where the quality of the clustering solution is measured in terms of its predictive capability. In addition, the procedure gives rise to an effective mechanism for selecting the number of clusters in a time series database and can be used in combination with any class of regression model. An extensive simulation study shows that our method outperforms several alternative techniques concerning both clustering effectiveness and predictive accuracy. The approach is also applied to perform clustering in several datasets used as standard benchmarks in the time series literature, obtaining great results.

artificial intelligence, global model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2305.00473

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Spain > Galicia > A Coruña Province > A Coruña (0.04)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)

Add feedback