AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Text Classification: A Review, Empirical, and Experimental Evaluation

Taha, Kamal, Yoo, Paul D., Yeun, Chan, Taha, Aya

arXiv.org Artificial IntelligenceJan-11-2024

The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize algorithms for text classification into broad classes, which can lead to the misclassification of unrelated algorithms and incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy includes methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to utilize this methodological taxonomy for classifying algorithms for text classification. Furthermore, our study also conducts empirical evaluation and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and categories

algorithm, classification, text classification, (15 more...)

arXiv.org Artificial Intelligence

2401.12982

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
(11 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.45)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education (1.00)
Information Technology > Services (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Add feedback

Learning from Semi-Factuals: A Debiased and Semantic-Aware Framework for Generalized Relation Discovery

Wang, Jiaxin, Zhang, Lingling, Liu, Jun, Guo, Tianlin, Wu, Wenjun

arXiv.org Artificial IntelligenceJan-11-2024

We introduce a novel task, called Generalized Relation Discovery (GRD), for open-world relation extraction. GRD aims to identify unlabeled instances in existing pre-defined relations or discover novel relations by assigning instances to clusters as well as providing specific meanings for these clusters. The key challenges of GRD are how to mitigate the serious model biases caused by labeled pre-defined relations to learn effective relational representations and how to determine the specific semantics of novel relations during classifying or clustering unlabeled instances. We then propose a novel framework, SFGRD, for this task to solve the above issues by learning from semi-factuals in two stages. The first stage is semi-factual generation implemented by a tri-view debiased relation representation module, in which we take each original sentence as the main view and design two debiased views to generate semi-factual examples for this sentence. The second stage is semi-factual thinking executed by a dual-space tri-view collaborative relation learning module, where we design a cluster-semantic space and a class-index space to learn relational semantics and relation label indices, respectively. In addition, we devise alignment and selection strategies to integrate two spaces and establish a self-supervised learning loop for unlabeled data by doing semi-factual thinking across three views. Extensive experimental results show that SFGRD surpasses state-of-the-art models in terms of accuracy by 2.36\% $\sim$5.78\% and cosine similarity by 32.19\%$\sim$ 84.45\% for relation label index and relation semantic quality, respectively. To the best of our knowledge, we are the first to exploit the efficacy of semi-factuals in relation extraction.

computational linguistic, novel relation, relation, (15 more...)

arXiv.org Artificial Intelligence

2401.06327

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Shaanxi Province > Xi'an (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(15 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.68)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU

Sao, Piyush, Prokopenko, Andrey, Lebrun-Grandié, Damien

arXiv.org Artificial IntelligenceJan-11-2024

This paper presents \pandora, a novel parallel algorithm for efficiently constructing dendrograms for single-linkage hierarchical clustering, including \hdbscan. Traditional dendrogram construction methods from a minimum spanning tree (MST), such as agglomerative or divisive techniques, often fail to efficiently parallelize, especially with skewed dendrograms common in real-world data. \pandora addresses these challenges through a unique recursive tree contraction method, which simplifies the tree for initial dendrogram construction and then progressively reconstructs the complete dendrogram. This process makes \pandora asymptotically work-optimal, independent of dendrogram skewness. All steps in \pandora are fully parallel and suitable for massively threaded accelerators such as GPUs. Our implementation is written in Kokkos, providing support for both CPUs and multi-vendor GPUs (e.g., Nvidia, AMD). The multithreaded version of \pandora is 2.2$\times$ faster than the current best-multithreaded implementation, while the GPU \pandora implementation achieved 6-20$\times$ on \amdgpu and 10-37$\times$ on \nvidiagpu speed-up over multithreaded \pandora. These advancements lead to up to a 6-fold speedup for \hdbscan on GPUs over the current best, which only offload MST construction to GPUs and perform multithreaded dendrogram construction.

algorithm, contraction, dendrogram, (16 more...)

arXiv.org Artificial Intelligence

2401.06089

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Saudi Arabia (0.04)
Europe > United Kingdom > England > Tyne and Wear > Sunderland (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Energy (0.68)
Health & Medicine (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Diversity-aware clustering: Computational Complexity and Approximation Algorithms

Thejaswi, Suhas, Gadekar, Ameet, Ordozgoiti, Bruno, Gionis, Aristides

arXiv.org Artificial IntelligenceJan-10-2024

Diversity is an essential design choice across numerous real-world contexts, spanning social environments [1], organizational structures [2], and demographic studies [3]. Embracing diversity entails acknowledging and incorporating multifaceted characteristics within groups. This concept holds profound relevance in addressing real-world challenges, particularly in scenarios where intersectionality -- the interconnected nature of social categorizations such as gender, ethnicity, religion, socio-economic status and sexual orientation -- plays a pivotal role [4, 5]. Consider the task of constituting a representative committee that accurately mirrors the demography of a broader population. In the pursuit of diversifying, and recognizing its significance in the context of fairness, it is imperative to ensure representation from various groups based on their gender, ethnicity, and economic status, among other [6]. In reality, individuals belong to multiple social categories, for example, a person could be a woman of a specific ethnic background and economic group.

algorithm, approximation algorithm, div-k-median, (15 more...)

arXiv.org Artificial Intelligence

2401.05502

Country:

North America > United States > New York (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
Europe > Germany (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

The recursive scheme of clustering

Miniak-Górecka, Alicja, Podlaski, Krzysztof, Gwizdałła, Tomasz

arXiv.org Artificial IntelligenceJan-10-2024

The problem of data clustering is one of the most important in data analysis. It can be problematic when dealing with experimental data characterized by measurement uncertainties and errors. Our paper proposes a recursive scheme for clustering data obtained in geographical (climatological) experiments. The discussion of results obtained by k-means and SOM methods with the developed recursive procedure is presented. We show that the clustering using the new approach gives more acceptable results when compared to experts assessments.

clustering, dataset, recursive scheme, (13 more...)

arXiv.org Artificial Intelligence

2401.05479

Country:

Europe > Poland > Łódź Province > Łódź (0.05)
North America > United States > California > Orange County > Irvine (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(5 more...)

Genre: Research Report (0.64)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Grimoire is All You Need for Enhancing Large Language Models

Chen, Ding, Song, Shichao, Yu, Qingchen, Li, Zhiyu, Wang, Wenjin, Xiong, Feiyu, Tang, Bo

arXiv.org Artificial IntelligenceJan-10-2024

In-context Learning (ICL) is one of the key methods for enhancing the performance of large language models on specific tasks by providing a set of few-shot examples. However, the ICL capability of different types of models shows significant variation due to factors such as model architecture, volume of learning data, and the size of parameters. Generally, the larger the model's parameter size and the more extensive the learning data, the stronger its ICL capability. In this paper, we propose a method SLEICL that involves learning from examples using strong language models and then summarizing and transferring these learned skills to weak language models for inference and application. This ensures the stability and effectiveness of ICL. Compared to directly enabling weak language models to learn from prompt examples, SLEICL reduces the difficulty of ICL for these models. Our experiments, conducted on up to eight datasets with five language models, demonstrate that weak language models achieve consistent improvement over their own zero-shot or few-shot capabilities using the SLEICL method. Some weak language models even surpass the performance of GPT4-1106-preview (zero-shot) with the aid of SLEICL.

grimoire, language model, weak language model, (14 more...)

arXiv.org Artificial Intelligence

2401.03385

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(8 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A general theory for robust clustering via trimmed mean

Jana, Soham, Fan, Jianqing, Kulkarni, Sanjeev

arXiv.org Machine LearningJan-10-2024

Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data. Many recent results focus primarily on optimal mislabeling guarantees, when data are distributed around centroids with sub-Gaussian errors. Yet, the restrictive sub-Gaussian model is often invalid in practice, since various real-world applications exhibit heavy tail distributions around the centroids or suffer from possible adversarial attacks that call for robust clustering with a robust data-driven initialization. In this paper, we introduce a hybrid clustering technique with a novel multivariate trimmed mean type centroid estimate to produce mislabeling guarantees under a weak initialization condition for general error distributions around the centroids. A matching lower bound is derived, up to factors depending on the number of clusters. In addition, our approach also produces the optimal mislabeling even in the presence of adversarial outliers. Our results reduce to the sub-Gaussian case when errors follow sub-Gaussian distributions. To solve the problem thoroughly, we also present novel data-driven robust initialization techniques and show that, with probabilities approaching one, these initial centroid estimates are sufficiently good for the subsequent clustering algorithm to achieve the optimal mislabeling rates. Furthermore, we demonstrate that the Lloyd algorithm is suboptimal for more than two clusters even when errors are Gaussian, and for two clusters when errors distributions have heavy tails. Both simulated data and real data examples lend further support to both of our robust initialization procedure and clustering algorithm.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

2401.05574

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Switzerland > Neuchâtel > Neuchâtel (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
Africa > South Africa (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Towards the mathematical foundation of the minimum enclosing ball and related problems

Vrahatis, Michael N.

arXiv.org Artificial IntelligenceJan-9-2024

Theoretical background is provided towards the mathematical foundation of the minimum enclosing ball problem. This problem concerns the determination of the unique spherical surface of smallest radius enclosing a given bounded set in the d-dimensional Euclidean space. The study of several problems that are similar or related to the minimum enclosing ball problem has received a considerable impetus from the large amount of applications of these problems in various fields of science and technology. The proposed theoretical framework is based on several enclosing (covering) and partitioning (clustering) theorems and provides among others bounds and relations between the circumradius, inradius, diameter and width of a set. These enclosing and partitioning theorems are considered as cornerstones in the field that strongly influencing developments and generalizations to other spaces and non-Euclidean geometries.

jung, math, theorem, (15 more...)

arXiv.org Artificial Intelligence

2402.06629

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
North America > United States > New York > New York County > New York City (0.14)
Europe > Germany (0.04)
(18 more...)

Genre: Research Report (0.81)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)

Add feedback

Identifying Best Practice Melting Patterns in Induction Furnaces: A Data-Driven Approach Using Time Series KMeans Clustering and Multi-Criteria Decision Making

Howard, Daniel Anthony, Jørgensen, Bo Nørregaard, Ma, Zheng

arXiv.org Artificial IntelligenceJan-9-2024

Improving energy efficiency in industrial production processes is crucial for competitiveness, and compliance with climate policies. This paper introduces a data-driven approach to identify optimal melting patterns in induction furnaces. Through time-series K-means clustering the melting patterns could be classified into distinct clusters based on temperature profiles. Using the elbow method, 12 clusters were identified, representing the range of melting patterns. Performance parameters such as melting time, energy-specific performance, and carbon cost were established for each cluster, indicating furnace efficiency and environmental impact. Multiple criteria decision-making methods including Simple Additive Weighting, Multiplicative Exponential Weighting, Technique for Order of Preference by Similarity to Ideal Solution, modified TOPSIS, and VlseKriterijumska Optimizacija I Kompromisno Resenje were utilized to determine the best-practice cluster. The study successfully identified the cluster with the best performance. Implementing the best practice operation resulted in an 8.6 % reduction in electricity costs, highlighting the potential energy savings in the foundry.

energy efficiency, furnace, opération, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-48649-4_16

2401.04751

Country:

Europe > Denmark > Southern Denmark (0.05)
South America > Brazil > São Paulo > Campinas (0.04)
Europe > Northern Europe (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Energy > Power Industry (1.00)
Law > Environmental Law (0.67)
Materials > Metals & Mining > Iron (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)

Add feedback

Masked AutoEncoder for Graph Clustering without Pre-defined Cluster Number k

Ma, Yuanchi, He, Hui, Lei, Zhongxiang, Niu, Zhendong

arXiv.org Artificial IntelligenceJan-9-2024

Graph clustering algorithms with autoencoder structures have recently gained popularity due to their efficient performance and low training cost. However, for existing graph autoencoder clustering algorithms based on GCN or GAT, not only do they lack good generalization ability, but also the number of clusters clustered by such autoencoder models is difficult to determine automatically. To solve this problem, we propose a new framework called Graph Clustering with Masked Autoencoders (GCMA). It employs our designed fusion autoencoder based on the graph masking method for the fusion coding of graph. It introduces our improved density-based clustering algorithm as a second decoder while decoding with multi-target reconstruction. By decoding the mask embedding, our model can capture more generalized and comprehensive knowledge. The number of clusters and clustering results can be output end-to-end while improving the generalization ability. As a nonparametric class method, extensive experiments demonstrate the superiority of \textit{GCMA} over state-of-the-art baselines.

algorithm, graph, information, (15 more...)

arXiv.org Artificial Intelligence

2401.04741

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback