AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Clustering Aggregation as Maximum-Weight Independent Set

Nan Li, Longin J. Latecki

Neural Information Processing SystemsFeb-11-2025, 17:55:15 GMT

We formulate clustering aggregation as a special instance of Maximum-Weight Independent Set (MWIS) problem. For a given dataset, an attributed graph is constructed from the union of the input clusterings generated by different underlying clustering algorithms with different parameters. The vertices, which represent the distinct clusters, are weighted by an internal index measuring both cohesion and separation. The edges connect the vertices whose corresponding clusters overlap. Intuitively, an optimal aggregated clustering can be obtained by selecting an optimal subset of non-overlapping clusters partitioning the dataset together.

artificial intelligence, data mining, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

URECA: The Chain of Two Minimum Set Cover Problems exists behind Adaptation to Shifts in Semantic Code Search

Choi, Seok-Ung, Hahn, Joonghyuk, Han, Yo-Sub

arXiv.org Artificial IntelligenceFeb-11-2025

Adaptation is to make model learn the patterns shifted from the training distribution. In general, this adaptation is formulated as the minimum entropy problem. However, the minimum entropy problem has inherent limitation -- shifted initialization cascade phenomenon. We extend the relationship between the minimum entropy problem and the minimum set cover problem via Lebesgue integral. This extension reveals that internal mechanism of the minimum entropy problem ignores the relationship between disentangled representations, which leads to shifted initialization cascade. From the analysis, we introduce a new clustering algorithm, Union-find based Recursive Clustering Algorithm~(URECA). URECA is an efficient clustering algorithm for the leverage of the relationships between disentangled representations. The update rule of URECA depends on Thresholdly-Updatable Stationary Assumption to dynamics as a released version of Stationary Assumption. This assumption helps URECA to transport disentangled representations with no errors based on the relationships between disentangled representations. URECA also utilize simulation trick to efficiently cluster disentangled representations. The wide range of evaluations show that URECA achieves consistent performance gains for the few-shot adaptation to diverse types of shifts along with advancement to State-of-The-Art performance in CoSQA in the scenario of query shift.

artificial intelligence, disentangled representation, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.07494

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.74)

Add feedback

A statistically consistent measure of Semantic Variability using Language Models

Liu, Yi

arXiv.org Artificial IntelligenceFeb-11-2025

One of the key applications of large language models (LLMs), with temperature generative models that has garnered significant interest set to 0 and top p set to 1, across eight common is the development of specialized chatbots tasks and five identical trials per task. The study with domain-specific expertise such as legal and aimed to assess the repeatability of model outputs healthcare (Lexis; Mesko, 2023). These applications by examining whether the generated strings were illustrate how generative models can improve consistent between runs. The authors found that decision-making and improve the efficiency of professional none of the LLMs demonstrated consistent performance services in specialized fields.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.00507

Country:

North America > United States > Washington > King County > Seattle (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report (0.84)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

DISCOVER: Data-driven Identification of Sub-activities via Clustering and Visualization for Enhanced Activity Recognition in Smart Homes

Karpekov, Alexander, Chernova, Sonia, Plötz, Thomas

arXiv.org Artificial IntelligenceFeb-11-2025

Human Activity Recognition (HAR) using ambient sensors has great potential for practical applications, particularly in elder care and independent living. However, deploying HAR systems in real-world settings remains challenging due to the high cost of labeled data, the need for pre-segmented sensor streams, and the lack of flexibility in activity granularity. To address these limitations, we introduce DISCOVER, a method designed to discover fine-grained human sub-activities from unlabeled sensor data without relying on pre-segmentation. DISCOVER combines unsupervised feature extraction and clustering with a user-friendly visualization tool to streamline the labeling process. DISCOVER enables domain experts to efficiently annotate only a minimal set of representative cluster centroids, reducing the annotation workload to a small number of samples (0.05% of our dataset). We demonstrate DISCOVER's effectiveness through a re-annotation exercise on widely used HAR datasets, showing that it uncovers finer-grained activities and produces more nuanced annotations than traditional coarse labels. DISCOVER represents a step toward practical, deployable HAR systems that adapt to diverse real environments.

dataset, sequence, up-to-date information, (15 more...)

arXiv.org Artificial Intelligence

2503.01733

Country:

North America > Aruba (0.06)
Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.06)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology > Smart Houses & Appliances (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

Add feedback

Streaming, Memory Limited Algorithms for Community Detection

Se-Young Yun, marc lelarge, Alexandre Proutiere

Neural Information Processing SystemsFeb-10-2025, 01:08:34 GMT

In this paper, we consider sparse networks consisting of a finite number of nonoverlapping communities, i.e. disjoint clusters, so that there is higher density within clusters than across clusters. Both the intra-and inter-cluster edge densities vanish when the size of the graph grows large, making the cluster reconstruction problem nosier and hence difficult to solve. We are interested in scenarios where the network size is very large, so that the adjacency matrix of the graph is hard to manipulate and store. The data stream model in which columns of the adjacency matrix are revealed sequentially constitutes a natural framework in this setting. For this model, we develop two novel clustering algorithms that extract the clusters asymptotically accurately. The first algorithm is offline, as it needs to store and keep the assignments of nodes to clusters, and requires a memory that scales linearly with the network size. The second algorithm is online, as it may classify a node when the corresponding column is revealed and then discard this information. This algorithm requires a memory growing sub-linearly with the network size. To construct these efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information, where only a small proportion of the columns of the adjacency matrix is observed and develop, for this setting, a new spectral algorithm which is of independent interest.

artificial intelligence, data mining, machine learning, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Redondo Beach (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.72)

Add feedback

Tight Continuous Relaxation of the Balanced k-Cut Problem

Syama Sundar Rangapuram, Pramod Kaushik Mudrakarta, Matthias Hein

Neural Information Processing SystemsFeb-10-2025, 00:30:42 GMT

Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or heuristics which have weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show that a related recently proposed relaxation is in most cases loose leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced k-cut criteria.

artificial intelligence, constraint, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
Europe > Germany > Saarland > Saarbrücken (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

DROP: Poison Dilution via Knowledge Distillation for Federated Learning

Syros, Georgios, Suri, Anshuman, Koushanfar, Farinaz, Nita-Rotaru, Cristina, Oprea, Alina

arXiv.org Artificial IntelligenceFeb-10-2025

Federated Learning is vulnerable to adversarial manipulation, where malicious clients can inject poisoned updates to influence the global model's behavior. While existing defense mechanisms have made notable progress, they fail to protect against adversaries that aim to induce targeted backdoors under different learning and attack configurations. To address this limitation, we introduce DROP (Distillation-based Reduction Of Poisoning), a novel defense mechanism that combines clustering and activity-tracking techniques with extraction of benign behavior from clients via knowledge distillation to tackle stealthy adversaries that manipulate low data poisoning rates and diverse malicious client ratios within the federation. Through extensive experimentation, our approach demonstrates superior robustness compared to existing defenses across a wide range of learning configurations. Finally, we evaluate existing defenses and our method under the challenging setting of non-IID client data distribution and highlight the challenges of designing a resilient FL defense in this setting.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.07011

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

evclust: Python library for evidential clustering

Soubeiga, Armel, Antoine, Violaine

arXiv.org Artificial IntelligenceFeb-10-2025

A recent developing trend in clustering is the advancement of algorithms that not only identify clusters within data, but also express and capture the uncertainty of cluster membership. Evidential clustering addresses this by using the Dempster-Shafer theory of belief functions, a framework designed to manage and represent uncertainty. This approach results in a credal partition, a structured set of mass functions that quantify the uncertain assignment of each object to potential groups. The Python framework evclust, presented in this paper, offers a suite of efficient evidence clustering algorithms as well as tools for visualizing, evaluating and analyzing credal partitions.

artificial intelligence, evclust, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2502.06587

Country:

North America > United States > New York (0.04)
Europe > France > Auvergne-Rhône-Alpes > Puy-de-Dôme > Clermont-Ferrand (0.04)

Genre:

Overview (0.46)
Research Report (0.40)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

Spalding-Jamieson, Jack, Robson, Eliot Wong, Zheng, Da Wei

arXiv.org Machine LearningFeb-10-2025

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.

information retrieval, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

2502.06163

Country:

North America > United States > Illinois (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Czechia > Prague (0.04)
(22 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Framework for Supervised and Unsupervised Segmentation and Classification of Materials Microstructure Images

Zhang, Kungang, Apley, Daniel W., Chen, Wei, Liu, Wing K., Brinson, L. Catherine

arXiv.org Machine LearningFeb-10-2025

Microstructure of materials is often characterized through image analysis to understand processing-structure-properties linkages. We propose a largely automated framework that integrates unsupervised and supervised learning methods to classify micrographs according to microstructure phase/class and, for multiphase microstructures, segments them into different homogeneous regions. With the advance of manufacturing and imaging techniques, the ultra-high resolution of imaging that reveals the complexity of microstructures and the rapidly increasing quantity of images (i.e., micrographs) enables and necessitates a more powerful and automated framework to extract materials characteristics and knowledge. The framework we propose can be used to gradually build a database of microstructure classes relevant to a particular process or group of materials, which can help in analyzing and discovering/identifying new materials. The framework has three steps: (1) segmentation of multiphase micrographs through a recently developed score-based method so that different microstructure homogeneous regions can be identified in an unsupervised manner; (2) {identification and classification of} homogeneous regions of micrographs through an uncertainty-aware supervised classification network trained using the segmented micrographs from Step $1$ with their identified labels verified via the built-in uncertainty quantification and minimal human inspection; (3) supervised segmentation (more powerful than the segmentation in Step $1$) of multiphase microstructures through a segmentation network trained with micrographs and the results from Steps $1$-$2$ using a form of data augmentation. This framework can iteratively characterize/segment new homogeneous or multiphase materials while expanding the database to enhance performance. The framework is demonstrated on various sets of materials and texture images.

artificial intelligence, machine learning, micrograph, (16 more...)

arXiv.org Machine Learning

2502.07107

Country:

Asia > Middle East > Jordan (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre:

Research Report (0.82)
Workflow (0.52)

Industry:

Materials (0.88)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback