AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

CluStRE: Streaming Graph Clustering with Multi-Stage Refinement

Chhabra, Adil, Peretz, Shai Dorian, Schulz, Christian

arXiv.org Artificial IntelligenceFeb-8-2025

We present CluStRE, a novel streaming graph clustering algorithm that balances computational efficiency with high-quality clustering using multi-stage refinement. Unlike traditional in-memory clustering approaches, CluStRE processes graphs in a streaming setting, significantly reducing memory overhead while leveraging re-streaming and evolutionary heuristics to improve solution quality. Our method dynamically constructs a quotient graph, enabling modularity-based optimization while efficiently handling large-scale graphs. We introduce multiple configurations of CluStRE to provide trade-offs between speed, memory consumption, and clustering quality. Experimental evaluations demonstrate that CluStRE improves solution quality by 89.8%, operates 2.6 times faster, and uses less than two-thirds of the memory required by the state-of-the-art streaming clustering algorithm on average. Moreover, our strongest mode enhances solution quality by up to 150% on average. With this, CluStRE achieves comparable solution quality to in-memory algorithms, i.e. over 96% of the quality of clustering approaches, including Louvain, effectively bridging the gap between streaming and traditional clustering methods.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.06879

Country:

Europe > Germany (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)
Asia > Nepal (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Causal Learning for Heterogeneous Subgroups Based on Nonlinear Causal Kernel Clustering

Liu, Lu, Tang, Yang, Zhang, Kexuan, Sun, Qiyu

arXiv.org Machine LearningFeb-8-2025

Due to the challenge posed by multi-source and heterogeneous data collected from diverse environments, causal relationships among features can exhibit variations influenced by different time spans, regions, or strategies. This diversity makes a single causal model inadequate for accurately representing complex causal relationships in all observational data, a crucial consideration in causal learning. To address this challenge, the nonlinear Causal Kernel Clustering method is introduced for heterogeneous subgroup causal learning, highlighting variations in causal relationships across diverse subgroups. The main component for clustering heterogeneous subgroups lies in the construction of the $u$-centered sample mapping function with the property of unbiased estimation, which assesses the differences in potential nonlinear causal relationships in various samples and supported by causal identifiability theory. Experimental results indicate that the method performs well in identifying heterogeneous subgroups and enhancing causal learning, leading to a reduction in prediction error.

artificial intelligence, causal relationship, machine learning, (11 more...)

arXiv.org Machine Learning

2501.11622

Country:

Indian Ocean (0.04)
Asia > China > Shanghai > Shanghai (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces

Armijo, Cameron, Rivas, Pablo

arXiv.org Artificial IntelligenceFeb-8-2025

This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations of single-modal approaches in this domain. This work contributes to understanding the role of Vision Transformers in analyzing online marketplaces and offers a foundation for future advancements in detecting fraudulent or illegal activities.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2502.05756

Country: North America > United States > Texas (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Automobiles & Trucks > Parts Supplier (0.63)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Feature Explosion: a generic optimization strategy for outlier detection algorithms

Li, Qi

arXiv.org Artificial IntelligenceFeb-8-2025

Outlier detection tasks aim at discovering potential issues or opportunities and are widely used in cybersecurity, financial security, industrial inspection, etc. To date, thousands of outlier detection algorithms have been proposed. Clearly, in real-world scenarios, such a large number of algorithms is unnecessary. In other words, a large number of outlier detection algorithms are redundant. We believe the root cause of this redundancy lies in the current highly customized (i.e., non-generic) optimization strategies. Specifically, when researchers seek to improve the performance of existing outlier detection algorithms, they have to design separate optimized versions tailored to the principles of each algorithm, leading to an ever-growing number of outlier detection algorithms. To address this issue, in this paper, we introduce the explosion from physics into the outlier detection task and propose a generic optimization strategy based on feature explosion, called OSD (Optimization Strategy for outlier Detection algorithms). In the future, when improving the performance of existing outlier detection algorithms, it will be sufficient to invoke the OSD plugin without the need to design customized optimized versions for them. We compared the performances of 14 outlier detection algorithms on 24 datasets before and after invoking the OSD plugin. The experimental results show that the performances of all outlier detection algorithms are improved on almost all datasets. In terms of average accuracy, OSD make these outlier detection algorithms improve by 15% (AUC), 63.7% (AP).

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2502.05496

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing SystemsFeb-7-2025, 01:16:36 GMT

The paper starts by situating the problem and motivating their approach - essentially, enabling embeddings for weighted graphs by extending the original implicit embedding performed in Lovasz "Shannon capacity of a graph" paper, which operated on binary graphs. The paper then presents the connection between kernel machines and the Lovasz number in the unweighted case, and extends previous work for vertex-weighted, edge-weighted, and LS-labelled graphs. Section 3 provides practical details for computation, and section 4 motivates the use of their approach for the clustering problem - setting the number of clusters by using their \vartheta 1 bound, and initialising the clusters by starting by vertices with large alpha_i values. Finally, they show results on max-cut, clustering, overlapping clustering, and summarization tasks. The paper ties together very different work to propose a coherent approach to graph embedding. The contributions are clearly laid out, and the references to previous work is well established and used.

author feedback and meta-review, graph, truncation order, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.57)

Add feedback

Computing and Learning on Combinatorial Data

Zhang, Simon

arXiv.org Artificial IntelligenceFeb-7-2025

The twenty-first century is a data-driven era where human activities and behavior, physical phenomena, scientific discoveries, technology advancements, and almost everything that happens in the world resulting in massive generation, collection, and utilization of data. Connectivity in data is a crucial property. A straightforward example is the World Wide Web, where every webpage is connected to other web pages through hyperlinks, providing a form of directed connectivity. Combinatorial data refers to combinations of data items based on certain connectivity rules. Other forms of combinatorial data include social networks, meshes, community clusters, set systems, and molecules. This Ph.D. dissertation focuses on learning and computing with combinatorial data. We study and examine topological and connectivity features within and across connected data to improve the performance of learning and achieve high algorithmic efficiency.

data mining, machine learning, programming language, (25 more...)

arXiv.org Artificial Intelligence

2502.05063

Country:

North America > United States > Texas (0.27)
North America > United States > Indiana (0.27)

Genre:

Overview (1.00)
Research Report > New Finding (0.45)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Education (0.92)
Government (0.92)
(3 more...)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
(11 more...)

Add feedback

Online Robot Motion Planning Methodology Guided by Group Social Proxemics Feature

Mu, Xuan, Liu, Xiaorui, Guo, Shuai, Chi, Wenzheng, Wang, Wei, Ge, Shuzhi Sam

arXiv.org Artificial IntelligenceFeb-7-2025

Nowadays robot is supposed to demonstrate human-like perception, reasoning and behavior pattern in social or service application. However, most of the existing motion planning methods are incompatible with above requirement. A potential reason is that the existing navigation algorithms usually intend to treat people as another kind of obstacle, and hardly take the social principle or awareness into consideration. In this paper, we attempt to model the proxemics of group and blend it into the scenario perception and navigation of robot. For this purpose, a group clustering method considering both social relevance and spatial confidence is introduced. It can enable robot to identify individuals and divide them into groups. Next, we propose defining the individual proxemics within magnetic dipole model, and further established the group proxemics and scenario map through vector-field superposition. On the basis of the group clustering and proxemics modeling, we present the method to obtain the optimal observation positions (OOPs) of group. Once the OOPs grid and scenario map are established, a heuristic path is employed to generate path that guide robot cruising among the groups for interactive purpose. A series of experiments are conducted to validate the proposed methodology on the practical robot, the results have demonstrated that our methodology has achieved promising performance on group recognition accuracy and path-generation efficiency. This concludes that the group awareness evolved as an important module to make robot socially behave in the practical scenario.

artificial intelligence, machine learning, robot, (16 more...)

arXiv.org Artificial Intelligence

2502.04837

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > China > Shandong Province > Qingdao (0.06)
Asia > China > Hong Kong (0.04)
(8 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Describing Nonstationary Data Streams in Frequency Domain

Komorniczak, Joanna

arXiv.org Artificial IntelligenceFeb-7-2025

Concept drift is among the primary challenges faced by the data stream processing methods. The drift detection strategies, designed to counteract the negative consequences of such changes, often rely on analyzing the problem metafeatures. This work presents the Frequency Filtering Metadescriptor -- a tool for characterizing the data stream that searches for the informative frequency components visible in the sample's feature vector. The frequencies are filtered according to their variance across all available data batches. The presented solution is capable of generating a metadescription of the data stream, separating chunks into groups describing specific concepts on its basis, and visualizing the frequencies in the original spatial domain. The experimental analysis compared the proposed solution with two state-of-the-art strategies and with the PCA baseline in the post-hoc concept identification task. The research is followed by the identification of concepts in the real-world data streams. The generalization in the frequency domain adapted in the proposed solution allows to capture the complex feature dependencies as a reduced number of frequency components, while maintaining the semantic meaning of data.

artificial intelligence, data mining, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.04813

Country:

Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
South America > Brazil > Maranhão (0.04)
North America > United States > California (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Modular Training of Neural Networks aids Interpretability

Golechha, Satvik, Chaudhary, Maheep, Velja, Joan, Abate, Alessandro, Schoots, Nandi

arXiv.org Artificial IntelligenceFeb-6-2025

An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a "clusterability loss" function that encourages the formation of non-interacting clusters. Using automated interpretability techniques, we show that our method can help train models that are more modular and learn different, disjoint, and smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and language models. Our approach provides a promising direction for training neural networks that learn simpler functions and are easier to interpret.

artificial intelligence, machine learning, modular training, (16 more...)

arXiv.org Artificial Intelligence

2502.0247

Country:

Europe > Austria > Vienna (0.14)
South America > Colombia > Meta Department > Villavicencio (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(4 more...)

Genre: Research Report (0.83)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.40)
Health & Medicine > Therapeutic Area > Immunology (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Constant-Factor Distortion Mechanisms for $k$-Committee Election

Pulyassary, Haripriya, Swamy, Chaitanya

arXiv.org Artificial IntelligenceFeb-6-2025

In the $k$-committee election problem, we wish to aggregate the preferences of $n$ agents over a set of alternatives and select a committee of $k$ alternatives that minimizes the cost incurred by the agents. While we typically assume that agent preferences are captured by a cardinal utility function, in many contexts we only have access to ordinal information, namely the agents' rankings over the outcomes. As preference rankings are not as expressive as cardinal utilities, a loss of efficiency is inevitable, and is quantified by the notion of \emph{distortion}. We study the problem of electing a $k$-committee that minimizes the sum of the $\ell$-largest costs incurred by the agents, when agents and candidates are embedded in a metric space. This problem is called the $\ell$-centrum problem and captures both the utilitarian and egalitarian objectives. When $k \geq 2$, it is not possible to compute a bounded-distortion committee using purely ordinal information. We develop the first algorithms (that we call mechanisms) for the $\ell$-centrum problem (when $k \geq 2$), which achieve $O(1)$-distortion while eliciting only a very limited amount of cardinal information via value queries. We obtain two types of query-complexity guarantees: $O(\log k \log n)$ queries \emph{per agent}, and $O(k^2 \log^2 n)$ queries \emph{in total} (while achieving $O(1)$-distortion in both cases). En route, we give a simple adaptive-sampling algorithm for the $\ell$-centrum $k$-clustering problem.

artificial intelligence, machine learning, query, (18 more...)

arXiv.org Artificial Intelligence

2501.19148

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback