AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Selecting the Number of Communities for Weighted Degree-Corrected Stochastic Block Models

Liu, Yucheng, Li, Xiaodong

arXiv.org Machine LearningJun-7-2024

We investigate how to select the number of communities for weighted networks without a full likelihood modeling. First, we propose a novel weighted degree-corrected stochastic block model (DCSBM), in which the mean adjacency matrix is modeled as the same as in standard DCSBM, while the variance profile matrix is assumed to be related to the mean adjacency matrix through a given variance function. Our method of selection the number of communities is based on a sequential testing framework, in each step the weighed DCSBM is fitted via some spectral clustering method. A key step is to carry out matrix scaling on the estimated variance profile matrix. The resulting scaling factors can be used to normalize the adjacency matrix, from which the testing statistic is obtained. Under mild conditions on the weighted DCSBM, our proposed procedure is shown to be consistent in estimating the true number of communities. Numerical experiments on both simulated and real network data also demonstrate the desirable empirical properties of our method.

adjacency matrix, assumption 1, matrix, (14 more...)

arXiv.org Machine Learning

2406.0534

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > Yolo County > Davis (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.34)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Spectral Toolkit of Algorithms for Graphs: Technical Report (2)

Macgregor, Peter, Sun, He

arXiv.org Artificial IntelligenceJun-6-2024

Spectral Toolkit of Algorithms for Graphs (STAG) is an open-source C++ and Python library providing several methods for working with graphs and performing graph-based data analysis. In this technical report, we provide an update on the development of the STAG library. The report serves as a user's guide for the newly implemented algorithms, and gives implementation details and engineering choices made in the development of the library. The report is structured as follows: Section 2 describes the locality sensitive hashing, and the main components used in its construction. Section 3 describes the kernel density estimation, and the state-of-the-art algorithm for the kernel density estimation.

algorithm, kernel density, stag, (13 more...)

arXiv.org Artificial Intelligence

2407.07096

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Anna Karenina Strikes Again: Pre-Trained LLM Embeddings May Favor High-Performing Learners

Schleifer, Abigail Gurin, Klebanov, Beata Beigman, Ariely, Moriah, Alexandron, Giora

arXiv.org Artificial IntelligenceJun-6-2024

Unsupervised clustering of student responses to open-ended questions into behavioral and cognitive profiles using pre-trained LLM embeddings is an emerging technique, but little is known about how well this captures pedagogically meaningful information. We investigate this in the context of student responses to open-ended questions in biology, which were previously analyzed and clustered by experts into theory-driven Knowledge Profiles (KPs). Comparing these KPs to ones discovered by purely data-driven clustering techniques, we report poor discoverability of most KPs, except for the ones including the correct answers. We trace this "discoverability bias" to the representations of KPs in the pre-trained LLM embeddings space.

correct response, proceedings, student response, (16 more...)

arXiv.org Artificial Intelligence

2406.06599

Country:

Asia > Middle East > Israel (0.04)
North America > United States (0.04)
Asia > Singapore > Central Region > Singapore (0.04)

Genre: Research Report > New Finding (0.94)

Industry:

Health & Medicine (1.00)
Education > Curriculum > Subject-Specific Education (0.95)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Contrastive explainable clustering with differential privacy

Nguyen, Dung, Vetzler, Ariel, Kraus, Sarit, Vullikanti, Anil

arXiv.org Artificial IntelligenceJun-6-2024

This paper presents a novel approach in Explainable AI (XAI), integrating contrastive explanations with differential privacy in clustering methods. For several basic clustering problems, including $k$-median and $k$-means, we give efficient differential private contrastive explanations that achieve essentially the same explanations as those that non-private clustering explanations can obtain. We define contrastive explanations as the utility difference between the original clustering utility and utility from clustering with a specifically fixed centroid. In each contrastive scenario, we designate a specific data point as the fixed centroid position, enabling us to measure the impact of this constraint on clustering utility under differential privacy. Extensive experiments across various datasets show our method's effectiveness in providing meaningful explanations without significantly compromising data privacy or clustering utility. This underscores our contribution to privacy-aware machine learning, demonstrating the feasibility of achieving a balance between privacy and utility in the explanation of clustering tasks.

algorithm, contrastive explanation, explanation, (16 more...)

arXiv.org Artificial Intelligence

2406.0461

Country:

North America > United States > Virginia > Albemarle County (0.05)
Asia > Afghanistan > Parwan Province > Charikar (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)
(3 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Subspace Clustering in Wavelet Packets Domain

Kopriva, Ivica, Sersic, Damir

arXiv.org Artificial IntelligenceJun-6-2024

Subspace clustering (SC) algorithms utilize the union of subspaces model to cluster data points according to the subspaces from which they are drawn. To better address separability of subspaces and robustness to noise we propose a wavelet packet (WP) based transform domain subspace clustering. Depending on the number of resolution levels, WP yields several representations instantiated in terms of subbands. The first approach combines original and subband data into one complementary multi-view representation. Afterward, we formulate joint representation learning as a low-rank MERA tensor network approximation problem. That is motivated by the strong representation power of the MERA network to capture complex intra/inter-view dependencies in corresponding self-representation tensor. In the second approach, we use a self-stopping computationally efficient method to select the subband with the smallest clustering error on the validation set. When existing SC algorithms are applied to the chosen subband, their performance is expected to improve. Consequently, both approaches enable the re-use of SC algorithms developed so far. Improved clustering performance is due to the dual nature of subbands as representations and filters, which is essential for noise suppression. We exemplify the proposed WP domain approach to SC on the MERA tensor network and eight other well-known linear SC algorithms using six well-known image datasets representing faces, digits, and objects. Although WP domain-based SC is a linear method, it achieved clustering performance comparable with some best deep SC algorithms and outperformed many other deep SC algorithms by a significant margin. That is in particular case for the WP MERA SC algorithm. On the COIL100 dataset, it achieves an accuracy of 87.45% and outperforms the best deep SC competitor in the amount of 14.75%.

algorithm, sc algorithm, subspace, (10 more...)

arXiv.org Artificial Intelligence

2406.03819

Country:

Europe > Croatia > Zagreb County > Zagreb (0.04)
North America > United States > Texas > Bexar County > San Antonio (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(5 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.92)
Information Technology > Data Science > Data Quality > Data Transformation (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Why the Metric Backbone Preserves Community Structure

Dreveton, Maximilien, Chucri, Charbel, Grossglauser, Matthias, Thiran, Patrick

arXiv.org Artificial IntelligenceJun-6-2024

The metric backbone of a weighted graph is the union of all-pairs shortest paths. It is obtained by removing all edges $(u,v)$ that are not the shortest path between $u$ and $v$. In networks with well-separated communities, the metric backbone tends to preserve many inter-community edges, because these edges serve as bridges connecting two communities, but tends to delete many intra-community edges because the communities are dense. This suggests that the metric backbone would dilute or destroy the community structure of the network. However, this is not borne out by prior empirical work, which instead showed that the metric backbone of real networks preserves the community structure of the original network well. In this work, we analyze the metric backbone of a broad class of weighted random graphs with communities, and we formally prove the robustness of the community structure with respect to the deletion of all the edges that are not in the metric backbone. An empirical comparison of several graph sparsification techniques confirms our theoretical finding and shows that the metric backbone is an efficient sparsifier in the presence of communities.

graph, metric backbone, shortest path, (16 more...)

arXiv.org Artificial Intelligence

2406.03852

Country:

Europe > Netherlands > South Holland > Leiden (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (0.94)
Information Technology > Communications (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

Cheng, Qi, Boratko, Michael, Yelugam, Pranay Kumar, O'Gorman, Tim, Singh, Nalini, McCallum, Andrew, Li, Xiang Lorraine

arXiv.org Artificial IntelligenceJun-6-2024

Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.

correlation, dataset, evaluation, (15 more...)

arXiv.org Artificial Intelligence

2406.04145

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > China > Hong Kong (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

A multi-core periphery perspective: Ranking via relative centrality

Mukherjee, Chandra Sekhar, Zhang, Jiapeng

arXiv.org Machine LearningJun-6-2024

Community and core-periphery are two widely studied graph structures, with their coexistence observed in real-world graphs (Rombach, Porter, Fowler \& Mucha [SIAM J. App. Math. 2014, SIAM Review 2017]). However, the nature of this coexistence is not well understood and has been pointed out as an open problem (Yanchenko \& Sengupta [Statistics Surveys, 2023]). Especially, the impact of inferring the core-periphery structure of a graph on understanding its community structure is not well utilized. In this direction, we introduce a novel quantification for graphs with ground truth communities, where each community has a densely connected part (the core), and the rest is more sparse (the periphery), with inter-community edges more frequent between the peripheries. Built on this structure, we propose a new algorithmic concept that we call relative centrality to detect the cores. We observe that core-detection algorithms based on popular centrality measures such as PageRank and degree centrality can show some bias in their outcome by selecting very few vertices from some cores. We show that relative centrality solves this bias issue and provide theoretical and simulation support, as well as experiments on real-world graphs. Core detection is known to have important applications with respect to core-periphery structures. In our model, we show a new application: relative-centrality-based algorithms can select a subset of the vertices such that it contains sufficient vertices from all communities, and points in this subset are better separable into their respective communities. We apply the methods to 11 biological datasets, with our methods resulting in a more balanced selection of vertices from all communities such that clustering algorithms have better performance on this set.

algorithm, centrality, vertex, (16 more...)

arXiv.org Machine Learning

2406.04487

Country:

North America > United States > California (0.14)
Europe > Netherlands > South Holland > Leiden (0.04)

Genre: Research Report (0.63)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Robust Prediction Model for Multidimensional and Unbalanced Datasets

Thakar, Pooja, Mehta, Anil, Manisha, null

arXiv.org Artificial IntelligenceJun-5-2024

Data Mining is a promising field and is applied in multiple domains for its predictive capabilities. Data in the real world cannot be readily used for data mining as it suffers from the problems of multidimensionality, unbalance and missing values. It is difficult to use its predictive capabilities by novice users. It is difficult for a beginner to find the relevant set of attributes from a large pool of data available. The paper presents a Robust Prediction Model that finds a relevant set of attributes; resolves the problems of unbalanced and multidimensional real-life datasets and helps in finding patterns for informed decision making. Model is tested upon five different datasets in the domain of Health Sector, Education, Business and Fraud Detection. The results showcase the robust behaviour of the model and its applicability in various domains.

classification, dataset, weighted mean precision, (10 more...)

arXiv.org Artificial Intelligence

2406.03507

Country:

Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
Asia > India > Rajasthan > Jaipur (0.04)
North America > United States > California > Orange County > Irvine (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Add feedback

Dynamic Spectral Clustering with Provable Approximation Guarantee

Laenen, Steinar, Sun, He

arXiv.org Artificial IntelligenceJun-5-2024

This paper studies clustering algorithms for dynamically evolving graphs $\{G_t\}$, in which new edges (and potential new vertices) are added into a graph, and the underlying cluster structure of the graph can gradually change. The paper proves that, under some mild condition on the cluster-structure, the clusters of the final graph $G_T$ of $n_T$ vertices at time $T$ can be well approximated by a dynamic variant of the spectral clustering algorithm. The algorithm runs in amortised update time $O(1)$ and query time $o(n_T)$. Experimental studies on both synthetic and real-world datasets further confirm the practicality of our designed algorithm.

algorithm, graph, vertex, (13 more...)

arXiv.org Artificial Intelligence

2406.03152

Country:

Europe > Austria > Vienna (0.14)
Europe > United Kingdom (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback