AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
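To make the definition concrete, here is a minimal sketch of one well-known clustering method, k-means (Lloyd's algorithm), using only the Python standard library. The choice of k-means, the function names, and the toy points are illustrative assumptions; the definition above does not prescribe any particular algorithm or similarity measure.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Group points into k clusters by alternating assignment and update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Here "similar" means "close in squared Euclidean distance"; other algorithms replace that notion with density, graph connectivity, or a learned kernel.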

Hypergraphs as Weighted Directed Self-Looped Graphs: Spectral Properties, Clustering, Cheeger Inequality

Li, Zihao, Fu, Dongqi, Liu, Hengyu, He, Jingrui

arXiv.org Artificial Intelligence, Oct-23-2024

Hypergraphs naturally arise when studying group relations and have been widely used in the field of machine learning. There has not been a unified formulation of hypergraphs, yet the recently proposed edge-dependent vertex weights (EDVW) modeling is one of the most general: to the best of our knowledge, most existing hypergraphs can be formulated as EDVW hypergraphs without any information loss. However, the relevant algorithmic developments on EDVW hypergraphs remain nascent: compared to spectral graph theories, the formulations are incomplete, the spectral clustering algorithms are not well-developed, and one result regarding the hypergraph Cheeger Inequality is even incorrect. To this end, deriving a unified random walk-based formulation, we propose our definitions of hypergraph Rayleigh Quotient, NCut, boundary/cut, volume, and conductance, which are consistent with the corresponding definitions on graphs. Then, we prove that the normalized hypergraph Laplacian is associated with the NCut value, which inspires our HyperClus-G algorithm for spectral clustering on EDVW hypergraphs. Finally, we prove that HyperClus-G can always find an approximately linearly optimal partitioning in terms of both NCut and conductance. Additionally, we provide extensive experiments to validate our theoretical findings from an empirical perspective.
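For orientation, the sketch below computes the standard cut, volume, conductance, and NCut of a two-way partition on an ordinary weighted undirected graph, which are the graph-side quantities the abstract says its hypergraph definitions are consistent with. It does not implement the paper's EDVW hypergraph definitions or HyperClus-G; the function name and the example graph are hypothetical.

```python
def cut_metrics(edges, part):
    """Cut, volumes, conductance and NCut for a two-way vertex partition.

    edges: dict mapping an undirected edge (u, v) to a positive weight.
    part:  set of vertices on one side; all other vertices form the rest.
    """
    degree = {}
    cut = 0.0
    for (u, v), w in edges.items():
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
        if (u in part) != (v in part):
            cut += w  # the edge crosses the partition boundary
    vol_s = sum(d for x, d in degree.items() if x in part)
    vol_rest = sum(d for x, d in degree.items() if x not in part)
    if vol_s == 0 or vol_rest == 0:
        raise ValueError("both sides of the partition need positive volume")
    return {
        "cut": cut,
        "vol_s": vol_s,
        "vol_rest": vol_rest,
        "conductance": cut / min(vol_s, vol_rest),
        "ncut": cut / vol_s + cut / vol_rest,
    }


# Hypothetical example: two triangles joined by a single bridge edge (2, 3).
edges = {
    (0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
    (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
    (2, 3): 1.0,
}
print(cut_metrics(edges, {0, 1, 2}))
```

Splitting at the bridge cuts one unit of weight against a volume of 7 on each side, so both conductance and NCut are small, which is what a spectral method tries to achieve.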

artificial intelligence, data mining, machine learning, (17 more...)

2411.03331

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(25 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

metasnf: Meta Clustering with Similarity Network Fusion in R

Velayudhan, Prashanth S, Xu, Xiaoqiao, Kallurkar, Prajkta, Balbon, Ana Patricia, Secara, Maria T, Taback, Adam, Sabac, Denise, Chan, Nicholas, Ma, Shihao, Wang, Bo, Felsky, Daniel, Ameis, Stephanie H, Cox, Brian, Hawco, Colin, Erdman, Lauren, Wheeler, Anne L

arXiv.org Artificial Intelligence, Oct-23-2024

metasnf is an R package that enables users to apply meta clustering, a method for efficiently searching a broad space of cluster solutions by clustering the solutions themselves, to clustering workflows based on similarity network fusion (SNF). SNF is a multi-modal data integration algorithm commonly used for biomedical subtype discovery. The package also contains functions to assist with cluster visualization, characterization, and validation. This package can help researchers identify SNF-derived cluster solutions that are guided by context-specific utility over context-agnostic measures of quality.

artificial intelligence, machine learning, matrix, (16 more...)

2410.17976

Country:

North America > Canada > Ontario > Toronto (0.31)
North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > Experimental Study (0.47)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.95)
Health & Medicine > Health Care Providers & Services (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

An Adaptive Framework for Generating Systematic Explanatory Answer in Online Q&A Platforms

Chen, Ziyang, Wang, Xiaobin, Jiang, Yong, Liao, Jinzhi, Xie, Pengjun, Huang, Fei, Zhao, Xiang

arXiv.org Artificial Intelligence, Oct-23-2024

Question Answering (QA) systems face challenges in handling complex questions that require multi-domain knowledge synthesis. Naive RAG models, although effective in information retrieval, struggle with complex questions that require comprehensive and in-depth answers. We define this pioneering task as explanatory answer generation, which entails handling identified challenges such as the requirement for comprehensive information and logical coherence within the generated context. To address these issues, we refer to systematic thinking theory and propose SynthRAG, an innovative framework designed to enhance QA performance. SynthRAG improves on conventional models by employing adaptive outlines for dynamic content structuring, generating systematic information to ensure detailed coverage, and producing customized answers tailored to specific user inquiries. This structured approach guarantees logical coherence and thorough integration of information, yielding responses that are both insightful and methodically organized. Empirical evaluations underscore SynthRAG's effectiveness, demonstrating its superiority in handling complex questions, overcoming the limitations of naive RAG models, and significantly improving answer quality and depth. Furthermore, an online deployment on the Zhihu platform revealed that SynthRAG's answers achieved notable user engagement, with each response averaging 5.73 upvotes and surpassing the performance of 79.8% of human contributors, highlighting the practical relevance and impact of the proposed framework. Our code is available at https://github.com/czy1999/SynthRAG.

large language model, machine learning, question answering, (18 more...)

2410.17694

Country:

Europe > Austria > Vienna (0.14)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > Middle East > Jordan (0.04)
(11 more...)

Genre: Research Report (0.82)

Industry: Materials > Metals & Mining > Gold (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.90)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Dynamic User Grouping based on Location and Heading in 5G NR Systems

Pjanić, Dino, Arslantürk, Korkut Emre, Cai, Xuesong, Tufvesson, Fredrik

arXiv.org Artificial Intelligence, Oct-22-2024

User grouping based on geographic location in fifth generation (5G) New Radio (NR) systems has several applications that can significantly improve network performance, user experience, and service delivery. We demonstrate how Sounding Reference Signals channel fingerprints can be used for dynamic user grouping in a 5G NR commercial deployment based on outdoor positions and heading direction employing machine learning methods such as neural networks combined with clustering methods.

antenna, artificial intelligence, machine learning, (15 more...)

2410.19854

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Sweden > Skåne County > Lund (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

MIK: Modified Isolation Kernel for Biological Sequence Visualization, Classification, and Clustering

Ali, Sarwan, Chourasia, Prakash, Mansoor, Haris, Koirala, Bipin, Patterson, Murray

arXiv.org Machine Learning, Oct-21-2024

The t-Distributed Stochastic Neighbor Embedding (t-SNE) has emerged as a popular dimensionality reduction technique for visualizing high-dimensional data. By default, it computes pairwise similarities between data points using an RBF kernel and starts from a random initialization in the low-dimensional space, which successfully captures the overall structure but may struggle to preserve the local structure efficiently. This research proposes a novel approach called the Modified Isolation Kernel (MIK) as an alternative to the Gaussian kernel, which is built upon the concept of the Isolation Kernel. MIK uses adaptive density estimation to capture local structures more accurately and integrates robustness measures. It also assigns higher similarity values to nearby points and lower values to distant points. The proposed approach (MIK) is assessed through comparative experiments using the standard Gaussian kernel, the isolation kernel, and several initialization techniques, including random, PCA, and random walk initializations. Additionally, we compare the computational efficiency of all three kernels with three different initialization methods. Our experimental results demonstrate several advantages of the proposed kernel (MIK) and initialization method selection. It exhibits improved preservation of the local and global structure and enables better visualization of clusters and subclusters in the embedded space. These findings contribute to advancing dimensionality reduction techniques and provide researchers and practitioners with an effective tool for data exploration, visualization, and analysis in various domains.
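As a point of reference for the Gaussian (RBF) baseline mentioned above, the following sketch computes row-normalized Gaussian similarities between points, in the style of the input affinities used by t-SNE. It is a simplification under stated assumptions: a single fixed bandwidth `sigma` is used, whereas t-SNE tunes a bandwidth per point from a perplexity target, and nothing here implements MIK or the Isolation Kernel. The function name and sample points are hypothetical.

```python
import math


def gaussian_affinities(points, sigma=1.0):
    """Row-normalized Gaussian similarities p(j | i) with a fixed bandwidth."""
    n = len(points)
    rows = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point is not its own neighbor
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


# Hypothetical points: the second is near the first, the third is far away.
P = gaussian_affinities([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
for row in P:
    print([round(p, 4) for p in row])
```

Nearby points receive most of each row's probability mass and distant points almost none, which is the behavior any replacement kernel is compared against.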

artificial intelligence, data mining, machine learning, (18 more...)

2410.15688

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
Asia > Pakistan > Punjab > Lahore Division > Lahore (0.04)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.70)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions

Turishcheva, Polina, Hansel, Laura, Ritzert, Martin, Weis, Marissa A., Ecker, Alexander S.

arXiv.org Machine Learning, Oct-21-2024

Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the impact of dimensionality on clustering. Preliminary benchmarks of common clustering algorithms on MNIST-Nd suggest that Leiden is the most robust as dimensionality grows.

artificial intelligence, dataset, machine learning, (18 more...)

2410.16124

Country:

Europe > Netherlands > South Holland > Leiden (0.26)
Europe > Germany > Lower Saxony > Göttingen (0.15)

Genre: Research Report (0.40)

Industry: Government (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Spatio-temporal Multivariate Cluster Evolution Analysis for Detecting and Tracking Climate Impacts

Davis, Warren L. IV, Carlson, Max, Tezaur, Irina, Bull, Diana, Peterson, Kara, Swiler, Laura

arXiv.org Artificial Intelligence, Oct-21-2024

Recent years have seen a growing concern about climate change and its impacts. While Earth System Models (ESMs) can be invaluable tools for studying the impacts of climate change, the complex coupling processes encoded in ESMs and the large amounts of data produced by these models, together with the high internal variability of the Earth system, can obscure important source-to-impact relationships. This paper presents a novel and efficient unsupervised data-driven approach for detecting statistically significant impacts and tracing spatio-temporal source-impact pathways in the climate through a unique combination of ideas from anomaly detection, clustering, and Natural Language Processing (NLP). Using the 1991 eruption of Mount Pinatubo in the Philippines as an exemplar, we demonstrate that the proposed approach is capable of detecting known post-eruption impacts/events. We additionally describe a methodology for extracting meaningful sequences of post-eruption impacts/events by using NLP to efficiently mine frequent multivariate cluster evolutions, which can be used to confirm or discover the chain of physical processes between a climate source and its impact(s).

data mining, machine learning, natural language, (18 more...)

2410.16544

Country:

Asia > Philippines (0.24)
Europe > Iceland (0.14)
South America > Chile (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.68)

Where to Build Food Banks and Pantries: A Two-Level Machine Learning Approach

Ruan, Gavin, Guo, Ziqi, Lin, Guang

arXiv.org Artificial Intelligence, Oct-20-2024

Over 44 million Americans currently suffer from food insecurity, of whom 13 million are children. Across the United States, thousands of food banks and pantries serve as vital sources of food and other forms of aid for food-insecure families. By optimizing food bank and pantry locations, food would become more accessible to families who desperately require it. In this work, we introduce a novel two-level optimization framework, which utilizes the K-Medoids clustering algorithm in conjunction with the Open-Source Routing Machine engine, to optimize food bank and pantry locations based on real road distances to houses and house blocks. Our proposed framework can also factor in considerations such as median household income using a pseudo-weighted K-Medoids algorithm. Testing conducted with California and Indiana household data, as well as comparisons with real food bank and pantry locations, showed that, interestingly, our proposed framework yields food pantry locations superior to existing ones and saves significant distance for households, while there is a marginal penalty on the first-level food bank to food pantry distance. Overall, we believe that the second-level benefits of this framework far outweigh any drawbacks and yield a net benefit.
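The core idea of K-Medoids on a precomputed distance matrix can be sketched as follows. This is not the authors' implementation: the distances here are made-up one-dimensional gaps rather than road distances from a routing engine, the optional `weights` argument is a plain multiplicative weighting that may differ from the paper's pseudo-weighted variant, and initialization simply takes the first k items. It also assumes all items are at distinct locations so that no cluster becomes empty.

```python
def k_medoids(dist, k, weights=None, max_iter=100):
    """Alternating K-Medoids on a precomputed distance matrix.

    dist[i][c] is the distance from item i to candidate site c.
    weights[i] scales how much item i's distance counts (default 1.0).
    Returns (medoid indices, medoid assigned to each item).
    """
    n = len(dist)
    weights = weights if weights is not None else [1.0] * n
    medoids = list(range(k))  # assumption: naive deterministic start
    for _ in range(max_iter):
        # Assign every item to its nearest current medoid.
        assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assign[i] == m]
            # Pick the member with the smallest weighted total distance.
            best = min(
                members,
                key=lambda c: sum(weights[i] * dist[i][c] for i in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    # Final assignment consistent with the returned medoids.
    assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assign


# Hypothetical households along a single road, at these positions.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, assign = k_medoids(dist, k=2)
print(medoids, assign)
```

Because a medoid is always one of the input items, the chosen sites are real candidate locations, and any distance matrix can be plugged in, including asymmetric road distances.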

artificial intelligence, food bank, machine learning, (15 more...)

2410.1542

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.15)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.06)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.06)
(4 more...)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.94)
Food & Agriculture > Agriculture (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Dynamic Contrastive Learning for Time Series Representation

Shamba, Abdul-Kazeem, Bach, Kerstin, Taylor, Gavin

arXiv.org Artificial Intelligence, Oct-20-2024

Understanding events in time series is an important task in a variety of contexts. However, human analysis and labeling are expensive and time-consuming. Therefore, it is advantageous to learn embeddings for moments in time series in an unsupervised way, which allows for good performance in classification or detection tasks with only minimal later human labeling. In this paper, we propose dynamic contrastive learning (DynaCL), an unsupervised contrastive representation learning framework for time series that uses temporally adjacent steps to define positive pairs. DynaCL adopts N-pair loss to dynamically treat all samples in a batch as positive or negative pairs, enabling efficient training and addressing the challenges of complicated sampling of positives. We demonstrate that DynaCL embeds instances from time series into semantically meaningful clusters, which allows superior performance on downstream tasks on a variety of public time series datasets. Our findings also reveal that high scores on unsupervised clustering metrics do not guarantee that the representations are useful in downstream tasks. A common task in time series (TS) analysis is to split the series into many small windows and identify or label the event taking place in each window. Learning a good representation for these moments reduces the time and domain expertise needed for this data annotation. Self-supervised learning, which produces descriptive and intelligible representations in natural language processing (NLP) and computer vision (CV), has emerged as a promising path for learning TS representation.
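A rough sketch of an N-pair style loss with temporally adjacent positives is shown below. It assumes one particular pairing, where the embedding at step t is the anchor and the embedding at step t+1 is its positive, with the positives of all other anchors in the batch acting as negatives. The abstract does not spell out DynaCL's exact batch construction, so the pairing rule, function names, and toy embeddings are illustrative assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchors, positives):
    """Mean over i of -log softmax_i of the similarities anchors[i] . positives[j].

    positives[i] is the positive for anchors[i]; every other positives[j]
    serves as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        sims = [dot(anchors[i], positives[j]) for j in range(n)]
        m = max(sims)  # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(s - m) for s in sims))
        total += lse - sims[i]
    return total / n


def adjacent_pairs(embeddings):
    """Assumed pairing: step t is the anchor, step t + 1 is its positive."""
    return embeddings[:-1], embeddings[1:]


# Hypothetical per-step embeddings of a short series.
series = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
anchors, positives = adjacent_pairs(series)
print(n_pair_loss(anchors, positives))
```

The loss falls when each anchor is more similar to its own neighbor in time than to the neighbors of other anchors, so no explicit negative sampling step is needed.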

artificial intelligence, machine learning, natural language, (17 more...)

2410.15416

Country:

North America > United States > Maryland > Anne Arundel County > Annapolis (0.04)
Europe > Norway > Central Norway > Trøndelag > Trondheim (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

A Semidefinite Relaxation Approach for Fair Graph Clustering

Baharlouei, Sina, Sabouri, Sadra

arXiv.org Machine Learning, Oct-19-2024

Fair graph clustering is crucial for ensuring equitable representation and treatment of diverse communities in network analysis. Traditional methods often ignore disparities among social, economic, and demographic groups, perpetuating biased outcomes and reinforcing inequalities. This study introduces fair graph clustering within the framework of the disparate impact doctrine, treating it as a joint optimization problem integrating clustering quality and fairness constraints. Given the NP-hard nature of this problem, we employ a semidefinite relaxation approach to approximate the underlying optimization problem. For up to medium-sized graphs, we utilize a singular value decomposition-based algorithm, while for larger graphs, we propose a novel algorithm based on the alternating direction method of multipliers. Unlike existing methods, our formulation allows for tuning the trade-off between clustering quality and fairness. Experimental results on graphs generated from the standard stochastic block model demonstrate the superiority of our approach in achieving an optimal accuracy-fairness trade-off compared to state-of-the-art methods.

artificial intelligence, fairness, machine learning, (15 more...)

2410.15233

Country:

North America > United States > California > Santa Clara County > San Jose (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
(2 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)