AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

A Statistical Framework for Ranking LLM-Based Chatbots

Ameli, Siavash, Zhuang, Siyuan, Stoica, Ion, Mahoney, Michael W.

arXiv.org Machine LearningDec-24-2024

Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this foundation, we propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle ties -- an integral aspect of human-judged comparisons -- significantly improving the model's fit to observed data. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and facilitating intuitive groupings into performance tiers. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints, ensuring stable and interpretable parameter estimation. Through rigorous evaluation and extensive experimentation, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data. To support reproducibility and practical adoption, we release leaderbot, an open-source Python package implementing our models and analyses.

large language model, machine learning, natural language, (22 more...)

arXiv.org Machine Learning

2412.18407

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > Wales (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.45)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Applying LLM and Topic Modelling in Psychotherapeutic Contexts

Vanin, Alexander, Bolshev, Vadim, Panfilova, Anastasia

arXiv.org Artificial IntelligenceDec-23-2024

This study explores the use of LLM (Large language models) to analyze therapist remarks in a psychotherapeutic setting. The paper focuses on the application of BERTopic, a machine learning-based topic modeling tool, to the dialogue of two different groups of therapists--classical and modern--which makes it possible to identify and describe a set of topics that consistently emerge across these groups. The paper describes in detail the chosen algorithm for BERTopic, which included creating a vector space from a corpus of therapist remarks, reducing its dimensionality, clustering the space, and creating and optimizing topic representation. Along with the automatic topical modeling by the BERTopic, the research involved an expert assessment of the findings and manual topic structure optimization. The topic modeling results highlighted the most common and stable topics in therapists' speech, offering insights into how language patterns in therapy develop and remain stable across different therapeutic styles. This work contributes to the growing field of machine learning in psychotherapy by demonstrating the potential of automated methods to improve both the practice and training of therapists. The study highlights the value of topic modeling as a tool for gaining a deeper understanding of therapeutic dialogue and offers new opportunities for improving therapeutic effectiveness and clinical supervision.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.17449

Country: Asia > Middle East > Jordan (0.04)

Genre:

Personal > Interview (1.00)
Research Report > New Finding (0.67)

Industry:

Education (1.00)
Health & Medicine > Consumer Health (0.67)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Subtopic-aware View Sampling and Temporal Aggregation for Long-form Document Matching

Zhou, Youchao, Huang, Heyan, Wu, Zhijing, Liu, Yuhang, Wang, Xinglin

arXiv.org Artificial IntelligenceDec-23-2024

Long-form document matching aims to judge the relevance between two documents and has been applied to various scenarios. Most existing works utilize hierarchical or long context models to process documents, which achieve coarse understanding but may ignore details. Some researchers construct a document view with similar sentences about aligned document subtopics to focus on detailed matching signals. However, a long document generally contains multiple subtopics. The matching signals are heterogeneous from multiple topics. Considering only the homologous aligned subtopics may not be representative enough and may cause biased modeling. In this paper, we introduce a new framework to model representative matching signals. First, we propose to capture various matching signals through subtopics of document pairs. Next, We construct multiple document views based on subtopics to cover heterogeneous and valuable details. However, existing spatial aggregation methods like attention, which integrate all these views simultaneously, are hard to integrate heterogeneous information. Instead, we propose temporal aggregation, which effectively integrates different views gradually as the training progresses. Experimental results show that our learning framework is effective on several document-matching tasks, including news duplication and legal case retrieval.

data mining, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2412.07573

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(9 more...)

Genre: Research Report > New Finding (0.48)

Industry: Law (0.89)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(3 more...)

Add feedback

An Adaptive Framework for Multi-View Clustering Leveraging Conditional Entropy Optimization

Li, Lijian

arXiv.org Artificial IntelligenceDec-23-2024

Multi-view clustering (MVC) has emerged as a powerful technique for extracting valuable insights from data characterized by multiple perspectives or modalities. Despite significant advancements, existing MVC methods struggle with effectively quantifying the consistency and complementarity among views, and are particularly susceptible to the adverse effects of noisy views, known as the Noisy-View Drawback (NVD). To address these challenges, we propose CE-MVC, a novel framework that integrates an adaptive weighting algorithm with a parameter-decoupled deep model. Leveraging the concept of conditional entropy and normalized mutual information, CE-MVC quantitatively assesses and weights the informative contribution of each view, facilitating the construction of robust unified representations. The parameter-decoupled design enables independent processing of each view, effectively mitigating the influence of noise and enhancing overall clustering performance. Extensive experiments demonstrate that CE-MVC outperforms existing approaches, offering a more resilient and accurate solution for multi-view clustering tasks.

artificial intelligence, information, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2412.17647

Country: North America > United States > New York (0.29)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.46)

Add feedback

Log-Time K-Means Clustering for 1D Data: Novel Approaches with Proof and Implementation

Hyun, Jake

arXiv.org Artificial IntelligenceDec-23-2024

Clustering is a key task in machine learning, with $k$-means being widely used for its simplicity and effectiveness. While 1D clustering is common, existing methods often fail to exploit the structure of 1D data, leading to inefficiencies. This thesis introduces optimized algorithms for $k$-means++ initialization and Lloyd's algorithm, leveraging sorted data, prefix sums, and binary search for improved computational performance. The main contributions are: (1) an optimized $k$-cluster algorithm achieving $O(l \cdot k^2 \cdot \log n)$ complexity for greedy $k$-means++ initialization and $O(i \cdot k \cdot \log n)$ for Lloyd's algorithm, where $l$ is the number of greedy $k$-means++ local trials, and $i$ is the number of Lloyd's algorithm iterations, and (2) a binary search-based two-cluster algorithm, achieving $O(\log n)$ runtime with deterministic convergence to a Lloyd's algorithm local minimum. Benchmarks demonstrate over a 4500x speedup compared to scikit-learn for large datasets while maintaining clustering quality measured by within-cluster sum of squares (WCSS). Additionally, the algorithms achieve a 300x speedup in an LLM quantization task, highlighting their utility in emerging applications. This thesis bridges theory and practice for 1D $k$-means clustering, delivering efficient and sound algorithms implemented in a JIT-optimized open-source Python library.

centroid, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.15295

Country: North America > United States > California (0.14)

Genre:

Research Report > Promising Solution (0.40)
Overview > Innovation (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Unifying Family of Data-Adaptive Partitioning Algorithms

Oldaker, Guy B. IV, Emelianenko, Maria

arXiv.org Artificial IntelligenceDec-21-2024

Clustering algorithms remain valuable tools for grouping and summarizing the most important aspects of data. Example areas where this is the case include image segmentation, dimension reduction, signals analysis, model order reduction, numerical analysis, and others. As a consequence, many clustering approaches have been developed to satisfy the unique needs of each particular field. In this article, we present a family of data-adaptive partitioning algorithms that unifies several well-known methods (e.g., k-means and k-subspaces). Indexed by a single parameter and employing a common minimization strategy, the algorithms are easy to use and interpret, and scale well to large, high-dimensional problems. In addition, we develop an adaptive mechanism that (a) exhibits skill at automatically uncovering data structures and problem parameters without any expert knowledge and, (b) can be used to augment other existing methods. By demonstrating the performance of our methods on examples from disparate fields including subspace clustering, model order reduction, and matrix approximation, we hope to highlight their versatility and potential for extending the boundaries of existing scientific domains. We believe our family's parametrized structure represents a synergism of algorithms that will foster new developments and directions, not least within the data science community.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2412.16713

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

SODor: Long-Term EEG Partitioning for Seizure Onset Detection

Chen, Zheng, Matsubara, Yasuko, Sakurai, Yasushi, Sun, Jimeng

arXiv.org Artificial IntelligenceDec-20-2024

Deep learning models have recently shown great success in classifying epileptic patients using EEG recordings. Unfortunately, classification-based methods lack a sound mechanism to detect the onset of seizure events. In this work, we propose a two-stage framework, \method, that explicitly models seizure onset through a novel task formulation of subsequence clustering. Given an EEG sequence, the framework first learns a set of second-level embeddings with label supervision. It then employs model-based clustering to explicitly capture long-term temporal dependencies in EEG sequences and identify meaningful subsequences. Epochs within a subsequence share a common cluster assignment (normal or seizure), with cluster or state transitions representing successful onset detections. Extensive experiments on three datasets demonstrate that our method can correct misclassifications, achieving 5%-11% classification improvements over other baselines and accurately detecting seizure onsets.

artificial intelligence, machine learning, subsequence, (16 more...)

arXiv.org Artificial Intelligence

2412.15598

Country:

North America > United States > Illinois > Champaign County > Urbana (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Statistical Modeling of Univariate Multimodal Data

Chasani, Paraskevi, Likas, Aristidis

arXiv.org Machine LearningDec-20-2024

Unimodality constitutes a key property indicating grouping behavior of the data around a single mode of its density. We propose a method that partitions univariate data into unimodal subsets through recursive splitting around valley points of the data density. For valley point detection, we introduce properties of critical points on the convex hull of the empirical cumulative density function (ecdf) plot that provide indications on the existence of density valleys. Next, we apply a unimodal data modeling approach that provides a statistical model for each obtained unimodal subset in the form of a Uniform Mixture Model (UMM). Consequently, a hierarchical statistical model of the initial dataset is obtained in the form of a mixture of UMMs, named as the Unimodal Mixture Model (UDMM). The proposed method is non-parametric, hyperparameter-free, automatically estimates the number of unimodal subsets and provides accurate statistical models as indicated by experimental results on clustering and density estimation tasks.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Machine Learning

2412.15894

Country:

North America > United States > Wyoming (0.04)
North America > United States > New York (0.04)
Europe > Greece > Epirus > Ioannina (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Personalized Clustering via Targeted Representation Learning

Geng, Xiwen, Zhao, Suyun, Yu, Yixin, Peng, Borui, Du, Pan, Chen, Hong, Li, Cuiping, Wang, Mengdie

arXiv.org Artificial IntelligenceDec-20-2024

Clustering traditionally aims to reveal a natural grouping structure within unlabeled data. However, this structure may not always align with users' preferences. In this paper, we propose a personalized clustering method that explicitly performs targeted representation learning by interacting with users via modicum task information (e.g., $\textit{must-link}$ or $\textit{cannot-link}$ pairs) to guide the clustering direction. We query users with the most informative pairs, i.e., those pairs most hard to cluster and those most easy to miscluster, to facilitate the representation learning in terms of the clustering preference. Moreover, by exploiting attention mechanism, the targeted representation is learned and augmented. By leveraging the targeted representation and constrained contrastive loss as well, personalized clustering is obtained. Theoretically, we verify that the risk of personalized clustering is tightly bounded, guaranteeing that active queries to users do mitigate the clustering risk. Experimentally, extensive results show that our method performs well across different clustering tasks and datasets, even when only a limited number of queries are available.

artificial intelligence, machine learning, orientation, (15 more...)

arXiv.org Artificial Intelligence

2412.1369

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > China (0.05)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering

Keraghel, Imed, Nadif, Mohamed

arXiv.org Artificial IntelligenceDec-19-2024

Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2412.14867

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Portugal (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback