Goto

Collaborating Authors

 Clustering


Consistent line clustering using geometric hypergraphs

arXiv.org Machine Learning

Traditional data analysis often represents data as a weighted graph with pairwise similarities, but many problems do not naturally fit this framework. In line clustering, points in a Euclidean space must be grouped so that each cluster is well approximated by a line segment. Since any two points define a line, pairwise similarities fail to capture the structure of the problem, necessitating the use of higher-order interactions modeled by geometric hypergraphs. We encode geometry into a 3-uniform hypergraph by treating sets of three points as hyperedges whenever they are approximately collinear. The resulting hypergraph contains information about the underlying line segments, which can then be extracted using community recovery algorithms. In contrast to classical hypergraph block models, latent geometric constraints in this construction introduce significant dependencies between hyperedges, which restricts the applicability of many standard theoretical tools. We aim to determine the fundamental limits of line clustering and evaluate hypergraph-based line clustering methods. To this end, we derive information-theoretic thresholds for exact and almost exact recovery for data generated from intersecting lines on a plane with additive Gaussian noise. We develop a polynomial-time spectral algorithm and show that it succeeds under noise conditions that match the information-theoretic bounds up to a polylogarithmic factor.


M-learner:A Flexible And Powerful Framework To Study Heterogeneous Treatment Effect In Mediation Model

arXiv.org Machine Learning

We propose a novel method, termed the M-learner, for estimating heterogeneous indirect and total treatment effects and identifying relevant subgroups within a mediation framework. The procedure comprises four key steps. First, we compute individual-level conditional average indirect/total treatment effect Second, we construct a distance matrix based on pairwise differences. Third, we apply tSNE to project this matrix into a low-dimensional Euclidean space, followed by K-means clustering to identify subgroup structures. Finally, we calibrate and refine the clusters using a threshold-based procedure to determine the optimal configuration. To the best of our knowledge, this is the first approach specifically designed to capture treatment effect heterogeneity in the presence of mediation. Experimental results validate the robustness and effectiveness of the proposed framework. Application to the real-world Jobs II dataset highlights the broad adaptability and potential applicability of our method.Code is available at https: //anonymous.4open.science/r/M-learner-C4BB.


From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

arXiv.org Artificial Intelligence

Dataset diversity plays a pivotal role for the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity still remain underexplored. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component, operating at either macroscopic (entire instruction semantics) or mesoscopic levels (instruction units), and furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. We then fine-tune LLMs on these datasets to assess the six diversity-control strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.


Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach

arXiv.org Artificial Intelligence

While subgroup disparities and performance bias are increasingly studied in computational research, fairness in categorical Speech Emotion Recognition (SER) remains underexplored. Existing methods often rely on explicit demographic labels, which are difficult to obtain due to privacy concerns. To address this limitation, we introduce an Implicit Demography Inference (IDI) module that leverages pseudo-labeling from a pre-trained model and unsupervised learning using k-means clustering to mitigate bias in SER. Our experiments show that pseudo-labeling IDI reduces subgroup disparities, improving fairness metrics by over 28% with less than a 2% decrease in SER accuracy. Also, the unsupervised IDI yields more than a 4.6% improvement in fairness metrics with a drop of less than 3.6% in SER performance. Further analyses reveal that the unsupervised IDI consistently mitigates race and age disparities, demonstrating its potential when explicit demographic information is unavailable.


PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder

arXiv.org Artificial Intelligence

Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and their corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on two large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at https://github.com/dukesun99/ACL-PRISM.


Anomaly Detection and Improvement of Clusters using Enhanced K-Means Algorithm

arXiv.org Artificial Intelligence

This paper introduces a unified approach to cluster refinement and anomaly detection in datasets. We propose a novel algorithm that iteratively reduces the intra-cluster variance of N clusters until a global minimum is reached, yielding tighter clusters than the standard k-means algorithm. We evaluate the method using intrinsic measures for unsupervised learning, including the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index, and extend it to anomaly detection by identifying points whose assignment causes a significant variance increase. External validation on synthetic data and the UCI Breast Cancer and UCI Wine Quality datasets employs the Jaccard similarity score, V-measure, and F1 score. Results show variance reductions of 18.7% and 88.1% on the synthetic and Wine Quality datasets, respectively, along with accuracy and F1 score improvements of 22.5% and 20.8% on the Wine Quality dataset.


LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

arXiv.org Artificial Intelligence

Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. While prior reviews have addressed these issues, they often focus on individual limitations or consider them within the broader context of evaluating overall model performance. This survey addresses the gap by presenting a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2025, using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we extract 14,648 relevant limitation papers using keyword filtering and LLM-based classification, validated against expert labels. Using topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM), we identify between 7 and 15 prominent types of limitations discussed in recent LLM research across the ACL and arXiv datasets. We find that LLM-related research increases nearly sixfold in ACL and nearly fifteenfold in arXiv between 2022 and 2025, while LLLMs research grows even faster, by a factor of over 12 in ACL and nearly 28 in arXiv. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2025. We offer a quantitative view of trends in LLM limitations research and release a dataset of annotated abstracts and a validated methodology, available at: https://github.com/a-kostikova/LLLMs-Survey.


Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs

arXiv.org Artificial Intelligence

Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) Unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.


Number of Clusters in a Dataset: A Regularized K-means Approach

arXiv.org Artificial Intelligence

Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. Regularized k-means algorithm is a possible approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term $λk$, where $k$ is the number of clusters and $λ$ a positive coefficient. Currently, there are no principled guidelines for setting a value for the critical hyperparameter $λ$. In this paper, we derive rigorous bounds for $λ$ assuming clusters are {\em ideal}. Ideal clusters (defined as $d$-dimensional spheres with identical radii) are close proxies for k-means clusters ($d$-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with additive regularizer often yields multiple solutions. Thus, we also analyze k-means algorithm with multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.


Towards a More Generalized Approach in Open Relation Extraction

arXiv.org Artificial Intelligence

Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.