 Clustering


Radon-Nikodým Derivative: Re-imagining Anomaly Detection from a Measure Theoretic Perspective

arXiv.org Artificial Intelligence

Which principle underpins the design of an effective anomaly detection loss function? The answer lies in the Radon-Nikodým theorem, a fundamental result in measure theory. The key insight is that multiplying the vanilla loss function by the Radon-Nikodým derivative improves performance across the board. We refer to this as RN-Loss. This is established using PAC learnability of anomaly detection. We further show that the Radon-Nikodým derivative offers important insights into unsupervised clustering-based anomaly detection as well. We evaluate our algorithm on 96 datasets, including univariate and multivariate data from diverse domains such as healthcare, cybersecurity, and finance. We show that RN-Derivative algorithms outperform state-of-the-art methods on 68% of multivariate datasets (based on F1 scores) and achieve peak F1 scores on 72% of univariate time series datasets.
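The core idea of multiplying a vanilla loss by a Radon-Nikodým derivative can be illustrated with a minimal sketch. The abstract does not say how the derivative is estimated, so the per-sample weights below (`rn_weights`) are a hypothetical stand-in for a density ratio, and binary cross-entropy is assumed as the vanilla loss; this is not the authors' implementation.

```python
import numpy as np

def bce(y_true, p_pred, eps=1e-12):
    """Per-sample binary cross-entropy (the assumed 'vanilla' loss)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def rn_weighted_loss(y_true, p_pred, rn_weights):
    """Mean of the per-sample loss multiplied by an assumed density-ratio weight."""
    return float(np.mean(rn_weights * bce(y_true, p_pred)))

# Toy batch: three normal points and one poorly predicted anomaly.
y = np.array([0.0, 0.0, 0.0, 1.0])
p = np.array([0.1, 0.2, 0.1, 0.3])

# Hypothetical weights: anomalies up-weighted by a factor of 3.
w = np.where(y == 1.0, 3.0, 1.0)

plain = rn_weighted_loss(y, p, np.ones_like(y))
weighted = rn_weighted_loss(y, p, w)
```

With these toy values the weighted loss is larger than the plain one, because the misclassified anomaly contributes three times as much.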


Scalable Graph Condensation with Evolving Capabilities

arXiv.org Artificial Intelligence

Graph data has become a pivotal modality due to its unique ability to model relational datasets. However, real-world graph data continues to grow exponentially, resulting in a quadratic increase in the complexity of most graph algorithms as graph sizes expand. Although graph condensation (GC) methods have been proposed to address these scalability issues, existing approaches often treat the training set as static, overlooking the evolving nature of real-world graph data. This limitation leads to inefficiencies when condensing growing training sets. In this paper, we introduce GECC (Graph Evolving Clustering Condensation), a scalable graph condensation method designed to handle large-scale and evolving graph data. GECC employs a traceable and efficient approach by performing class-wise clustering on aggregated features. Furthermore, it can inherit previous condensation results as clustering centroids when the condensed graph expands, thereby attaining an evolving capability. This methodology is supported by robust theoretical foundations and demonstrates superior empirical performance. Comprehensive experiments show that GECC achieves better performance than most state-of-the-art graph condensation methods while delivering a speedup of around 1,000x on large datasets.
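The two ingredients named in the abstract, class-wise clustering on aggregated features and reuse of earlier centroids, can be sketched roughly as follows. Everything here is an assumption for illustration: the row-normalised propagation in `aggregate`, the plain Lloyd iterations in `kmeans`, and the function names themselves are hypothetical and not taken from the GECC paper.

```python
import numpy as np

def aggregate(adj, feats, hops=2):
    """Propagate features over a row-normalised adjacency with self-loops."""
    a = adj + np.eye(adj.shape[0])
    a = a / a.sum(axis=1, keepdims=True)
    out = feats
    for _ in range(hops):
        out = a @ out
    return out

def kmeans(x, init, iters=20):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = init.copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    return centroids

def condense(agg, labels, per_class, prev=None, seed=0):
    """Cluster each class separately; centroids act as condensed nodes.

    If `prev` holds centroids from an earlier run, they seed the clustering,
    which mimics the 'evolving' behaviour described in the abstract.
    """
    rng = np.random.default_rng(seed)
    result = {}
    for c in np.unique(labels):
        c = int(c)
        xc = agg[labels == c]
        if prev is not None and c in prev:
            init = prev[c]
        else:
            init = xc[rng.choice(len(xc), size=per_class, replace=False)]
        result[c] = kmeans(xc, init)
    return result

# Toy graph: a ring of 6 nodes, 4 features, 2 classes.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
feats = np.random.default_rng(1).normal(size=(n, 4))
labels = np.array([0, 0, 0, 1, 1, 1])

agg = aggregate(adj, feats)
condensed = condense(agg, labels, per_class=2)
# Later run on a (here unchanged) graph, warm-started from earlier centroids.
condensed_next = condense(agg, labels, per_class=2, prev=condensed)
```

Warm-starting from `prev` avoids re-initialising the clustering from scratch when new nodes arrive, which is one plausible reading of how inherited centroids could save work.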


Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

arXiv.org Artificial Intelligence

Rapid advancements in computing have dramatically increased the scale and cost of training Large Language Models (LLMs). Accurately predicting downstream task performance prior to model training is crucial for efficient resource allocation, yet remains challenging due to two primary constraints: (1) the "emergence phenomenon", wherein downstream performance metrics become meaningful only after extensive training, which limits the ability to use smaller models for prediction; and (2) uneven task difficulty distributions and the absence of consistent scaling laws, resulting in substantial metric variability. Existing performance prediction methods suffer from limited accuracy and reliability, thereby impeding the assessment of potential LLM capabilities. To address these challenges, we propose a Clustering-On-Difficulty (COD) downstream performance prediction framework. COD first constructs a predictable support subset by clustering tasks based on difficulty features, strategically excluding non-emergent and non-scalable clusters. The scores on the selected subset serve as effective intermediate predictors of downstream performance on the full evaluation set. With theoretical support, we derive a mapping function that transforms performance metrics from the predictable subset to the full evaluation set, thereby ensuring accurate extrapolation of LLM downstream performance. The proposed method has been applied to predict performance scaling for a 70B LLM, providing actionable insights for training resource allocation and assisting in monitoring the training process. Notably, COD achieves remarkable predictive accuracy on the 70B LLM by leveraging an ensemble of small models, demonstrating an absolute mean deviation of 1.36% across eight important LLM evaluation benchmarks.


Utilizing Social Media Analytics to Detect Trends in Saudi Arabia's Evolving Market

arXiv.org Artificial Intelligence

Saudi Arabia has experienced swift economic growth and societal transformation under Vision 2030. This offers a unique opportunity to track emerging trends in the region, which will ultimately pave the way for new business and investment possibilities. This paper explores how AI and social media analytics can identify and track trends across sectors such as construction, food and beverage, tourism, technology, and entertainment, thereby helping businesses make informed decisions. By leveraging a tailored AI-driven methodology, we analyzed millions of social media posts each month, classifying discussions and calculating scores to track the trends. The approach not only uncovered emerging trends but also revealed diminishing ones. Our methodology is able to predict the emergence and growth of trends by utilizing social media data. This approach has potential for adaptation in other regions. Ultimately, our findings highlight how ongoing, AI-powered trend analysis can enable more effective, data-informed business and development strategies in an increasingly dynamic environment.


Characterizing Structured versus Unstructured Environments based on Pedestrians' and Vehicles' Motion Trajectories

arXiv.org Artificial Intelligence

Trajectory behaviours of pedestrians and vehicles operating close to each other can differ between unstructured and structured environments. These differences in motion behaviour are valuable to consider in the trajectory prediction algorithm of an autonomous vehicle. However, the available datasets on pedestrians' and vehicles' trajectories that are commonly used as benchmarks for trajectory prediction have not been classified based on the nature of their environment. Moreover, the definitions provided for unstructured and structured environments are rather qualitative and hard to use for determining the type of a given environment. In this paper, we compare different existing datasets based on several extracted trajectory features, such as mean speed and trajectory variability. Through K-means clustering and generalized linear models, we propose more quantitative measures for distinguishing the two types of environments. Our results show that features such as trajectory variability, stop fraction and density of pedestrians differ between the two environment types and can be used to classify the existing datasets.


Moving Past Single Metrics: Exploring Short-Text Clustering Across Multiple Resolutions

arXiv.org Machine Learning

Cluster number is typically a parameter selected at the outset in clustering problems, and while impactful, the choice can often be difficult to justify. Inspired by bioinformatics, this study examines how the nature of clusters varies with cluster number, presenting a method for determining cluster robustness, and providing a systematic method for deciding on the cluster number. The study focuses specifically on short-text clustering, involving 30,000 political Twitter bios, where the sparse co-occurrence of words between texts makes finding meaningful clusters challenging. A metric of proportional stability is introduced to uncover the stability of specific clusters between cluster resolutions, and the results are visualised using Sankey diagrams to provide an interrogative tool for understanding the nature of the dataset. The visualisation provides an intuitive way to track cluster subdivision and reorganisation as cluster number increases, offering insights that static, single-resolution metrics cannot capture. The results show that instead of seeking a single 'optimal' solution, choosing a cluster number involves balancing informativeness and complexity.
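To make the idea of cluster stability between resolutions concrete, here is a small sketch. The abstract does not define proportional stability, so the definition below is an assumption: for each cluster at the coarser resolution, the largest share of its members that remain together in a single cluster at the finer resolution. The function name `proportional_stability` is likewise hypothetical.

```python
from collections import Counter

def proportional_stability(coarse, fine):
    """Map each coarse cluster to the largest share of its members
    that end up together in one fine cluster (assumed definition)."""
    members = {}
    for c, f in zip(coarse, fine):
        members.setdefault(c, Counter())[f] += 1
    return {c: max(cnt.values()) / sum(cnt.values()) for c, cnt in members.items()}

# Labels for the same 8 texts at k=2 (coarse) and k=3 (fine).
coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine = [0, 0, 0, 2, 1, 1, 2, 2]

scores = proportional_stability(coarse, fine)
# Cluster 0 keeps 3 of 4 members together; cluster 1 splits evenly.
```

The same per-cluster counts are what a Sankey diagram would draw as flows between the two resolutions.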


UNCA: A Neutrosophic-Based Framework for Robust Clustering and Enhanced Data Interpretation

arXiv.org Artificial Intelligence

Accurately representing the complex linkages and inherent uncertainties present in large datasets is still a major difficulty in the field of data clustering. We address these issues with our proposed Unified Neutrosophic Clustering Algorithm (UNCA), which combines a multifaceted strategy with Neutrosophic logic to improve clustering performance. UNCA starts with a full-fledged similarity examination via a λ-cutting matrix that filters meaningful relationships between each pair of data points. Then, we initialize centroids for Neutrosophic K-Means clustering, where the membership values are based on their degrees of truth, indeterminacy and falsity. The algorithm then integrates with a dynamic network visualization and MST (Minimum Spanning Tree) so that a visual interpretation of the relationships between the clusters can be clearly represented. UNCA employs Single-Valued Neutrosophic Sets (SVNSs) to refine cluster assignments, and after fuzzifying similarity measures, guarantees a precise clustering result. The final step involves solidifying the clustering results through defuzzification methods, offering definitive cluster assignments. According to the performance evaluation results, UNCA outperforms conventional approaches in several metrics: it achieved a Silhouette Score of 0.89 on the Iris Dataset, a Davies-Bouldin Index of 0.59 on the Wine Dataset, an Adjusted Rand Index (ARI) of 0.76 on the Digits Dataset, and a Normalized Mutual Information (NMI) of 0.80 on the Customer Segmentation Dataset. These results demonstrate how UNCA enhances interpretability and resilience in addition to improving clustering accuracy when contrasted with Fuzzy C-Means (FCM), Neutrosophic C-Means (NCM), and Kernel Neutrosophic C-Means (KNCM). This makes UNCA a useful tool for complex data processing tasks.


In-context learning of evolving data streams with tabular foundational models

arXiv.org Artificial Intelligence

State-of-the-art data stream mining in supervised classification has traditionally relied on ensembles of incremental decision trees. However, the emergence of large tabular models, i.e., transformers designed for structured numerical data, marks a significant paradigm shift. These models move beyond traditional weight updates, instead employing in-context learning through prompt tuning. By using on-the-fly sketches to summarize unbounded streaming data, one can feed this information into a pre-trained model for efficient processing. This work bridges advancements from both areas, highlighting how transformers' implicit meta-learning abilities, pre-training on drifting natural data, and reliance on context optimization directly address the core challenges of adaptive learning in dynamic environments. Exploring real-time model adaptation, this research demonstrates that TabPFN, coupled with a simple sliding memory strategy, consistently outperforms ensembles of Hoeffding trees across all non-stationary benchmarks. Several promising research directions are outlined in the paper. The authors urge the community to explore these ideas, offering valuable opportunities to advance in-context stream learning.
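A sliding memory strategy for in-context stream learning can be sketched as a bounded buffer of recent labelled examples that serves as the model's context at prediction time. The sketch below is an assumption-laden illustration: a 1-nearest-neighbour rule stands in for the pre-trained tabular model (it is not TabPFN and does not call its API), and the window size, drift pattern, and function names are hypothetical.

```python
from collections import deque
import numpy as np

def predict_from_context(context, x):
    """Stand-in for an in-context model: 1-nearest-neighbour over the window."""
    xs = np.array([item[0] for item in context])
    ys = [item[1] for item in context]
    return ys[int(np.argmin(((xs - x) ** 2).sum(axis=1)))]

def prequential_accuracy(stream, window=50):
    """Test-then-train evaluation with a fixed-size sliding memory."""
    memory = deque(maxlen=window)  # oldest examples drop out automatically
    correct = total = 0
    for x, y in stream:
        if memory:
            correct += int(predict_from_context(memory, x) == y)
            total += 1
        memory.append((x, y))  # no weight update, only the context changes
    return correct / total if total else 0.0

# Synthetic stream whose labelling rule flips halfway (abrupt concept drift).
rng = np.random.default_rng(0)
xs = rng.normal(size=(400, 2))
ys = [int(x[0] > 0) if i < 200 else int(x[0] <= 0) for i, x in enumerate(xs)]

acc = prequential_accuracy(list(zip(xs, ys)), window=50)
```

Because the memory is bounded, examples from before the drift are forgotten after `window` steps, which is what lets a fixed model adapt without retraining.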


An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

arXiv.org Artificial Intelligence

Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.


Advanced Text Analytics -- Graph Neural Network for Fake News Detection in Social Media

arXiv.org Artificial Intelligence

Traditional Graph Neural Network (GNN) approaches for fake news detection (FND) often depend on auxiliary, non-textual data such as user interaction histories or content dissemination patterns. However, these data sources are not always accessible, limiting the effectiveness and applicability of such methods. Additionally, existing models frequently struggle to capture the detailed and intricate relationships within textual information, reducing their overall accuracy. To address these challenges, this paper proposes the Advanced Text Analysis Graph Neural Network (ATA-GNN). The proposed model is designed to operate solely on textual data. ATA-GNN employs innovative topic modelling (clustering) techniques to identify typical words for each topic, leveraging multiple clustering dimensions to achieve a comprehensive semantic understanding of the text. This multi-layered design enables the model to uncover intricate textual patterns while contextualizing them within a broader semantic framework, significantly enhancing its interpretative capabilities. Extensive evaluations on widely used benchmark datasets demonstrate that ATA-GNN surpasses the performance of current GNN-based FND methods. These findings validate the potential of integrating advanced text clustering within GNN architectures to achieve more reliable and text-focused detection solutions.