Goto

Collaborating Authors

 Clustering


Spatial Clustering of Citizen Science Data Improves Downstream Species Distribution Models

arXiv.org Artificial Intelligence

Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives.


Cluster-Enhanced Federated Graph Neural Network for Recommendation

arXiv.org Artificial Intelligence

Personal interaction data can be effectively modeled as individual graphs for each user in recommender systems.Graph Neural Networks (GNNs)-based recommendation techniques have become extremely popular since they can capture high-order collaborative signals between users and items by aggregating the individual graph into a global interactive graph.However, this centralized approach inherently poses a threat to user privacy and security. Recently, federated GNN-based recommendation techniques have emerged as a promising solution to mitigate privacy concerns. Nevertheless, current implementations either limit on-device training to an unaccompanied individual graphs or necessitate reliance on an extra third-party server to touch other individual graphs, which also increases the risk of privacy leakage. To address this challenge, we propose a Cluster-enhanced Federated Graph Neural Network framework for Recommendation, named CFedGR, which introduces high-order collaborative signals to augment individual graphs in a privacy preserving manner. Specifically, the server clusters the pretrained user representations to identify high-order collaborative signals. In addition, two efficient strategies are devised to reduce communication between devices and the server. Extensive experiments on three benchmark datasets validate the effectiveness of our proposed methods.


A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information

arXiv.org Artificial Intelligence

Reconstructing the physical complexity of many-body dynamical systems can be challenging. Starting from the trajectories of their constitutive units (raw data), typical approaches require selecting appropriate descriptors to convert them into time-series, which are then analyzed to extract interpretable information. However, identifying the most effective descriptor is often non-trivial. Here, we report a data-driven approach to compare the efficiency of various descriptors in extracting information from noisy trajectories and translating it into physically relevant insights. As a prototypical system with non-trivial internal complexity, we analyze molecular dynamics trajectories of an atomistic system where ice and water coexist in equilibrium near the solid/liquid transition temperature. We compare general and specific descriptors often used in aqueous systems: number of neighbors, molecular velocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and Neighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from the fifth neighbor ($d_5$). Using Onion Clustering -- an efficient unsupervised method for single-point time-series analysis -- we assess the maximum extractable information for each descriptor and rank them via a high-dimensional metric. Our results show that advanced descriptors like SOAP and LENS outperform classical ones due to higher signal-to-noise ratios. Nonetheless, even simple descriptors can rival or exceed advanced ones after local signal denoising. For example, $d_5$, initially among the weakest, becomes the most effective at resolving the system's non-local dynamical complexity after denoising. This work highlights the critical role of noise in information extraction from molecular trajectories and offers a data-driven approach to identify optimal descriptors for systems with characteristic internal complexity.


A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores

arXiv.org Artificial Intelligence

The increasing scale of manycore systems poses significant challenges in managing reliability while meeting performance demands. Simultaneously, these systems become more susceptible to different aging mechanisms such as negative-bias temperature instability (NBTI), hot carrier injection (HCI), and thermal cycling (TC), as well as the electromigration (EM) phenomenon. In this paper, we propose a reinforcement learning (RL)-based task mapping method to improve the reliability of manycore systems considering the aforementioned aging mechanisms, which consists of three steps including bin packing, task-to-bin mapping, and task-to-core mapping. In the initial step, a density-based spatial application with noise (DBSCAN) clustering method is employed to compose some clusters (bins) based on the cores temperature. Then, the Q-learning algorithm is used for the two latter steps, to map the arrived task on a core such that the minimum thermal variation is occurred among all the bins. Compared to the state-of-the-art works, the proposed method is performed during runtime without requiring any parameter to be calculated offline. The effectiveness of the proposed technique is evaluated on 16, 32, and 64 cores systems using SPLASH2 and PARSEC benchmark suite applications. The results demonstrate up to 27% increase in the mean time to failure (MTTF) compared to the state-of-the-art task mapping techniques.


Extended Cross-Modality United Learning for Unsupervised Visible-Infrared Person Re-identification

arXiv.org Artificial Intelligence

Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, the existing methods lack cross-modality clustering or excessively pursue cluster-level association, which makes it difficult to perform reliable modality-invariant features learning. To deal with this issue, we propose a Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, we design ECUL to naturally integrates intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters the neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge in terms of clustering algorithm. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experiments results on SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.


Online High-Frequency Trading Stock Forecasting with Automated Feature Clustering and Radial Basis Function Neural Networks

arXiv.org Artificial Intelligence

This study presents an autonomous experimental machine learning protocol for high-frequency trading (HFT) stock price forecasting that involves a dual competitive feature importance mechanism and clustering via shallow neural network topology for fast training. By incorporating the k-means algorithm into the radial basis function neural network (RBFNN), the proposed method addresses the challenges of manual clustering and the reliance on potentially uninformative features. More specifically, our approach involves a dual competitive mechanism for feature importance, combining the mean-decrease impurity (MDI) method and a gradient descent (GD) based feature importance mechanism. This approach, tested on HFT Level 1 order book data for 20 S&P 500 stocks, enhances the forecasting ability of the RBFNN regressor. Our findings suggest that an autonomous approach to feature selection and clustering is crucial, as each stock requires a different input feature space. Overall, by automating the feature selection and clustering processes, we remove the need for manual topological grid search and provide a more efficient way to predict LOB's mid-price.


Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs

arXiv.org Artificial Intelligence

This paper proposes a Clustering, Labeling, then Augmenting framework that significantly enhances performance in Semi-Supervised Text Classification (SSTC) tasks, effectively addressing the challenge of vast datasets with limited labeled examples. Unlike traditional SSTC approaches that rely on a predefined small set of labeled data to generate pseudo-labels for the unlabeled data, this framework innovatively employs clustering to select representative "landmarks" for labeling. These landmarks subsequently act as intermediaries in an ensemble of augmentation techniques, including Retrieval-Augmented Generation (RAG), Large Language Model (LLMs)-based rewriting, and synonym substitution, to generate synthetic labeled data without making pseudo-labels for the unlabeled data. Empirical results show that even in complex text document classification scenarios involving over 100 categories, our method achieves state-of-the-art accuracies of 95.41% on the Reuters dataset and 82.43% on the Web of Science dataset. Our approach significantly reduces the reliance on human labeling efforts and the associated expenses, while simultaneously ensuring high data quality and minimizing privacy risks. The finetuning results further show the efficiency of fine-tuning LLMs for text classification tasks, highlighting a robust solution for leveraging limited labeled data.


TPCH: Tensor-interacted Projection and Cooperative Hashing for Multi-view Clustering

arXiv.org Artificial Intelligence

In recent years, anchor and hash-based multi-view clustering methods have gained attention for their efficiency and simplicity in handling large-scale data. However, existing methods often overlook the interactions among multi-view data and higher-order cooperative relationships during projection, negatively impacting the quality of hash representation in low-dimensional spaces, clustering performance, and sensitivity to noise. To address this issue, we propose a novel approach named Tensor-Interacted Projection and Cooperative Hashing for Multi-View Clustering(TPCH). TPCH stacks multiple projection matrices into a tensor, taking into account the synergies and communications during the projection process. By capturing higher-order multi-view information through dual projection and Hamming space, TPCH employs an enhanced tensor nuclear norm to learn more compact and distinguishable hash representations, promoting communication within and between views. Experimental results demonstrate that this refined method significantly outperforms state-of-the-art methods in clustering on five large-scale multi-view datasets. Moreover, in terms of CPU time, TPCH achieves substantial acceleration compared to the most advanced current methods. The code is available at \textcolor{red}{\url{https://github.com/jankin-wang/TPCH}}.


DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering

arXiv.org Artificial Intelligence

Fine-grained clustering is a practical yet challenging task, whose essence lies in capturing the subtle differences between instances of different classes. Such subtle differences can be easily disrupted by data augmentation or be overwhelmed by redundant information in data, leading to significant performance degradation for existing clustering methods. In this work, we introduce DiFiC a fine-grained clustering method building upon the conditional diffusion model. Distinct from existing works that focus on extracting discriminative features from images, DiFiC resorts to deducing the textual conditions used for image generation. To distill more precise and clustering-favorable object semantics, DiFiC further regularizes the diffusion target and guides the distillation process utilizing neighborhood similarity. Extensive experiments demonstrate that DiFiC outperforms both state-of-the-art discriminative and generative clustering methods on four fine-grained image clustering benchmarks. We hope the success of DiFiC will inspire future research to unlock the potential of diffusion models in tasks beyond generation. The code will be released.


Data clustering: an essential technique in data science

arXiv.org Artificial Intelligence

This paper provides a comprehensive exploration of data clustering, emphasizing its methodologies and applications across different fields. Traditional techniques, including partitional and hierarchical clustering, are discussed alongside other approaches such as data stream, subspace and network clustering, highlighting their role in addressing complex, high-dimensional datasets. The paper also reviews the foundational principles of clustering, introduces common tools and methods, and examines its diverse applications in data science. Finally, the discussion concludes with insights into future directions, underscoring the centrality of clustering in driving innovation and enabling data-driven decision making.