Clustering
Cascaded two-stage feature clustering and selection via separability and consistency in fuzzy decision systems
Chen, Yuepeng, Ding, Weiping, Ju, Hengrong, Huang, Jiashuang, Yin, Tao
--Feature selection is a vital technique in machine learning, as it can reduce computational complexity, improve model performance, and mitigate the risk of overfitting. However, the increasing complexity and dimensionality of datasets pose significant challenges in the selection of features. Focusing on these challenges, this paper proposes a cascaded two-stage feature clustering and selection algorithm for fuzzy decision systems. In the first stage, we reduce the search space by clustering relevant features and addressing inter-feature redundancy. In the second stage, a clustering-based sequentially forward selection method that explores the global and local structure of data is presented. We propose a novel metric for assessing the significance of features, which considers both global separability and local consistency. Global separability measures the degree of intra-class cohesion and inter-class separation based on fuzzy membership, providing a comprehensive understanding of data separability. Meanwhile, local consistency leverages the fuzzy neighborhood rough set model to capture uncertainty and fuzziness in the data. The effectiveness of our proposed algorithm is evaluated through experiments conducted on 18 public datasets and a real-world schizophrenia dataset. The experiment results demonstrate our algorithm's superiority over benchmarking algorithms in both classification accuracy and the number of selected features. Index T erms--Feature selection, fuzzy neighborhood rough set, fuzzy decision systems, granular computing. ITH the advent of the digital era, there has been an unprecedented surge in data from various sources such as sensors, social media, financial systems, and healthcare resources. However, traditional methods struggle to handle big data due to its high dimensionality, noise, and redundant information, significantly impacting the accuracy and efficiency of both data analysis and decision-making processes.
TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
Zhang, Jipeng, Qin, Yaxuan, Pi, Renjie, Zhang, Weizhong, Pan, Rui, Zhang, Tong
Instruction tuning [Wei et al., 2022a, Ouyang et al., 2022] is the most important strategy for customizing Large Language Models (LLMs) for downstream tasks, which allows them to precisely understand human intentions and accurately generate responses in natural languages. Recently, many existing works Wang et al. [2023a] expand the amount and diversity of instructions for instruction tuning to further enhance the LLM's capability. However, the increased quantity of the dataset also leads to significantly higher computational costs for instruction tuning. Meanwhile, Zhou et al. [2023] revealed that only 1,000 high-quality, human-created data samples could substantially improve the ability of LLMs to follow instructions, which suggest that there exists severe redundancy in current instruction datasets, and only a high-quality subset may suffice for achieving promising performance. To address the above issue, selecting a small, highly informative subset (i.e., coreset) of training samples from the original dataset is a promising solution. This approach ensures that training on the coreset achieves performance comparable to the full dataset while significantly reducing costs. However, coreset selection is challenging as it must not only consider the quality of individual samples, but also their importance within the entire subset. For example, if two high-quality samples are very similar, selecting only one may be sufficient. This global perspective on sample importance is crucial for the quality of the selected subset.
Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models
Seidi, Navid, Roy, Satyaki, Das, Sajal K., Tripathy, Ardhendu
The diversity in disease profiles and therapeutic approaches between hospitals and health professionals underscores the need for patient-centric personalized strategies in healthcare. Alongside this, similarities in disease progression across patients can be utilized to improve prediction models in survival analysis. The need for patient privacy and the utility of prediction models can be simultaneously addressed in the framework of Federated Learning (FL). This paper outlines an approach in the domain of federated survival analysis, specifically the Cox Proportional Hazards (CoxPH) model, with a specific focus on mitigating data heterogeneity and elevating model performance. We present an FL approach that employs feature-based clustering to enhance model accuracy across synthetic datasets and real-world applications, including the Surveillance, Epidemiology, and End Results (SEER) database. Furthermore, we consider an event-based reporting strategy that provides a dynamic approach to model adaptation by responding to local data changes. Our experiments show the efficacy of our approach and discuss future directions for a practical application of FL in healthcare.
Voices in a Crowd: Searching for Clusters of Unique Perspectives
Vitsakis, Nikolas, Parekh, Amit, Konstas, Ioannis
Language models have been shown to reproduce underlying biases existing in their training data, which is the majority perspective by default. Proposed solutions aim to capture minority perspectives by either modelling annotator disagreements or grouping annotators based on shared metadata, both of which face significant challenges. We propose a framework that trains models without encoding annotator metadata, extracts latent embeddings informed by annotator behaviour, and creates clusters of similar opinions, that we refer to as voices. Resulting clusters are validated post-hoc via internal and external quantitative metrics, as well a qualitative analysis to identify the type of voice that each cluster represents. Our results demonstrate the strong generalisation capability of our framework, indicated by resulting clusters being adequately robust, while also capturing minority perspectives based on different demographic factors throughout two distinct datasets.
Multi-agent Coverage Control: From Discrete Assignments to Continuous Multi-agent Distribution Matching
The multi-agent spatial coverage control problem encompasses a broad research domain, dealing with both dynamic and static deployment strategies, discrete-task assignments, and spatial distribution-matching deployment. Coverage control may involve the deployment of a finite number of agents or a continuum through centralized or decentralized, locally-interacting schemes. All these problems can be solved via a different taxonomy of deployment algorithms for multiple agents. Depending on the application scenario, these problems involve from purely discrete descriptions of tasks (finite loads) and agents (finite resources), to a mixture of discrete and continuous elements, to fully continuous descriptions of the same. Yet, it is possible to find common features that underline all the above formulations, which we aim to illustrate here. By doing so, we aim to point the reader to novel references related to these problems. The short article outline is the following: Static coverage via concurrent area partitioning and assignment; Static coverage as a discrete task assignment; and Continuum task assignment for large-scale swarms.
From A-to-Z Review of Clustering Validation Indices
Hassan, Bryar A., Tayfor, Noor Bahjat, Hassan, Alla A., Ahmed, Aram M., Rashid, Tarik A., Abdalla, Naz N.
Data clustering involves identifying latent similarities within a dataset and organizing them into clusters or groups. The outcomes of various clustering algorithms differ as they are susceptible to the intrinsic characteristics of the original dataset, including noise and dimensionality. The effectiveness of such clustering procedures directly impacts the homogeneity of clusters, underscoring the significance of evaluating algorithmic outcomes. Consequently, the assessment of clustering quality presents a significant and complex endeavor. A pivotal aspect affecting clustering validation is the cluster validity metric, which aids in determining the optimal number of clusters. The main goal of this study is to comprehensively review and explain the mathematical operation of internal and external cluster validity indices, but not all, to categorize these indices and to brainstorm suggestions for future advancement of clustering validation research. In addition, we review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms, such as the evolutionary clustering algorithm star (ECA*). Finally, we suggest a classification framework for examining the functionality of both internal and external clustering validation measures regarding their ideal values, user-friendliness, responsiveness to input data, and appropriateness across various fields. This classification aids researchers in selecting the appropriate clustering validation measure to suit their specific requirements.
Driving pattern interpretation based on action phases clustering
Yao, Xue, Calvert, Simeon C., Hoogendoorn, Serge P.
Current approaches to identifying driving heterogeneity face challenges in comprehending fundamental patterns from the perspective of underlying driving behavior mechanisms. The concept of Action phases was proposed in our previous work, capturing the diversity of driving characteristics with physical meanings. This study presents a novel framework to further interpret driving patterns by classifying Action phases in an unsupervised manner. In this framework, a Resampling and Downsampling Method (RDM) is first applied to standardize the length of Action phases. Then the clustering calibration procedure including ''Feature Selection'', ''Clustering Analysis'', ''Difference/Similarity Evaluation'', and ''Action phases Re-extraction'' is iteratively applied until all differences among clusters and similarities within clusters reach the pre-determined criteria. Application of the framework using real-world datasets revealed six driving patterns in the I80 dataset, labeled as ''Catch up'', ''Keep away'', and ''Maintain distance'', with both ''Stable'' and ''Unstable'' states. Notably, Unstable patterns are more numerous than Stable ones. ''Maintain distance'' is the most common among Stable patterns. These observations align with the dynamic nature of driving. Two patterns ''Stable keep away'' and ''Unstable catch up'' are missing in the US101 dataset, which is in line with our expectations as this dataset was previously shown to have less heterogeneity. This demonstrates the potential of driving patterns in describing driving heterogeneity. The proposed framework promises advantages in addressing label scarcity in supervised learning and enhancing tasks such as driving behavior modeling and driving trajectory prediction.
Jigsaw Game: Federated Clustering
Xu, Jinxuan, Chen, Hong-You, Chao, Wei-Lun, Zhang, Yuqian
Federated learning has recently garnered significant attention, especially within the domain of supervised learning. However, despite the abundance of unlabeled data on end-users, unsupervised learning problems such as clustering in the federated setting remain underexplored. In this paper, we investigate the federated clustering problem, with a focus on federated k-means. We outline the challenge posed by its non-convex objective and data heterogeneity in the federated framework. To tackle these challenges, we adopt a new perspective by studying the structures of local solutions in k-means and propose a one-shot algorithm called FeCA (Federated Centroid Aggregation). FeCA adaptively refines local solutions on clients, then aggregates these refined solutions to recover the global solution of the entire dataset in a single round. We empirically demonstrate the robustness of FeCA under various federated scenarios on both synthetic and real-world data. Additionally, we extend FeCA to representation learning and present DeepFeCA, which combines Deep-Cluster and FeCA for unsupervised feature learning in the federated setting.
Out-of-Distribution Detection through Soft Clustering with Non-Negative Kernel Regression
Gulati, Aryan, Dong, Xingjian, Hurtado, Carlos, Shekkizhar, Sarath, Swayamdipta, Swabha, Ortega, Antonio
As language models become more general purpose, increased attention needs to be paid to detecting out-of-distribution (OOD) instances, i.e., those not belonging to any of the distributions seen during training. Existing methods for detecting OOD data are computationally complex and storage-intensive. We propose a novel soft clustering approach for OOD detection based on non-negative kernel regression. Our approach greatly reduces computational and space complexities (up to 11x improvement in inference time and 87% reduction in storage requirements) and outperforms existing approaches by up to 4 AUROC points on four different benchmarks. We also introduce an entropy-constrained version of our algorithm, which leads to further reductions in storage requirements (up to 97% lower than comparable approaches) while retaining competitive performance. Our soft clustering approach for OOD detection highlights its potential for detecting tail-end phenomena in extreme-scale data settings.
Profiling quantum circuits for their efficient execution on single- and multi-core architectures
Bandic, Medina, Henaff, Pablo le, Ovide, Anabel, Escofet, Pau, Rached, Sahar Ben, Rodrigo, Santiago, van Someren, Hans, Abadal, Sergi, Alarcon, Eduard, Almudever, Carmen G., Feld, Sebastian
Application-specific quantum computers offer the most efficient means to tackle problems intractable by classical computers. Realizing these architectures necessitates a deep understanding of quantum circuit properties and their relationship to execution outcomes on quantum devices. Our study aims to perform for the first time a rigorous examination of quantum circuits by introducing graph theory-based metrics extracted from their qubit interaction graph and gate dependency graph alongside conventional parameters describing the circuit itself. This methodology facilitates a comprehensive analysis and clustering of quantum circuits. Furthermore, it uncovers a connection between parameters rooted in both qubit interaction and gate dependency graphs, and the performance metrics for quantum circuit mapping, across a range of established quantum device and mapping configurations. Among the various device configurations, we particularly emphasize modular (i.e., multi-core) quantum computing architectures due to their high potential as a viable solution for quantum device scalability. This thorough analysis will help us to: i) identify key attributes of quantum circuits that affect the quantum circuit mapping performance metrics; ii) predict the performance on a specific chip for similar circuit structures; iii) determine preferable combinations of mapping techniques and hardware setups for specific circuits; and iv) define representative benchmark sets by clustering similarly structured circuits.