AITopics

doi: 10.1109/TFUZZ.2024.3420963

2407.15893

Country:

North America > Canada > Alberta (0.14)
Asia > China > Jiangsu Province > Nanjing (0.04)
Oceania > Australia > New South Wales > Wollongong (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.47)
Health & Medicine > Diagnostic Medicine > Imaging (0.46)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

arXiv.org Artificial Intelligence, Jul-21-2024

TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Zhang, Jipeng, Qin, Yaxuan, Pi, Renjie, Zhang, Weizhong, Pan, Rui, Zhang, Tong

Instruction tuning [Wei et al., 2022a, Ouyang et al., 2022] is the most important strategy for customizing Large Language Models (LLMs) for downstream tasks, allowing them to understand human intentions precisely and to generate accurate responses in natural language. Recently, many works [Wang et al., 2023a] have expanded the amount and diversity of instructions used for instruction tuning to further enhance LLM capability. However, the larger datasets also lead to significantly higher computational costs for instruction tuning. Meanwhile, Zhou et al. [2023] revealed that only 1,000 high-quality, human-created data samples could substantially improve the ability of LLMs to follow instructions, which suggests that current instruction datasets are severely redundant and that a high-quality subset alone may suffice to achieve promising performance. To address this issue, selecting a small, highly informative subset (i.e., a coreset) of training samples from the original dataset is a promising solution. The goal is for training on the coreset to achieve performance comparable to training on the full dataset while significantly reducing costs. However, coreset selection is challenging because it must consider not only the quality of individual samples but also their importance within the entire subset. For example, if two high-quality samples are very similar, selecting only one may be sufficient. This global perspective on sample importance is crucial for the quality of the selected subset.
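To make the "two similar high-quality samples" point concrete, here is a minimal sketch of a greedy selector that ranks samples by quality but skips any sample lying too close to one already chosen. This is an illustration only, not the TAGCOS algorithm: the function name `select_diverse`, the `(quality, feature_vector)` sample layout, and the `min_gap` threshold are all hypothetical.

```python
from math import dist


def select_diverse(samples, budget, min_gap):
    """Greedy coreset sketch (hypothetical helper, not the paper's method).

    samples: list of (quality, feature_vector) pairs.
    Visits samples from highest to lowest quality and keeps one only if it is
    at least `min_gap` away (Euclidean) from every sample already kept.
    """
    chosen = []
    for quality, vec in sorted(samples, key=lambda s: s[0], reverse=True):
        if len(chosen) >= budget:
            break
        if all(dist(vec, kept_vec) >= min_gap for _, kept_vec in chosen):
            chosen.append((quality, vec))
    return chosen
```

With two near-duplicate high-quality samples and one distant lower-quality sample, a budget of two keeps the best of the duplicates plus the distant sample, rather than both duplicates.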

dataset, instruction, selection

2407.15235

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Seidi, Navid, Roy, Satyaki, Das, Sajal K., Tripathy, Ardhendu

Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models

arXiv.org Machine Learning, Jul-20-2024

The diversity in disease profiles and therapeutic approaches between hospitals and health professionals underscores the need for patient-centric personalized strategies in healthcare. Alongside this, similarities in disease progression across patients can be utilized to improve prediction models in survival analysis. The need for patient privacy and the utility of prediction models can be simultaneously addressed in the framework of Federated Learning (FL). This paper outlines an approach in the domain of federated survival analysis, specifically the Cox Proportional Hazards (CoxPH) model, with a specific focus on mitigating data heterogeneity and elevating model performance. We present an FL approach that employs feature-based clustering to enhance model accuracy across synthetic datasets and real-world applications, including the Surveillance, Epidemiology, and End Results (SEER) database. Furthermore, we consider an event-based reporting strategy that provides a dynamic approach to model adaptation by responding to local data changes. Our experiments show the efficacy of our approach and discuss future directions for a practical application of FL in healthcare.

data heterogeneity, dataset, survival analysis

2407.14960

Country:

North America > United States > Missouri (0.05)
North America > United States > Indiana > Marion County > Indianapolis (0.04)
North America > United States > Alabama (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.93)
Health & Medicine > Consumer Health (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Vitsakis, Nikolas, Parekh, Amit, Konstas, Ioannis

Voices in a Crowd: Searching for Clusters of Unique Perspectives

arXiv.org Artificial Intelligence, Jul-19-2024

Language models have been shown to reproduce underlying biases existing in their training data, which is the majority perspective by default. Proposed solutions aim to capture minority perspectives by either modelling annotator disagreements or grouping annotators based on shared metadata, both of which face significant challenges. We propose a framework that trains models without encoding annotator metadata, extracts latent embeddings informed by annotator behaviour, and creates clusters of similar opinions, which we refer to as voices. Resulting clusters are validated post-hoc via internal and external quantitative metrics, as well as a qualitative analysis to identify the type of voice that each cluster represents. Our results demonstrate the strong generalisation capability of our framework, indicated by resulting clusters being adequately robust, while also capturing minority perspectives based on different demographic factors throughout two distinct datasets.

annotator, cross attention, dataset

2407.14259

Country:

Asia > Singapore (0.04)
North America > United States > New York (0.04)
North America > United States > California > Alameda County > Oakland (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Government (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Kia, Solmaz, Martinez, Sonia

Multi-agent Coverage Control: From Discrete Assignments to Continuous Multi-agent Distribution Matching

arXiv.org Artificial Intelligence, Jul-18-2024

The multi-agent spatial coverage control problem encompasses a broad research domain, dealing with both dynamic and static deployment strategies, discrete-task assignments, and spatial distribution-matching deployment. Coverage control may involve the deployment of a finite number of agents or a continuum through centralized or decentralized, locally-interacting schemes. All these problems can be solved via a different taxonomy of deployment algorithms for multiple agents. Depending on the application scenario, these problems range from purely discrete descriptions of tasks (finite loads) and agents (finite resources), to a mixture of discrete and continuous elements, to fully continuous descriptions of the same. Yet, it is possible to find common features that underlie all the above formulations, which we aim to illustrate here. By doing so, we aim to point the reader to novel references related to these problems. The article is organized as follows: Static coverage via concurrent area partitioning and assignment; Static coverage as a discrete task assignment; and Continuum task assignment for large-scale swarms.

agent, algorithm, deployment

2407.13890

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > Washington > King County > Seattle (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Hassan, Bryar A., Tayfor, Noor Bahjat, Hassan, Alla A., Ahmed, Aram M., Rashid, Tarik A., Abdalla, Naz N.

From A-to-Z Review of Clustering Validation Indices

arXiv.org Artificial Intelligence, Jul-18-2024

Data clustering involves identifying latent similarities within a dataset and organizing the data into clusters or groups. The outcomes of various clustering algorithms differ because they are sensitive to the intrinsic characteristics of the original dataset, including noise and dimensionality. The effectiveness of such clustering procedures directly impacts the homogeneity of clusters, underscoring the significance of evaluating algorithmic outcomes. Consequently, assessing clustering quality is a significant and complex task. A pivotal aspect affecting clustering validation is the cluster validity metric, which aids in determining the optimal number of clusters. The main goal of this study is to review and explain the mathematical operation of a selection of internal and external cluster validity indices, to categorize these indices, and to suggest directions for future clustering validation research. In addition, we review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms, such as the evolutionary clustering algorithm star (ECA*). Finally, we suggest a classification framework for examining the functionality of both internal and external clustering validation measures regarding their ideal values, user-friendliness, responsiveness to input data, and appropriateness across various fields. This classification aids researchers in selecting the appropriate clustering validation measure to suit their specific requirements.
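As a rough illustration of the internal/external distinction, the sketch below implements one well-known index of each kind in plain Python: the Rand index (external, compares a clustering against reference labels) and the mean silhouette coefficient (internal, uses only the data and the clustering). The choice of these two indices and the function names are assumptions made for illustration; the abstract does not say which indices the review covers.

```python
from itertools import combinations
from math import dist


def rand_index(labels_true, labels_pred):
    """External index: fraction of point pairs on which two labelings agree
    (both put the pair in the same cluster, or both separate it)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def silhouette(points, labels):
    """Internal index: mean silhouette coefficient, in [-1, 1].
    Higher values mean tighter, better separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # Mean distance to the other members of p's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance from p to any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Note that the Rand index is invariant to how clusters are named, so a labeling of 1,1,0,0 scores a perfect 1.0 against a reference of 0,0,1,1.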

algorithm, dataset, validity index

doi: 10.1016/j.neucom.2024.128198

2407.20246

Country:

Asia > Middle East > Iraq > Kurdistan Region > Sulaymaniyah Governorate > Sulaymaniyah (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
South America > Brazil > Paraná > Curitiba (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Yao, Xue, Calvert, Simeon C., Hoogendoorn, Serge P.

Driving pattern interpretation based on action phases clustering

Current approaches to identifying driving heterogeneity face challenges in comprehending fundamental patterns from the perspective of underlying driving behavior mechanisms. The concept of Action phases was proposed in our previous work, capturing the diversity of driving characteristics with physical meanings. This study presents a novel framework to further interpret driving patterns by classifying Action phases in an unsupervised manner. In this framework, a Resampling and Downsampling Method (RDM) is first applied to standardize the length of Action phases. Then the clustering calibration procedure including "Feature Selection", "Clustering Analysis", "Difference/Similarity Evaluation", and "Action phases Re-extraction" is iteratively applied until all differences among clusters and similarities within clusters reach the pre-determined criteria. Application of the framework using real-world datasets revealed six driving patterns in the I80 dataset, labeled as "Catch up", "Keep away", and "Maintain distance", with both "Stable" and "Unstable" states. Notably, Unstable patterns are more numerous than Stable ones. "Maintain distance" is the most common among Stable patterns. These observations align with the dynamic nature of driving. Two patterns "Stable keep away" and "Unstable catch up" are missing in the US101 dataset, which is in line with our expectations as this dataset was previously shown to have less heterogeneity. This demonstrates the potential of driving patterns in describing driving heterogeneity. The proposed framework promises advantages in addressing label scarcity in supervised learning and enhancing tasks such as driving behavior modeling and driving trajectory prediction.
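The abstract does not specify how RDM standardizes segment length, so the following is only a generic sketch of the idea: linear interpolation of a variable-length series onto a fixed number of evenly spaced points, which both stretches short segments and shrinks long ones. The function name `resample_to_length` is hypothetical and this should not be read as the authors' RDM.

```python
def resample_to_length(series, target_len):
    """Linearly interpolate `series` onto `target_len` evenly spaced points.

    Generic length standardization (an assumption, not the paper's RDM):
    the first and last samples are preserved, and the same routine handles
    both upsampling and downsampling.
    """
    if target_len < 2 or len(series) < 2:
        raise ValueError("need at least two samples in both input and output")
    scale = (len(series) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale                       # fractional index into series
        lo = min(int(pos), len(series) - 2)   # left neighbour, kept in range
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[lo + 1] * frac)
    return out
```

After this step every segment has the same length, so fixed-dimension features or distance measures can be applied uniformly before clustering.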

action phase, dataset, us101 dataset

2407.17518

Country:

Europe > Netherlands > South Holland > Delft (0.05)
Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Automobiles & Trucks (0.68)
Transportation > Ground > Road (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Jigsaw Game: Federated Clustering

Xu, Jinxuan, Chen, Hong-You, Chao, Wei-Lun, Zhang, Yuqian

Federated learning has recently garnered significant attention, especially within the domain of supervised learning. However, despite the abundance of unlabeled data on end-users, unsupervised learning problems such as clustering in the federated setting remain underexplored. In this paper, we investigate the federated clustering problem, with a focus on federated k-means. We outline the challenge posed by its non-convex objective and data heterogeneity in the federated framework. To tackle these challenges, we adopt a new perspective by studying the structures of local solutions in k-means and propose a one-shot algorithm called FeCA (Federated Centroid Aggregation). FeCA adaptively refines local solutions on clients, then aggregates these refined solutions to recover the global solution of the entire dataset in a single round. We empirically demonstrate the robustness of FeCA under various federated scenarios on both synthetic and real-world data. Additionally, we extend FeCA to representation learning and present DeepFeCA, which combines Deep-Cluster and FeCA for unsupervised feature learning in the federated setting.
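For orientation, here is a minimal sketch of the simplest one-shot scheme in this setting: each client runs k-means locally and uploads only its k centroids, and the server runs k-means once on the pooled centroids. This is a naive baseline written under stated assumptions, not FeCA itself; per the abstract, FeCA additionally refines the local solutions before aggregating them. The function names and the deterministic initialisation are hypothetical choices made so the example is reproducible.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on tuples, with a deterministic initialisation
    (evenly spaced picks from the sorted points). Requires k <= len(points)."""
    pts = sorted(points)
    centers = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def one_shot_aggregate(client_datasets, k):
    """Single communication round: clients send k local centroids each,
    and the server clusters the pooled centroids once."""
    uploaded = [c for data in client_datasets for c in kmeans(data, k)]
    return kmeans(uploaded, k)
```

Raw data never leaves the clients here; only centroids are shared. With heterogeneous clients, some local centroids can be poor local solutions, which is the difficulty the abstract says FeCA's refinement step targets.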

centroid, local solution, true center

2407.12764

Country:

North America > United States > Ohio (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
North America > United States > Virginia (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (0.67)
Health & Medicine > Pharmaceuticals & Biotechnology (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Gulati, Aryan, Dong, Xingjian, Hurtado, Carlos, Shekkizhar, Sarath, Swayamdipta, Swabha, Ortega, Antonio

Out-of-Distribution Detection through Soft Clustering with Non-Negative Kernel Regression

As language models become more general purpose, increased attention needs to be paid to detecting out-of-distribution (OOD) instances, i.e., those not belonging to any of the distributions seen during training. Existing methods for detecting OOD data are computationally complex and storage-intensive. We propose a novel soft clustering approach for OOD detection based on non-negative kernel regression. Our approach greatly reduces computational and space complexities (up to 11x improvement in inference time and 87% reduction in storage requirements) and outperforms existing approaches by up to 4 AUROC points on four different benchmarks. We also introduce an entropy-constrained version of our algorithm, which leads to further reductions in storage requirements (up to 97% lower than comparable approaches) while retaining competitive performance. Our soft clustering approach for OOD detection highlights its potential for detecting tail-end phenomena in extreme-scale data settings.

dataset, detection, representation

2407.13141

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Singapore (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Profiling quantum circuits for their efficient execution on single- and multi-core architectures

Bandic, Medina, Henaff, Pablo le, Ovide, Anabel, Escofet, Pau, Rached, Sahar Ben, Rodrigo, Santiago, van Someren, Hans, Abadal, Sergi, Alarcon, Eduard, Almudever, Carmen G., Feld, Sebastian

Application-specific quantum computers offer the most efficient means to tackle problems intractable by classical computers. Realizing these architectures necessitates a deep understanding of quantum circuit properties and their relationship to execution outcomes on quantum devices. Our study aims to perform for the first time a rigorous examination of quantum circuits by introducing graph theory-based metrics extracted from their qubit interaction graph and gate dependency graph alongside conventional parameters describing the circuit itself. This methodology facilitates a comprehensive analysis and clustering of quantum circuits. Furthermore, it uncovers a connection between parameters rooted in both qubit interaction and gate dependency graphs, and the performance metrics for quantum circuit mapping, across a range of established quantum device and mapping configurations. Among the various device configurations, we particularly emphasize modular (i.e., multi-core) quantum computing architectures due to their high potential as a viable solution for quantum device scalability. This thorough analysis will help us to: i) identify key attributes of quantum circuits that affect the quantum circuit mapping performance metrics; ii) predict the performance on a specific chip for similar circuit structures; iii) determine preferable combinations of mapping techniques and hardware setups for specific circuits; and iv) define representative benchmark sets by clustering similarly structured circuits.
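As a small illustration of what a qubit interaction graph is, the sketch below builds one from a gate list and derives a few simple graph metrics. The circuit encoding (a list of `(gate_name, qubits)` pairs), the function names, and the particular metrics are assumptions chosen for illustration; the abstract does not list the metrics the study actually uses.

```python
from collections import Counter


def interaction_graph(gates):
    """Map each unordered qubit pair to the number of two-qubit gates acting on it.
    `gates` is a list of (name, qubits) pairs; single-qubit gates add no edge."""
    edges = Counter()
    for _name, qubits in gates:
        if len(qubits) == 2:
            edges[tuple(sorted(qubits))] += 1
    return edges


def graph_metrics(edges, n_qubits):
    """A few simple descriptors of the (unweighted) interaction graph."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    max_edges = n_qubits * (n_qubits - 1) // 2
    return {
        "avg_degree": sum(degree.values()) / n_qubits,
        "max_degree": max(degree.values(), default=0),
        "density": len(edges) / max_edges if max_edges else 0.0,
    }
```

Feature vectors of this kind, one per circuit, are the sort of input on which circuits could then be clustered into structurally similar groups.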

architecture, graph, quantum circuit

2407.12640

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
Europe > Netherlands > South Holland > Delft (0.04)
North America > United States > New York (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)