AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Symmetrical SyncMap for Imbalanced General Chunking Problems

Zhang, Heng, Vargas, Danilo Vasconcellos

arXiv.org Artificial IntelligenceOct-16-2023

Recently, SyncMap pioneered an approach to learn complex structures from sequences as well as adapt to any changes in underlying structures. This is achieved by using only nonlinear dynamical equations inspired by neuron group behaviors, i.e., without loss functions. Here we propose Symmetrical SyncMap that goes beyond the original work to show how to create dynamical equations and attractor-repeller points which are stable over the long run, even dealing with imbalanced continual general chunking problems (CGCPs). The main idea is to apply equal updates from negative and positive feedback loops by symmetrical activation. We then introduce the concept of memory window to allow for more positive updates. Our algorithm surpasses or ties other unsupervised state-of-the-art baselines in all 12 imbalanced CGCPs with various difficulties, including dynamically changing ones. To verify its performance in real-world scenarios, we conduct experiments on several well-studied structure learning problems. The proposed method surpasses substantially other methods in 3 out of 4 scenarios, suggesting that symmetrical activation plays a critical role in uncovering topological structures and even hierarchies encoded in temporal data.

node, symmetrical syncmap, syncmap, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.physd.2023.133923

2310.10045

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Fukuoka Prefecture > Fukuoka (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.47)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

Add feedback

Graph Mining for Cybersecurity: A Survey

Yan, Bo, Yang, Cheng, Shi, Chuan, Fang, Yong, Li, Qi, Ye, Yanfang, Du, Junping

arXiv.org Artificial IntelligenceOct-16-2023

The explosive growth of cyber attacks nowadays, such as malware, spam, and intrusions, caused severe consequences on society. Securing cyberspace has become an utmost concern for organizations and governments. Traditional Machine Learning (ML) based methods are extensively used in detecting cyber threats, but they hardly model the correlations between real-world cyber entities. In recent years, with the proliferation of graph mining techniques, many researchers investigated these techniques for capturing correlations between cyber entities and achieving high performance. It is imperative to summarize existing graph-based cybersecurity solutions to provide a guide for future studies. Therefore, as a key contribution of this paper, we provide a comprehensive review of graph mining for cybersecurity, including an overview of cybersecurity tasks, the typical graph mining techniques, and the general process of applying them to cybersecurity, as well as various solutions for different cybersecurity tasks. For each task, we probe into relevant methods and highlight the graph types, graph approaches, and task levels in their modeling. Furthermore, we collect open datasets and toolkits for graph-based cybersecurity. Finally, we outlook the potential directions of this field for future research.

detection, graph, statistical feature, (11 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3610228

2304.00485

Country:

Asia > North Korea (0.14)
Asia > China > Beijing > Beijing (0.05)
North America > United States > Indiana > Saint Joseph County > South Bend (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (1.00)
Government > Regional Government > North America Government > United States Government (0.45)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.45)

Add feedback

A correlation-based fuzzy cluster validity index with secondary options detector

Wiroonsri, Nathakhun, Preedasawakul, Onthada

arXiv.org Machine LearningOct-16-2023

The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter $m$ is set to a large value. Our R package called UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Machine Learning

2308.14785

Country:

North America > United States > Wisconsin (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > Thailand (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Graph Neural Network approaches for single-cell data: A recent overview

Lazaros, Konstantinos, Koumadorakis, Dimitris E., Vlamos, Panagiotis, Vrahatis, Aristidis G.

arXiv.org Artificial IntelligenceOct-14-2023

Graph Neural Networks (GNN) are reshaping our understanding of biomedicine and diseases by revealing the deep connections among genes and cells. As both algorithmic and biomedical technologies have advanced significantly, we're entering a transformative phase of personalized medicine. While pioneering tools like Graph Attention Networks (GAT) and Graph Convolutional Neural Networks (Graph CNN) are advancing graph-based learning, the rise of single-cell sequencing techniques is reshaping our insights on cellular diversity and function. Numerous studies have combined GNNs with single-cell data, showing promising results. In this work, we highlight the GNN methodologies tailored for single-cell data over the recent years. We outline the diverse range of graph deep learning architectures that center on GAT methodologies. Furthermore, we underscore the several objectives of GNN strategies in single-cell data contexts, ranging from cell-type annotation, data integration and imputation, gene regulatory network reconstruction, clustering and many others. This review anticipates a future where GNNs become central to single-cell analysis efforts, particularly as vast omics datasets are continuously generated and the interconnectedness of cells and genes enhances our depth of knowledge in biomedicine.

graph, neural network, representation, (16 more...)

arXiv.org Artificial Intelligence

2310.09561

Country:

Europe > Greece > Ionian Islands > Corfu (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Promising Solution (0.45)

Industry:

Telecommunications (1.00)
Information Technology > Networks (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Clustered FedStack: Intermediate Global Models with Bayesian Information Criterion

Shaik, Thanveer, Tao, Xiaohui, Li, Lin, Higgins, Niall, Gururajan, Raj, Zhou, Xujuan, Yong, Jianming

arXiv.org Artificial IntelligenceOct-14-2023

Federated Learning (FL) is currently one of the most popular technologies in the field of Artificial Intelligence (AI) due to its collaborative learning and ability to preserve client privacy. However, it faces challenges such as non-identically and non-independently distributed (non-IID) and data with imbalanced labels among local clients. To address these limitations, the research community has explored various approaches such as using local model parameters, federated generative adversarial learning, and federated representation learning. In our study, we propose a novel Clustered FedStack framework based on the previously published Stacked Federated Learning (FedStack) framework. The local clients send their model predictions and output layer weights to a server, which then builds a robust global model. This global model clusters the local clients based on their output layer weights using a clustering mechanism. We adopt three clustering mechanisms, namely K-Means, Agglomerative, and Gaussian Mixture Models, into the framework and evaluate their performance. We use Bayesian Information Criterion (BIC) with the maximum likelihood function to determine the number of clusters. The Clustered FedStack models outperform baseline models with clustering mechanisms. To estimate the convergence of our proposed framework, we use Cyclical learning rates.

fedstack model, local client, output layer weight, (14 more...)

arXiv.org Artificial Intelligence

2309.11044

Country:

Oceania > Australia > Queensland (0.05)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Diagnostic Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Global, Unified Representation of Heterogenous Robot Dynamics Using Composition Operators: A Koopman Direct Encoding Method

Asada, Harry

arXiv.org Artificial IntelligenceOct-14-2023

The dynamic complexity of robots and mechatronic systems often pertains to the hybrid nature of dynamics, where governing equations consist of heterogenous equations that are switched depending on the state of the system. Legged robots and manipulator robots experience contact-noncontact discrete transitions, causing switching of governing equations. Analysis of these systems have been a challenge due to the lack of a global, unified model that is amenable to analysis of the global behaviors. Composition operator theory has the potential to provide a global, unified representation by converting them to linear dynamical systems in a lifted space. The current work presents a method for encoding nonlinear heterogenous dynamics into a high dimensional space of observables in the form of Koopman operator. First, a new formula is established for representing the Koopman operator in a Hilbert space by using inner products of observable functions and their composition with the governing state transition function. This formula, called Direct Encoding, allows for converting a class of heterogenous systems directly to a global, unified linear model. Unlike prevalent data-driven methods, where results can vary depending on numerical data, the proposed method is globally valid, not requiring numerical simulation of the original dynamics. A simple example validates the theoretical results, and the method is applied to a multi-cable suspension system.

basis function, equation, matrix, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TMECH.2023.3253599

2208.00175

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Colorado > Denver County > Denver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report (0.82)
Personal (0.68)

Industry: Automobiles & Trucks (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Topological Data Analysis in smart manufacturing processes -- A survey on the state of the art

Uray, Martin, Giunti, Barbara, Kerber, Michael, Huber, Stefan

arXiv.org Artificial IntelligenceOct-13-2023

Topological Data Analysis (TDA) is a mathematical method using techniques from topology for the analysis of complex, multi-dimensional data that has been widely and successfully applied in several fields such as medicine, material science, biology, and others. This survey summarizes the state of the art of TDA in yet another application area: industrial manufacturing and production in the context of Industry 4.0. We perform a rigorous and reproducible literature search of applications of TDA on the setting of industrial production and manufacturing. The resulting works are clustered and analyzed based on their application area within the manufacturing process and their input data type. We highlight the key benefits of TDA and their tools in this area and describe its challenges, as well as future potential. Finally, we discuss which TDA methods are underutilized in (the specific area of) industry and the identified types of application, with the goal of prompting more research in this profitable area of application.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2310.09319

Country:

North America > United States (0.93)
Europe > Austria (0.28)
Europe > Germany (0.14)
Asia (0.14)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (0.93)
Semiconductors & Electronics (0.93)
Energy > Oil & Gas > Upstream (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
(2 more...)

Add feedback

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Liang, Davis, Gonen, Hila, Mao, Yuning, Hou, Rui, Goyal, Naman, Ghazvininejad, Marjan, Zettlemoyer, Luke, Khabsa, Madian

arXiv.org Artificial IntelligenceOct-13-2023

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This \textit{vocabulary bottleneck} limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.

dataset, xlm-r, xlm-v, (14 more...)

arXiv.org Artificial Intelligence

2301.10472

Country:

North America > United States (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

CHIP: Contrastive Hierarchical Image Pretraining

Mittal, Arpit, Jhaveri, Harshil, Mallick, Swapnil, Ajmera, Abhishek

arXiv.org Artificial IntelligenceOct-12-2023

Few-shot object classification is the task of classifying objects in an image with limited number of examples as supervision. We propose a one-shot/few-shot classification model that can classify an object of any unseen class into a relatively general category in an hierarchically based classification. Our model uses a three-level hierarchical contrastive loss based ResNet152 classifier for classifying an object based on its features extracted from Image embedding, not used during the training phase. For our experimentation, we have used a subset of the ImageNet (ILSVRC-12) dataset that contains only the animal classes for training our model and created our own dataset of unseen classes for evaluating our trained model. Our model provides satisfactory results in classifying the unknown objects into a generic category which has been later discussed in greater detail.

animal class, phase 2, unseen class, (14 more...)

arXiv.org Artificial Intelligence

2310.08304

Country:

Oceania > Australia (0.04)
Asia (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Local Graph Clustering with Noisy Labels

de Luca, Artur Back, Fountoulakis, Kimon, Yang, Shenghao

arXiv.org Machine LearningOct-12-2023

The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.

artificial intelligence, graph, machine learning, (19 more...)

arXiv.org Machine Learning

2310.08031

Country: North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback