AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Large sample spectral analysis of graph-based multi-manifold clustering

Trillos, Nicolas Garcia, He, Pengfei, Li, Chenghui

arXiv.org Machine LearningJul-28-2021

In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds $\mathcal{M} = \mathcal{M}_1 \cup\dots \cup \mathcal{M}_N$ that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on $\mathcal{M}$ with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.

graph, laplacian, manifold, (15 more...)

arXiv.org Machine Learning

2107.1361

Country:

North America > United States > Wisconsin > Dane County > Madison (0.28)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > Michigan > Ingham County > Lansing (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Checking Patch Behaviour against Test Specification

Tian, Haoye, Li, Yinghua, Pian, Weiguo, Kaboré, Abdoul Kader, Liu, Kui, Klein, Jacques, Bissyande, Tegawendé F.

arXiv.org Artificial IntelligenceJul-28-2021

Towards predicting patch correctness in APR, we propose a simple, but novel hypothesis on how the link between the patch behaviour and failing test specifications can be drawn: similar failing test cases should require similar patches. We then propose BATS, an unsupervised learning-based system to predict patch correctness by checking patch Behaviour Against failing Test Specification. BATS exploits deep representation learning models for code and patches: for a given failing test case, the yielded embedding is used to compute similarity metrics in the search for historical similar test cases in order to identify the associated applied patches, which are then used as a proxy for assessing generated patch correctness. Experimentally, we first validate our hypothesis by assessing whether ground-truth developer patches cluster together in the same way that their associated failing test cases are clustered. Then, after collecting a large dataset of 1278 plausible patches (written by developers or generated by some 32 APR tools), we use BATS to predict correctness: BATS achieves an AUC between 0.557 to 0.718 and a recall between 0.562 and 0.854 in identifying correct patches. Compared against previous work, we demonstrate that our approach outperforms state-of-the-art performance in patch correctness prediction, without the need for large labeled patch datasets in contrast with prior machine learning-based approaches. While BATS is constrained by the availability of similar test cases, we show that it can still be complementary to existing approaches: used in conjunction with a recent approach implementing supervised learning, BATS improves the overall recall in detecting correct patches. We finally show that BATS can be complementary to the state-of-the-art PATCH-SIM dynamic approach of identifying the correct patches for APR tools.

arXiv.org Artificial Intelligence

2107.13296

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Chongqing Province > Chongqing (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

FedGroup: Efficient Clustered Federated Learning via Decomposed Data-Driven Measure

Duan, Moming, Liu, Duo, Ji, Xinyuan, Liu, Renping, Liang, Liang, Chen, Xianzhang, Tan, Yujuan

arXiv.org Artificial IntelligenceJul-27-2021

Federated Learning (FL) enables the multiple participating devices to collaboratively contribute to a global neural network model while keeping the training data locally. Unlike the centralized training setting, the non-IID and imbalanced (statistical heterogeneity) training data of FL is distributed in the federated network, which will increase the divergences between the local models and global model, further degrading performance. In this paper, we propose a novel clustered federated learning (CFL) framework FedGroup, in which we 1) group the training of clients based on the similarities between the clients' optimization directions for high training performance; 2) construct a new data-driven distance measure to improve the efficiency of the client clustering procedure. 3) implement a newcomer device cold start mechanism based on the auxiliary global model for framework scalability and practicality. FedGroup can achieve improvements by dividing joint optimization into groups of sub-optimization and can be combined with FL optimizer FedProx. The convergence and complexity are analyzed to demonstrate the efficiency of our proposed framework. We also evaluate FedGroup and FedGrouProx (combined with FedProx) on several open datasets and made comparisons with related CFL frameworks. The results show that FedGroup can significantly improve absolute test accuracy by +14.1% on FEMNIST compared to FedAvg. +3.4% on Sentiment140 compared to FedProx, +6.9% on MNIST compared to FeSEM.

accuracy, fedgroup, heterogeneity, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00042

2010.0687

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Chongqing Province > Chongqing (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Scalable Community Detection via Parallel Correlation Clustering

Shi, Jessica, Dhulipala, Laxman, Eisenstat, David, Łącki, Jakub, Mirrokni, Vahab

arXiv.org Artificial IntelligenceJul-27-2021

Graph clustering and community detection are central problems in modern data mining. The increasing need for analyzing billion-scale data calls for faster and more scalable algorithms for these problems. There are certain trade-offs between the quality and speed of such clustering algorithms. In this paper, we design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework based on the LambdaCC objective (introduced by Veldt et al.), which encompasses modularity and correlation clustering. Our framework consists of highly-optimized implementations that scale to large data sets of billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines while maintaining or improving quality.

graph, objective, vertex, (16 more...)

arXiv.org Artificial Intelligence

2108.01731

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Oceania > Australia (0.04)
(9 more...)

Genre: Research Report (0.40)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Add feedback

Unsupervised Learning of Neurosymbolic Encoders

Zhan, Eric, Sun, Jennifer J., Kennedy, Ann, Yue, Yisong, Chaudhuri, Swarat

arXiv.org Artificial IntelligenceJul-27-2021

We present a framework for the unsupervised learning of neurosymbolic encoders, i.e., encoders obtained by composing neural networks with symbolic programs from a domain-specific language. Such a framework can naturally incorporate symbolic expert knowledge into the learning process and lead to more interpretable and factorized latent representations than fully neural encoders. Also, models learned this way can have downstream impact, as many analysis workflows can benefit from having clean programmatic descriptions. We ground our learning algorithm in the variational autoencoding (VAE) framework, where we aim to learn a neurosymbolic encoder in conjunction with a standard decoder. Our algorithm integrates standard VAE-style training with modern program synthesis techniques. We evaluate our method on learning latent representations for real-world trajectory data from animal biology and sports analytics. We show that our approach offers significantly better separation than standard VAEs and leads to practical gains on downstream tasks.

encoder, representation, trajectory, (14 more...)

arXiv.org Artificial Intelligence

2107.13132

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.84)
(2 more...)

Add feedback

A Data-Driven Biophysical Computational Model of Parkinson's Disease based on Marmoset Monkeys

Ranieri, Caetano M., Pimentel, Jhielson M., Romano, Marcelo R., Elias, Leonardo A., Romero, Roseli A. F., Lones, Michael A., Araujo, Mariana F. P., Vargas, Patricia A., Moioli, Renan C.

arXiv.org Artificial IntelligenceJul-26-2021

In this work we propose a new biophysical computational model of brain regions relevant to Parkinson's Disease based on local field potential data collected from the brain of marmoset monkeys. Parkinson's disease is a neurodegenerative disorder, linked to the death of dopaminergic neurons at the substantia nigra pars compacta, which affects the normal dynamics of the basal ganglia-thalamus-cortex neuronal circuit of the brain. Although there are multiple mechanisms underlying the disease, a complete description of those mechanisms and molecular pathogenesis are still missing, and there is still no cure. To address this gap, computational models that resemble neurobiological aspects found in animal models have been proposed. In our model, we performed a data-driven approach in which a set of biologically constrained parameters is optimised using differential evolution. Evolved models successfully resembled single-neuron mean firing rates and spectral signatures of local field potentials from healthy and parkinsonian marmoset brain data. As far as we are concerned, this is the first computational model of Parkinson's Disease based on simultaneous electrophysiological recordings from seven brain regions of Marmoset monkeys. Results show that the proposed model could facilitate the investigation of the mechanisms of PD and support the development of techniques that can indicate new therapies. It could also be applied to other computational neuroscience problems in which biological data could be used to fit multi-scale models of brain circuits.

computational model, neuron, pd condition, (17 more...)

arXiv.org Artificial Intelligence

2107.12536

Country:

North America > United States (0.14)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
South America > Brazil > São Paulo > Campinas (0.04)
(4 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (1.00)
Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science > Neuroscience (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.45)

Add feedback

EGGS: Eigen-Gap Guided Search Making Subspace Clustering Easy

Fan, Jicong, Tu, Yiheng, Zhang, Zhao, Zhao, Mingbo

arXiv.org Artificial IntelligenceJul-26-2021

The performance of spectral clustering heavily relies on the quality of affinity matrix. A variety of affinity-matrix-construction methods have been proposed but they have hyper-parameters to determine beforehand, which requires strong experience and lead to difficulty in real applications especially when the inter-cluster similarity is high or/and the dataset is large. On the other hand, we often have to determine to use a linear model or a nonlinear model, which still depends on experience. To solve these two problems, in this paper, we present an eigen-gap guided search method for subspace clustering. The main idea is to find the most reliable affinity matrix among a set of candidates constructed by linear and kernel regressions, where the reliability is quantified by the \textit{relative-eigen-gap} of graph Laplacian defined in this paper. We show, theoretically and numerically, that the Laplacian matrix with a larger relative-eigen-gap often yields a higher clustering accuracy and stability. Our method is able to automatically search the best model and hyper-parameters in a pre-defined space. The search space is very easy to determine and can be arbitrarily large, though a relatively compact search space can reduce the highly unnecessary computation. Our method has high flexibility and convenience in real applications, and also has low computational cost because the affinity matrix is not computed by iterative optimization. We extend the method to large-scale datasets such as MNIST, on which the time cost is less than 90s and the clustering accuracy is state-of-the-art. Extensive experiments of natural image clustering show that our method is more stable, accurate, and efficient than baseline methods.

accuracy, matrix, subspace, (15 more...)

arXiv.org Artificial Intelligence

2107.12183

Country:

Europe > Spain > Andalusia > Cádiz Province > Cadiz (0.04)
Asia > China > Hong Kong (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Anhui Province > Hefei (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Clustering by Maximizing Mutual Information Across Views

Do, Kien, Tran, Truyen, Venkatesh, Svetha

arXiv.org Artificial IntelligenceJul-24-2021

We propose a novel framework for image clustering that incorporates joint representation learning and clustering. Our method consists of two heads that share the same backbone network - a "representation learning" head and a "clustering" head. The "representation learning" head captures fine-grained patterns of objects at the instance level which serve as clues for the "clustering" head to extract coarse-grain information that separates objects into clusters. The whole model is trained in an end-to-end manner by minimizing the weighted sum of two sample-oriented contrastive losses applied to the outputs of the two heads. To ensure that the contrastive loss corresponding to the "clustering" head is optimal, we introduce a novel critic function called "log-of-dot-product". Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art single-stage clustering methods across a variety of image datasets, improving over the best baseline by about 5-7% in accuracy on CIFAR10/20, STL10, and ImageNet-Dogs. Further, the "two-stage" variant of our method also achieves better results than baselines on three challenging ImageNet subsets.

information, representation, vector, (14 more...)

arXiv.org Artificial Intelligence

2107.11635

Country: Oceania > Australia (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Multi-Perspective Content Delivery Networks Security Framework Using Optimized Unsupervised Anomaly Detection

Yang, Li, Moubayed, Abdallah, Shami, Abdallah, Heidari, Parisa, Boukhtouta, Amine, Larabi, Adel, Brunner, Richard, Preda, Stere, Migault, Daniel

arXiv.org Artificial IntelligenceJul-23-2021

Content delivery networks (CDNs) provide efficient content distribution over the Internet. CDNs improve the connectivity and efficiency of global communications, but their caching mechanisms may be breached by cyber-attackers. Among the security mechanisms, effective anomaly detection forms an important part of CDN security enhancement. In this work, we propose a multi-perspective unsupervised learning framework for anomaly detection in CDNs. In the proposed framework, a multi-perspective feature engineering approach, an optimized unsupervised anomaly detection model that utilizes an isolation forest and a Gaussian mixture model, and a multi-perspective validation method, are developed to detect abnormal behaviors in CDNs mainly from the client Internet Protocol (IP) and node perspectives, therefore to identify the denial of service (DoS) and cache pollution attack (CPA) patterns. Experimental results are presented based on the analytics of eight days of real-world CDN log data provided by a major CDN operator. Through experiments, the abnormal contents, compromised nodes, malicious IPs, as well as their corresponding attack types, are identified effectively by the proposed framework and validated by multiple cybersecurity experts. This shows the effectiveness of the proposed method when applied to real-world CDN data.

cyberwarfare, node, upstream oil & gas, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TNSM.2021.3100308

2107.11514

Country:

North America > United States (0.46)
North America > Canada > Quebec (0.14)
North America > Canada > Ontario > Middlesex County > London (0.14)
(5 more...)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.50)
Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Federated Graph Classification over Non-IID Graphs

Xie, Han, Ma, Jing, Xiong, Li, Yang, Carl

arXiv.org Artificial IntelligenceJul-23-2021

Federated learning has emerged as an important paradigm for training machine learning models in different domains. For graph-level tasks such as graph classification, graphs can also be regarded as a special type of data samples, which can be collected and stored in separate local systems. Similar to other domains, multiple local systems, each holding a small set of graphs, may benefit from collaboratively training a powerful graph mining model, such as the popular graph neural networks (GNNs). To provide more motivation towards such endeavors, we analyze real-world graphs from different domains to confirm that they indeed share certain graph properties that are statistically significant compared with random graphs. However, we also find that different sets of graphs, even from the same domain or same dataset, are non-IID regarding both graph structures and node features. To handle this, we propose a graph clustered federated learning (GCFL) framework that dynamically finds clusters of local systems based on the gradients of GNNs, and theoretically justify that such clusters can reduce the structure and feature heterogeneity among graphs owned by the local systems. Moreover, we observe the gradients of GNNs to be rather fluctuating in GCFL which impedes high-quality clustering, and design a gradient sequence-based clustering mechanism based on dynamic time warping (GCFL+). Extensive experimental results and in-depth analysis demonstrate the effectiveness of our proposed frameworks.

dataset, graph, heterogeneity, (14 more...)

arXiv.org Artificial Intelligence

2106.13423

Country:

North America > United States > Virginia (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback