AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Next Waves in Veridical Network Embedding

Ward, Owen G., Huang, Zhen, Davison, Andrew, Zheng, Tian

arXiv.org Machine Learning, Jul-10-2020

Embedding nodes of a large network into a metric (e.g., Euclidean) space has become an area of active research in statistical machine learning, which has found applications in natural and social sciences. Generally, a representation of a network object is learned in a Euclidean geometry and is then used for subsequent tasks regarding the nodes and/or edges of the network, such as community detection, node classification and link prediction. Network embedding algorithms have been proposed in multiple disciplines, often with domain-specific notations and details. In addition, different measures and tools have been adopted to evaluate and compare the methods proposed under different settings, often dependent on the downstream tasks. As a result, it is challenging to study these algorithms in the literature systematically. Motivated by the recently proposed Veridical Data Science (VDS) framework, we propose a framework for network embedding algorithms and discuss how the principles of predictability, computability and stability apply in this context. The utilization of this framework in network embedding holds the potential to motivate and point to new directions for future research.

node, representation, similarity, (14 more...)

arXiv.org Machine Learning

2007.05385

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

A Performance Guarantee for Spectral Clustering

Boedihardjo, March, Deng, Shaofeng, Strohmer, Thomas

arXiv.org Machine Learning, Jul-10-2020

The two-step spectral clustering method, which consists of the Laplacian eigenmap and a rounding step, is a widely used method for graph partitioning. It can be seen as a natural relaxation to the NP-hard minimum ratio cut problem. In this paper we study the central question: when is spectral clustering able to find the global solution to the minimum ratio cut problem? First, we provide a condition that naturally depends on the intra- and inter-cluster connectivities of a given partition under which we may certify that this partition is the solution to the minimum ratio cut problem. Then we develop a deterministic two-to-infinity norm perturbation bound for the invariant subspace of the graph Laplacian that corresponds to the $k$ smallest eigenvalues. Finally, by combining these two results, we give a condition under which spectral clustering is guaranteed to output the global solution to the minimum ratio cut problem, which serves as a performance guarantee for spectral clustering.
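As a rough illustration of the two steps named in the abstract (Laplacian eigenmap, then rounding), here is a minimal pure-Python sketch restricted to the simplest case of two clusters. It is not the paper's method or analysis. Everything in it is an assumption made for illustration: the function names, the use of power iteration in place of a full eigen-solver, rounding by the sign of the second eigenvector (the Fiedler vector), and the toy graph.

```python
import math


def fiedler_vector(adj, iters=500):
    """Approximate the eigenvector of the second-smallest Laplacian eigenvalue.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    Power iteration is run on c*I - L while the constant vector is projected out.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg)  # upper bound on the largest eigenvalue of L = D - A
    x = [float(i) for i in range(n)]  # arbitrary deterministic start (assumption)
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the trivial constant eigenvector
        # y = (c*I - L) x, where (L x)_i = deg_i * x_i - sum_j adj_ij * x_j
        y = [
            c * x[i] - (deg[i] * x[i] - sum(adj[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x


def spectral_bipartition(adj):
    """Rounding step for two clusters: split nodes by the sign of the Fiedler vector."""
    return [1 if vi > 0 else 0 for vi in fiedler_vector(adj)]


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

labels = spectral_bipartition(adj)
print(labels)  # the two triangles receive different labels
```

For more than two clusters, the usual approach embeds each node using the eigenvectors of the $k$ smallest eigenvalues and rounds with a clustering routine such as k-means; that general case is not shown here.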

artificial intelligence, machine learning, partition, (16 more...)

arXiv.org Machine Learning

2007.05627

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Yolo County > Davis (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Detecting Malicious Accounts in Permissionless Blockchains using Temporal Graph Properties

Agarwal, Rachit, Barve, Shikhar, Shukla, Sandeep Kumar

arXiv.org Machine Learning, Jul-10-2020

The temporal nature of a blockchain graph, in which accounts are modeled as nodes and transactions as directed edges, enables us to understand the behavior (malicious or benign) of the accounts. Predictive classification of accounts as malicious or benign could help users of permissionless blockchain platforms operate in a secure manner. Motivated by this, we introduce temporal features such as burst and attractiveness on top of several already used graph properties such as the node degree and clustering coefficient. Using the identified features, we train various Machine Learning (ML) algorithms and identify the algorithm that performs best in detecting which accounts are malicious. We then study the behavior of the accounts over different temporal granularities of the dataset before assigning them malicious tags. For the Ethereum blockchain, we find that the ExtraTreesClassifier performs best among supervised ML algorithms on the entire dataset. On the other hand, using cosine similarity on top of the results provided by unsupervised ML algorithms such as K-Means on the entire dataset, we were able to detect 554 more suspicious accounts. Further, using behavior change analysis for accounts, we identify 814 unique suspicious accounts across different temporal granularities.

artificial intelligence, machine learning, malicious account, (16 more...)

arXiv.org Machine Learning

2007.05169

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > Barbados > Christ Church (0.04)
Asia > Middle East > Israel (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety (1.00)
Information Technology > Security & Privacy (1.00)
Banking & Finance > Trading (1.00)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

MAGIC: Multi-scale Heterogeneity Analysis and Clustering for Brain Diseases

Wen, Junhao, Varol, Erdem, Chand, Ganesh, Sotiras, Aristeidis, Davatzikos, Christos

arXiv.org Machine Learning, Jul-9-2020

There is a growing amount of clinical, anatomical and functional evidence for the heterogeneous presentation of neuropsychiatric and neurodegenerative diseases such as schizophrenia and Alzheimer's Disease (AD). Elucidating distinct subtypes of diseases allows a better understanding of neuropathogenesis and enables the possibility of developing targeted treatment programs. Recent semi-supervised clustering techniques have provided a data-driven way to understand disease heterogeneity. However, existing methods do not take into account that subtypes of the disease might present themselves at different spatial scales across the brain. Here, we introduce a novel method, MAGIC, to uncover disease heterogeneity by leveraging multi-scale clustering. We first extract multi-scale patterns of structural covariance (PSCs) followed by a semi-supervised clustering with double cyclic block-wise optimization across different scales of PSCs. We validate MAGIC using simulated heterogeneous neuroanatomical data and demonstrate its clinical potential by exploring the heterogeneity of AD using T1 MRI scans of 228 cognitively normal (CN) and 191 patients. Our results indicate two main subtypes of AD with distinct atrophy patterns that consist of both fine-scale atrophy in the hippocampus as well as large-scale atrophy in cortical regions. The evidence for the heterogeneity is further corroborated by the clinical evaluation of two subtypes, which indicates that there is a subpopulation of AD patients that tend to be younger and decline faster in cognitive performance relative to the other subpopulation, which tends to be older and maintains a relatively steady decline in cognitive abilities.

artificial intelligence, heterogeneity, machine learning, (13 more...)

arXiv.org Machine Learning

2007.00812

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New York (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (1.00)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.91)

Fair Algorithms for Hierarchical Agglomerative Clustering

Chhabra, Anshuman, Mohapatra, Prasant

arXiv.org Machine Learning, Jul-9-2020

Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science and machine learning, and seek to partition the dataset into clusters while generating a hierarchical relationship between the data samples themselves. HAC algorithms are employed in a number of applications, such as biology, natural language processing, and recommender systems. Thus, it is imperative to ensure that these algorithms are fair: even if the dataset contains biases against certain protected groups, the cluster outputs generated should not be discriminatory against samples from any of these groups. However, recent work in clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. Therefore, in this paper, we propose fair algorithms for performing HAC that 1) enforce fairness constraints irrespective of the distance linkage criteria used, 2) generalize to any natural measures of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have running times competitive with vanilla HAC. To the best of our knowledge, this is the first work that studies fairness for HAC algorithms. We also propose an algorithm with lower asymptotic time complexity than HAC algorithms that can rectify existing HAC outputs and make them fair as a result. Moreover, we carry out extensive experiments on multiple real-world UCI datasets to demonstrate the working of our algorithms.
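For readers unfamiliar with the baseline the abstract calls vanilla HAC, the following is a minimal pure-Python sketch of agglomerative clustering with single linkage. It does not implement the fair algorithms proposed in the paper. The function name, the choice of single linkage, Euclidean distance, and the sample points are all assumptions made for illustration.

```python
def single_linkage_hac(points, num_clusters):
    """Vanilla agglomerative clustering: repeatedly merge the two closest clusters.

    Returns (clusters, merges). clusters is a list of lists of point indices;
    merges records the hierarchy as (members_a, members_b, distance) tuples.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters, merges


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, merges = single_linkage_hac(points, num_clusters=2)
print(clusters)
```

The `merges` list is the hierarchy the abstract refers to: reading it in order reproduces the dendrogram from the leaves upward. This naive version recomputes all linkage distances at every step and is meant only to show the structure of the procedure.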

algorithm, artificial intelligence, machine learning, (12 more...)

arXiv.org Machine Learning

2005.03197

Country:

North America > United States > California > Yolo County > Davis (0.14)
North America > United States > Virginia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Model-based Clustering using Automatic Differentiation: Confronting Misspecification and High-Dimensional Data

Kasa, Siva Rajesh, Rajan, Vaibhav

arXiv.org Machine Learning, Jul-8-2020

We study two practically important cases of model-based clustering using Gaussian Mixture Models: (1) when there is misspecification and (2) on high-dimensional data, in the light of recent advances in Gradient Descent (GD) based optimization using Automatic Differentiation (AD). Our simulation studies show that EM has better clustering performance, measured by Adjusted Rand Index, compared to GD in cases of misspecification, whereas on high-dimensional data GD outperforms EM. We observe that both with EM and GD there are many solutions with high likelihood but poor cluster interpretation. To address this problem we design a new penalty term for the likelihood based on the Kullback-Leibler divergence between pairs of fitted components. Closed-form expressions for the gradients of this penalized likelihood are difficult to derive, but AD can be done effortlessly, illustrating the advantage of AD-based optimization. Extensions of this penalty for high-dimensional data and for model selection are discussed. Numerical experiments on synthetic and real datasets demonstrate the efficacy of clustering using the proposed penalized likelihood approach.
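The abstract measures clustering performance with the Adjusted Rand Index (ARI). As a small aid, here is a sketch of the standard pair-counting formula for ARI in pure Python. The function name and the example labelings are assumptions for illustration, and the sketch assumes at least two samples; it is not code from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # cluster sizes, first labeling
    cols = Counter(labels_pred)                     # cluster sizes, second labeling

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every sample in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Label names do not matter, only the grouping does.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed)
```

A score of 1 means the two partitions agree exactly up to relabeling, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.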

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Machine Learning

2007.12786

Country:

Asia > Middle East > Jordan (0.04)
Asia > Singapore (0.04)
Oceania > Australia > Tasmania (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.46)
Health & Medicine > Pharmaceuticals & Biotechnology (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Multi-Manifold Learning for Large-scale Targeted Advertising System

Shin, Kyuyong, Park, Young-Jin, Kim, Kyung-Min, Kwon, Sunyoung

arXiv.org Machine Learning, Jul-8-2020

Messenger advertisements (ads) give a direct and personal user experience, yielding high conversion rates and sales. However, people are skeptical about ads and sometimes perceive them as spam, which eventually leads to a decrease in user satisfaction. Targeted advertising, which serves ads to individuals who may exhibit interest in a particular advertising message, is therefore strongly required. The key to the success of precise user targeting lies in learning accurate user and ad representations in the embedding space. Most previous studies have limited representation learning to the Euclidean space, but recent studies have suggested hyperbolic manifold learning for the distinct projection of complex network properties emerging from real-world datasets such as social networks, recommender systems, and advertising. We propose a framework that can effectively learn the hierarchical structure in users and ads on the hyperbolic space, and extend it to Multi-Manifold Learning. Our method constructs multiple hyperbolic manifolds with learnable curvatures and maps the representation of user and ad to each manifold. The origin of each manifold is set as the centroid of each user cluster. The user preference for each ad is estimated using the distance between two entities in the hyperbolic space, and the final prediction is determined by aggregating the values calculated from the learned multiple manifolds. We evaluate our method on public benchmark datasets and a large-scale commercial messenger system LINE, and demonstrate its effectiveness through improved performance.
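To make "distance between two entities in the hyperbolic space" concrete, the sketch below computes the geodesic distance in the Poincaré ball model with fixed curvature -1. The abstract does not say which hyperbolic model the paper uses, and its curvatures are learnable, so the model, the fixed curvature, the function name, and the sample points are all assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the open unit ball (curvature -1)."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(
        1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    )


# Hypothetical embeddings: a "user" at the origin and two "ads".
user = (0.0, 0.0)
ad_near = (0.5, 0.0)
ad_far = (0.95, 0.0)

d_near = poincare_distance(user, ad_near)
d_far = poincare_distance(user, ad_far)
print(d_near, d_far)
```

Distances grow rapidly as points approach the boundary of the ball, which is what lets a low-dimensional hyperbolic embedding represent tree-like, hierarchical structure. In a scoring setup such as the one described, a smaller distance would correspond to a stronger estimated preference.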

artificial intelligence, machine learning, manifold, (18 more...)

arXiv.org Machine Learning

2007.02334

Country:

North America > United States > California > San Diego County > San Diego (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Mateo County > Menlo Park (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.84)
Marketing (0.72)
Information Technology > Services (0.48)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Community detection and Social Network analysis based on the Italian wars of the 15th century

Fumanal-Idocin, J., Alonso-Betanzos, A., Cordón, O., Bustince, H., Minárová, M.

arXiv.org Artificial Intelligence, Jul-7-2020

In this contribution we study social network modelling by using human interaction as a basis. To do so, we propose a new set of functions, affinities, designed to capture the nature of the local interactions among each pair of actors in a network. By using these functions, we develop a new community detection algorithm, the Borgia Clustering, where communities naturally arise from the multi-agent interaction in the network. We also discuss the effects of size and scale for communities regarding this case, as well as how we cope with the additional complexity present when big communities arise. Finally, we compare our community detection solution with other representative algorithms, finding favourable results.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.future.2020.06.030

2007.02641

Country:

Europe > Greece (0.14)
Europe > Italy (0.05)
Europe > France (0.04)
(51 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.70)
Media (0.69)
Information Technology > Services (0.63)
Government > Regional Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Learning from Data to Optimize Control in Precision Farming

Kocian, Alexander, Incrocci, Luca

arXiv.org Machine Learning, Jul-7-2020

Precision farming is one of many ways to meet a 70 percent increase in global demand for agricultural products on current agricultural land by 2050, with reduced need for fertilizers and efficient use of water resources. The catalyst for the emergence of precision farming has been satellite positioning and navigation, followed by the Internet-of-Things, generating vast amounts of information that can be used to optimize farming processes in real time. Statistical tools from data mining, predictive modeling, and machine learning analyze patterns in historical data to make predictions about future events and to support intelligent actions. This special issue presents the latest developments in statistical inference, machine learning and optimum control for precision farming.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

2007.05493

Country:

Asia > Taiwan (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Nebraska (0.04)
(14 more...)

Genre: Research Report (0.83)

Industry: Food & Agriculture > Agriculture (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Conditional gradient methods for stochastically constrained convex minimization

Vladarean, Maria-Luiza, Alacaoglu, Ahmet, Hsieh, Ya-Ping, Cevher, Volkan

arXiv.org Machine Learning, Jul-7-2020

We propose two novel conditional gradient-based methods for solving structured stochastic convex optimization problems with a large number of linear constraints. Instances of this template naturally arise from SDP-relaxations of combinatorial problems, which involve a number of constraints that is polynomial in the problem dimension. The most important feature of our framework is that only a subset of the constraints is processed at each iteration, thus gaining a computational advantage over prior works that require full passes. Our algorithms rely on variance reduction and smoothing used in conjunction with conditional gradient steps, and are accompanied by rigorous convergence guarantees. Preliminary numerical experiments are provided to illustrate the practical performance of the methods.
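As background for the term "conditional gradient steps", here is a minimal sketch of the classical conditional gradient (Frank-Wolfe) method on a toy problem: minimizing a quadratic over the probability simplex. It is not either of the two methods proposed in the paper and includes no stochastic constraints, variance reduction, or smoothing. The objective, the step-size rule 2/(k+2), the function name, and the target vector are assumptions chosen for illustration.

```python
def frank_wolfe_simplex(target, iters=5000):
    """Minimize 0.5 * ||x - target||^2 over the probability simplex."""
    n = len(target)
    x = [1.0 / n] * n  # feasible start: the barycenter of the simplex
    for k in range(iters):
        grad = [xi - ti for xi, ti in zip(x, target)]
        # Linear minimization oracle: over the simplex, <grad, s> is minimized
        # by the vertex e_i with the smallest gradient coordinate.
        i = min(range(n), key=lambda j: grad[j])
        step = 2.0 / (k + 2.0)
        # Convex combination of x and the vertex e_i keeps x feasible,
        # so no projection is ever needed.
        x = [(1.0 - step) * xj for xj in x]
        x[i] += step
    return x


# Hypothetical target that already lies in the simplex, so it is the minimizer.
target = [0.2, 0.3, 0.5]
x = frank_wolfe_simplex(target)
print(x)
```

The point of the technique is that each iteration only solves a linear problem over the feasible set instead of computing a projection, which is why it is attractive for constraint sets where projections are expensive.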

artificial intelligence, machine learning, optimization problem, (16 more...)

arXiv.org Machine Learning

2007.03795

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(3 more...)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
