Clustering
Next Waves in Veridical Network Embedding
Ward, Owen G., Huang, Zhen, Davison, Andrew, Zheng, Tian
Embedding nodes of a large network into a metric (e.g., Euclidean) space has become an area of active research in statistical machine learning, which has found applications in natural and social sciences. Generally, a representation of a network object is learned in a Euclidean geometry and is then used for subsequent tasks regarding the nodes and/or edges of the network, such as community detection, node classification and link prediction. Network embedding algorithms have been proposed in multiple disciplines, often with domain-specific notations and details. In addition, different measures and tools have been adopted to evaluate and compare the methods proposed under different settings, often dependent of the downstream tasks. As a result, it is challenging to study these algorithms in the literature systematically. Motivated by the recently proposed Veridical Data Science (VDS) framework, we propose a framework for network embedding algorithms and discuss how the principles of predictability, computability and stability apply in this context. The utilization of this framework in network embedding holds the potential to motivate and point to new directions for future research.
A Performance Guarantee for Spectral Clustering
Boedihardjo, March, Deng, Shaofeng, Strohmer, Thomas
The two-step spectral clustering method, which consists of the Laplacian eigenmap and a rounding step, is a widely used method for graph partitioning. It can be seen as a natural relaxation to the NP-hard minimum ratio cut problem. In this paper we study the central question: when is spectral clustering able to find the global solution to the minimum ratio cut problem? First we provide a condition that naturally depends on the intra- and inter-cluster connectivities of a given partition under which we may certify that this partition is the solution to the minimum ratio cut problem. Then we develop a deterministic two-to-infinity norm perturbation bound for the the invariant subspace of the graph Laplacian that corresponds to the $k$ smallest eigenvalues. Finally by combining these two results we give a condition under which spectral clustering is guaranteed to output the global solution to the minimum ratio cut problem, which serves as a performance guarantee for spectral clustering.
Detecting Malicious Accounts in Permissionless Blockchains using Temporal Graph Properties
Agarwal, Rachit, Barve, Shikhar, Shukla, Sandeep Kuman
The temporal nature of modeling accounts as nodes and transactions as directed edges in a directed graph -- for a blockchain, enables us to understand the behavior (malicious or benign) of the accounts. Predictive classification of accounts as malicious or benign could help users of the permissionless blockchain platforms to operate in a secure manner. Motivated by this, we introduce temporal features such as burst and attractiveness on top of several already used graph properties such as the node degree and clustering coefficient. Using identified features, we train various Machine Learning (ML) algorithms and identify the algorithm that performs the best in detecting which accounts are malicious. We then study the behavior of the accounts over different temporal granularities of the dataset before assigning them malicious tags. For Ethereum blockchain, we identify that for the entire dataset - the ExtraTreesClassifier performs the best among supervised ML algorithms. On the other hand, using cosine similarity on top of the results provided by unsupervised ML algorithms such as K-Means on the entire dataset, we were able to detect 554 more suspicious accounts. Further, using behavior change analysis for accounts, we identify 814 unique suspicious accounts across different temporal granularities.
MAGIC: Multi-scale Heterogeneity Analysis and Clustering for Brain Diseases
Wen, Junhao, Varol, Erdem, Chand, Ganesh, Sotiras, Aristeidis, Davatzikos, Christos
There is a growing amount of clinical, anatomical and functional evidence for the heterogeneous presentation of neuropsychiatric and neurodegenerative diseases such as schizophrenia and Alzheimers Disease (AD). Elucidating distinct subtypes of diseases allows a better understanding of neuropathogenesis and enables the possibility of developing targeted treatment programs. Recent semi-supervised clustering techniques have provided a data-driven way to understand disease heterogeneity. However, existing methods do not take into account that subtypes of the disease might present themselves at different spatial scales across the brain. Here, we introduce a novel method, MAGIC, to uncover disease heterogeneity by leveraging multi-scale clustering. We first extract multi-scale patterns of structural covariance (PSCs) followed by a semi-supervised clustering with double cyclic block-wise optimization across different scales of PSCs. We validate MAGIC using simulated heterogeneous neuroanatomical data and demonstrate its clinical potential by exploring the heterogeneity of AD using T1 MRI scans of 228 cognitively normal (CN) and 191 patients. Our results indicate two main subtypes of AD with distinct atrophy patterns that consist of both fine-scale atrophy in the hippocampus as well as large-scale atrophy in cortical regions. The evidence for the heterogeneity is further corroborated by the clinical evaluation of two subtypes, which indicates that there is a subpopulation of AD patients that tend to be younger and decline faster in cognitive performance relative to the other subpopulation, which tends to be older and maintains a relatively steady decline in cognitive abilities.
Fair Algorithms for Hierarchical Agglomerative Clustering
Chhabra, Anshuman, Mohapatra, Prasant
Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science and machine learning, and seek to partition the dataset into clusters while generating a hierarchical relationship between the data samples themselves. HAC algorithms are employed in a number of applications, such as biology, natural language processing, and recommender systems. Thus, it is imperative to ensure that these algorithms are fair-- even if the dataset contains biases against certain protected groups, the cluster outputs generated should not be discriminatory against samples from any of these groups. However, recent work in clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. Therefore, in this paper, we propose fair algorithms for performing HAC that enforce fairness constraints 1) irrespective of the distance linkage criteria used, 2) generalize to any natural measures of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have competitive running times to vanilla HAC. To the best of our knowledge, this is the first work that studies fairness for HAC algorithms. We also propose an algorithm with lower asymptotic time complexity than HAC algorithms that can rectify existing HAC outputs and make them subsequently fair as a result. Moreover, we carry out extensive experiments on multiple real-world UCI datasets to demonstrate the working of our algorithms.
Model-based Clustering using Automatic Differentiation: Confronting Misspecification and High-Dimensional Data
Kasa, Siva Rajesh, Rajan, Vaibhav
We study two practically important cases of model based clustering using Gaussian Mixture Models: (1) when there is misspecification and (2) on high dimensional data, in the light of recent advances in Gradient Descent (GD) based optimization using Automatic Differentiation (AD). Our simulation studies show that EM has better clustering performance, measured by Adjusted Rand Index, compared to GD in cases of misspecification, whereas on high dimensional data GD outperforms EM. We observe that both with EM and GD there are many solutions with high likelihood but poor cluster interpretation. To address this problem we design a new penalty term for the likelihood based on the Kullback Leibler divergence between pairs of fitted components. Closed form expressions for the gradients of this penalized likelihood are difficult to derive but AD can be done effortlessly, illustrating the advantage of AD-based optimization. Extensions of this penalty for high dimensional data and for model selection are discussed. Numerical experiments on synthetic and real datasets demonstrate the efficacy of clustering using the proposed penalized likelihood approach.
Multi-Manifold Learning for Large-scale Targeted Advertising System
Shin, Kyuyong, Park, Young-Jin, Kim, Kyung-Min, Kwon, Sunyoung
Messenger advertisements (ads) give direct and personal user experience yielding high conversion rates and sales. However, people are skeptical about ads and sometimes perceive them as spam, which eventually leads to a decrease in user satisfaction. Targeted advertising, which serves ads to individuals who may exhibit interest in a particular advertising message, is strongly required. The key to the success of precise user targeting lies in learning the accurate user and ad representation in the embedding space. Most of the previous studies have limited the representation learning in the Euclidean space, but recent studies have suggested hyperbolic manifold learning for the distinct projection of complex network properties emerging from real-world datasets such as social networks, recommender systems, and advertising. We propose a framework that can effectively learn the hierarchical structure in users and ads on the hyperbolic space, and extend to the Multi-Manifold Learning. Our method constructs multiple hyperbolic manifolds with learnable curvatures and maps the representation of user and ad to each manifold. The origin of each manifold is set as the centroid of each user cluster. The user preference for each ad is estimated using the distance between two entities in the hyperbolic space, and the final prediction is determined by aggregating the values calculated from the learned multiple manifolds. We evaluate our method on public benchmark datasets and a large-scale commercial messenger system LINE, and demonstrate its effectiveness through improved performance.
Community detection and Social Network analysis based on the Italian wars of the 15th century
Fumanal-Idocin, J., Alonso-Betanzos, A., Cordรณn, O., Bustince, H., Minรกrovรก, M.
In this contribution we study social network modelling by using human interaction as a basis. To do so, we propose a new set of functions, affinities, designed to capture the nature of the local interactions among each pair of actors in a network. By using these functions, we develop a new community detection algorithm, the Borgia Clustering, where communities naturally arise from the multi-agent interaction in the network. We also discuss the effects of size and scale for communities regarding this case, as well as how we cope with the additional complexity present when big communities arise. Finally, we compare our community detection solution with other representative algorithms, finding favourable results.
Learning from Data to Optimize Control in Precision Farming
Kocian, Alexander, Incrocci, Luca
Precision farming is one way of many to meet a 70 percent increase in global demand for agricultural products on current agricultural land by 2050 at reduced need of fertilizers and efficient use of water resources. The catalyst for the emergence of precision farming has been satellite positioning and navigation followed by Internet-of-Things, generating vast information that can be used to optimize farming processes in real-time. Statistical tools from data mining, predictive modeling, and machine learning analyze pattern in historical data, to make predictions about future events as well as intelligent actions. This special issue presents the latest development in statistical inference, machine learning and optimum control for precision farming.
Conditional gradient methods for stochastically constrained convex minimization
Vladarean, Maria-Luiza, Alacaoglu, Ahmet, Hsieh, Ya-Ping, Cevher, Volkan
We propose two novel conditional gradient-based methods for solving structured stochastic convex optimization problems with a large number of linear constraints. Instances of this template naturally arise from SDP-relaxations of combinatorial problems, which involve a number of constraints that is polynomial in the problem dimension. The most important feature of our framework is that only a subset of the constraints is processed at each iteration, thus gaining a computational advantage over prior works that require full passes. Our algorithms rely on variance reduction and smoothing used in conjunction with conditional gradient steps, and are accompanied by rigorous convergence guarantees. Preliminary numerical experiments are provided for illustrating the practical performance of the methods.