Clustering
MGTCOM: Community Detection in Multimodal Graphs
Dmitriev, E., Chekol, M. W., Wang, S.
Community detection is the task of discovering groups of nodes sharing similar patterns within a network. With recent advancements in deep learning, methods utilizing graph representation learning and deep clustering have shown great results in community detection. However, these methods often rely on the topology of networks (i) ignoring important features such as network heterogeneity, temporality, multimodality, and other possibly relevant features. Besides, (ii) the number of communities is not known a priori and is often left to model selection. In addition, (iii) in multimodal networks all nodes are assumed to be symmetrical in their features; while true for homogeneous networks, most of the real-world networks are heterogeneous where feature availability often varies. In this paper, we propose a novel framework (named MGTCOM) that overcomes the above challenges (i)--(iii). MGTCOM identifies communities through multimodal feature learning by leveraging a new sampling technique for unsupervised learning of temporal embeddings. Importantly, MGTCOM is an end-to-end framework optimizing network embeddings, communities, and the number of communities in tandem. In order to assess its performance, we carried out an extensive evaluation on a number of multimodal networks. We found out that our method is competitive against state-of-the-art and performs well in inductive inference.
10 Most Used Machine Learning Algorithms In Python With Code
Understanding what Artificial Intelligence is and learning how Machine Learning and Deep Learning power it, are overwhelming experiences. In ML, there is something called the "No Free Lunch" theorems which states that no machine learning algorithms works best for every problem, and it's particularly relevant for supervised learning. For example, you can't say that decision trees are always better than neural networks or vice-versa. There are various factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem, while using a hold-out "test set" of data to decide performance and select the winning algorithm.
An Empirical Study on Clustering Pretrained Embeddings: Is Deep Strictly Better?
Scott, Tyler R., Liu, Ting, Mozer, Michael C., Gallagher, Andrew C.
Recent research in clustering face embeddings has found that unsupervised, shallow, heuristic-based methods -- including $k$-means and hierarchical agglomerative clustering -- underperform supervised, deep, inductive methods. While the reported improvements are indeed impressive, experiments are mostly limited to face datasets, where the clustered embeddings are highly discriminative or well-separated by class (Recall@1 above 90% and often nearing ceiling), and the experimental methodology seemingly favors the deep methods. We conduct a large-scale empirical study of 17 clustering methods across three datasets and obtain several robust findings. Notably, deep methods are surprisingly fragile for embeddings with more uncertainty, where they match or even perform worse than shallow, heuristic-based methods. When embeddings are highly discriminative, deep methods do outperform the baselines, consistent with past results, but the margin between methods is much smaller than previously reported. We believe our benchmarks broaden the scope of supervised clustering methods beyond the face domain and can serve as a foundation on which these methods could be improved. To enable reproducibility, we include all necessary details in the appendices, and plan to release the code.
Spatiotemporal k-means
Dorabiala, Olga, Webster, Jennifer, Kutz, Nathan, Aravkin, Aleksandr
The widespread use of sensor and data acquisition technologies, including IOT, GPS, RFID, LIDAR, satellite, and cellular networks allows for, among other applications, the continuous monitoring of the positions of moving objects of interest. These technologies create rich spatiotemporal data that is found across many scientific and real-world domains including ecologists' studies of collective animal behavior [13], the surveillance of large groups of people for suspicious activity [17], and traffic management [12]. Often, the data collected is large and unlabeled, motivating the development of unsupervised learning methods that can efficiently extract information about object behavior with no human supervision. In this study, we propose a method of spatiotemporal k-means (STKM) clustering that is able to analyze the multi-scale relationships within spatiotemporal data. Clustering is a major unsupervised data mining tool used to gain insight from unlabeled data by grouping objects based on some similarity measure [6, 11]. The most common methods for unsupervised clustering include k-means, Gaussian mixture models, and hierarchical clustering [18], all of which are workhorse algorithms for the data science industry.
How Hierarchical Clustering works part1
Abstract: The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure -- unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not drastically increase above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage's speed.
Graph Summarization via Node Grouping: A Spectral Algorithm
Merchant, Arpit, Mathioudakis, Michael, Wang, Yanhao
Graph summarization via node grouping is a popular method to build concise graph representations by grouping nodes from the original graph into supernodes and encoding edges into superedges such that the loss of adjacency information is minimized. Such summaries have immense applications in large-scale graph analytics due to their small size and high query processing efficiency. In this paper, we reformulate the loss minimization problem for summarization into an equivalent integer maximization problem. By initially allowing relaxed (fractional) solutions for integer maximization, we analytically expose the underlying connections to the spectral properties of the adjacency matrix. Consequently, we design an algorithm called SpecSumm that consists of two phases. In the first phase, motivated by spectral graph theory, we apply k-means clustering on the k largest (in magnitude) eigenvectors of the adjacency matrix to assign nodes to supernodes. In the second phase, we propose a greedy heuristic that updates the initial assignment to further improve summary quality. Finally, via extensive experiments on 11 datasets, we show that SpecSumm efficiently produces high-quality summaries compared to state-of-the-art summarization algorithms and scales to graphs with millions of nodes.
Clustering of countries based on the associated social contact patterns in epidemiological modelling
Korir, Evans Kiptoo, Vizi, Zsolt
Mathematical models have been used to understand the spread patterns of infectious diseases such as Coronavirus Disease 2019 (COVID-19). The transmission component of the models can be modelled in an age-dependent manner via introducing contact matrix for the population, which describes the contact rates between the age groups. Since social contact patterns vary from country to country, we can compare and group the countries using the corresponding contact matrices. In this paper, we present a framework for clustering countries based on their contact matrices with respect to an underlying epidemic model. Since the pipeline is generic and modular, we demonstrate its application in a COVID-19 model from R\"ost et. al. which gives a hint about which countries can be compared in a pandemic situation, when only non-pharmaceutical interventions are available.
An Incremental Phase Mapping Approach for X-ray Diffraction Patterns using Binary Peak Representations
Jha, Dipendra, Narayanachari, K. V. L. V., Zhang, Ruifeng, Liao, Justin, Keane, Denis T., Liao, Wei-keng, Choudhary, Alok, Chung, Yip-Wah, Bedzyk, Michael, Agrawal, Ankit
Despite the huge advancement in knowledge discovery and data mining techniques, the X-ray diffraction (XRD) analysis process has mostly remained untouched and still involves manual investigation, comparison, and verification. Due to the large volume of XRD samples from high-throughput XRD experiments, it has become impossible for domain scientists to process them manually. Recently, they have started leveraging standard clustering techniques, to reduce the XRD pattern representations requiring manual efforts for labeling and verification. Nevertheless, these standard clustering techniques do not handle problem-specific aspects such as peak shifting, adjacent peaks, background noise, and mixed phases; hence, resulting in incorrect composition-phase diagrams that complicate further steps. Here, we leverage data mining techniques along with domain expertise to handle these issues. In this paper, we introduce an incremental phase mapping approach based on binary peak representations using a new threshold based fuzzy dissimilarity measure. The proposed approach first applies an incremental phase computation algorithm on discrete binary peak representation of XRD samples, followed by hierarchical clustering or manual merging of similar pure phases to obtain the final composition-phase diagram. We evaluate our method on the composition space of two ternary alloy systems- Co-Ni-Ta and Co-Ti-Ta. Our results are verified by domain scientists and closely resembles the manually computed ground-truth composition-phase diagrams. The proposed approach takes us closer towards achieving the goal of complete end-to-end automated XRD analysis.
Minimalist Data Wrangling with Python
Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results. This textbook is a non-profit project. Its online and PDF versions are freely available at https://datawranglingpy.gagolewski.com/.
Optimal Graph Filters for Clustering Attributed Graphs
Ortiz-Bouza, Meiby, Aviyente, Selin
The second class of methods uses graph embedding, Many real-world systems can be represented as graphs where the in particular graph autoencoders. Graph convolutional network different entities are presented by nodes and their interactions by (GCN) based methods such as Graph autoencoders (GAE) [13], variational edges. An important task in studying large datasets is graph clustering. GAE (VGAE) [14], adversarially regularized graph autoencoder While there has been a lot of work on graph clustering using the (ARGA), adversarially regularized variational graph autoencoder connectivity between the nodes, many real-world networks also have (ARVGA) [15] and marginalized graph autoencoder for graph node attributes. Clustering attributed graphs requires joint modeling clustering (MGAE) [16] have demonstrated state-of-the-art performance of graph structure and node attributes. Recent work has focused on on several attributed graph clustering tasks. These methods graph convolutional networks and graph convolutional filters to combine directly use GCN as a feature extractor, where each convolutional structural and content information. However, these methods are layer is equivalent to a first-order low-pass graph filter that only takes mostly limited to lowpass filtering and do not explicitly optimize the the most immediate neighbors of each node into account.