

 Clustering


Hierarchical Stochastic Graphlet Embedding for Graph-based Pattern Recognition

arXiv.org Machine Learning

Despite being very successful within the pattern recognition and machine learning community, graph-based methods are often unusable with many machine learning tools. This is because most mathematical operations are incompatible with the graph domain. Graph embedding has been proposed as a way to tackle these difficulties: it maps graphs to a vector space and makes standard machine learning techniques applicable to them. However, it is well known that graph embedding techniques usually suffer from a loss of structural information. In this paper, given a graph, we consider its hierarchical structure for mapping it into a vector space. The hierarchical structure is constructed by topologically clustering the graph nodes and considering each cluster as a node in the upper hierarchical level. Once this hierarchical structure of the graph is constructed, we consider various configurations of its parts and use stochastic graphlet embedding (SGE) to map them into a vector space. Broadly speaking, SGE produces a distribution of uniformly sampled low- to high-order graphlets as a way to embed graphs into the vector space. The coarse-to-fine structure of a graph hierarchy and the statistics obtained from the distribution of low- to high-order stochastic graphlets complement each other and include important structural information with varied contexts. Altogether, these two techniques substantially reduce the usual information loss involved in graph embedding techniques, and it is not a surprise that we obtain a more robust vector space embedding of graphs. This has been corroborated through a detailed experimental evaluation on various benchmark graph datasets, where we outperform the state-of-the-art methods.


The modal age of Statistics

arXiv.org Machine Learning

The mean-median-mode trio comprises the three most frequently used measures of central tendency of a dataset. They are taught in the very first classes of any course on basic Statistics. However, they do not share the same degree of importance: the sample mean (or average) is normally well understood and employed in everyday situations, the sample median is also useful and easy to visualize, but the mode, usually defined as the value of the dataset having the highest frequency of appearance, looks like a more bizarre measure of location. This uneven treatment was already noted by Dalenius (1965), and it persists today to some extent. Indeed, when the dataset consists of realizations of a continuous random variable, all the observed values are different with probability one and, therefore, the mode does not even make much sense.
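The last point can be checked with a small sketch. It draws values from a continuous distribution and confirms that no exact value repeats, so the highest-frequency definition of the mode singles out nothing. The histogram-based estimate that follows is one common workaround and is an assumption of this example, not a method taken from the article; the function name `histogram_mode` and the bin width of 0.5 are illustrative choices.

```python
import math
import random
from collections import Counter

random.seed(0)
# 1000 draws from a standard normal (a continuous distribution).
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Highest number of times any exact value occurs in the sample.
max_count = max(Counter(sample).values())


def histogram_mode(values, bin_width):
    """Return the midpoint of the most populated bin (a crude mode estimate)."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    best_bin = max(bins, key=bins.get)
    return (best_bin + 0.5) * bin_width


estimate = histogram_mode(sample, bin_width=0.5)
```

Here `max_count` reports how often the most frequent exact value occurs, while `estimate` locates the densest region of the sample instead. The result depends on the chosen bin width.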


Certifying Global Optimality of Graph Cuts via Semidefinite Relaxation: A Performance Guarantee for Spectral Clustering

arXiv.org Machine Learning

Spectral clustering has become one of the most widely used clustering techniques when the structure of the individual clusters is non-convex or highly anisotropic. Yet, despite its immense popularity, there exists fairly little theory about performance guarantees for spectral clustering. This issue is partly due to the fact that spectral clustering typically involves two steps which complicate its theoretical analysis: first, the eigenvectors of the associated graph Laplacian are used to embed the dataset, and second, the k-means clustering algorithm is applied to the embedded dataset to get the labels. This paper is devoted to the theoretical foundations of spectral clustering and graph cuts. We consider a convex relaxation of graph cuts, namely ratio cuts and normalized cuts, that makes the usual two-step approach of spectral clustering obsolete and at the same time gives rise to a rigorous theoretical analysis of graph cuts and spectral clustering. We derive deterministic bounds for successful spectral clustering via a spectral proximity condition that naturally depends on the algebraic connectivity of each cluster and the inter-cluster connectivity. Moreover, we demonstrate by means of some popular examples that our bounds can achieve near-optimality. Our findings are also fundamental for the theoretical understanding of kernel k-means. Numerical simulations confirm and complement our analysis.
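To make the two graph-cut objectives concrete, the sketch below evaluates the ratio cut and the normalized cut of a given partition on a small weighted graph. It only scores a fixed labelling and does not implement the paper's semidefinite relaxation. The function name `cut_scores`, the edge-dictionary input format, and the convention without a factor of 1/2 are assumptions of this example, and every cluster is assumed to contain at least one edge endpoint.

```python
def cut_scores(weights, labels):
    """Score a partition of an undirected weighted graph.

    weights: dict mapping an edge (i, j) to a nonnegative weight,
             with each undirected edge listed once.
    labels:  list where labels[i] is the cluster of node i.
    Returns (ratio_cut, normalized_cut).
    """
    clusters = set(labels)
    cut = {c: 0.0 for c in clusters}   # weight leaving each cluster
    vol = {c: 0.0 for c in clusters}   # sum of node degrees per cluster
    size = {c: 0 for c in clusters}    # number of nodes per cluster

    for lab in labels:
        size[lab] += 1

    for (i, j), w in weights.items():
        vol[labels[i]] += w
        vol[labels[j]] += w
        if labels[i] != labels[j]:
            cut[labels[i]] += w
            cut[labels[j]] += w

    ratio_cut = sum(cut[c] / size[c] for c in clusters)
    normalized_cut = sum(cut[c] / vol[c] for c in clusters)
    return ratio_cut, normalized_cut


# Two triangles, {0, 1, 2} and {3, 4, 5}, joined by the single edge (2, 3).
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 1.0}

good = cut_scores(edges, [0, 0, 0, 1, 1, 1])  # split along the bridge
bad = cut_scores(edges, [0, 0, 1, 1, 1, 1])   # split through a triangle
```

Both scores are lower for the split along the bridge, which is the partition a cut-based method should prefer on this graph.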


Temporal graph-based clustering for historical record linkage

arXiv.org Artificial Intelligence

Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains are linked and integrated to allow advanced analytics. Popular types of data used in such a context are historical censuses, as well as birth, death, and marriage certificates. Individually, however, such data sets limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees spanning several decades are available, it is possible to, for example, investigate how education, health, mobility, employment, and social status influence each other and the lives of people over two or even more generations. A major challenge, however, is the accurate linkage of historical data sets, which is difficult because of poor data quality and, commonly, the lack of ground truth data. Unsupervised techniques need to be employed, which can be based on similarity graphs generated by comparing individual records. In this paper we present initial results from clustering birth records from Scotland where we aim to identify all births of the same mother and group siblings into clusters. We extend an existing clustering technique for record linkage by incorporating temporal constraints that must hold between births by the same mother, and propose a novel greedy temporal clustering technique. Experimental results show improvements over non-temporal approaches; however, further work is needed to obtain links of high quality.


Supervised Fuzzy Partitioning

arXiv.org Machine Learning

Centroid-based methods including k-means and fuzzy c-means are known as effective and easy-to-implement approaches to clustering in many areas of application. However, these algorithms cannot be directly applied to supervised tasks. We propose a generative model extending centroid-based clustering approaches to be applicable to classification tasks. Given an arbitrary loss function, our approach, termed Supervised Fuzzy Partitioning (SFP), incorporates label information into its objective function through a surrogate term penalizing the risk. We also fuzzify the partition and assign weights to features alongside entropy-based regularization terms, enabling the method to capture more complex data structure, to identify significant features, and to yield better performance on high-dimensional data. An iterative algorithm based on a block coordinate descent scheme was formulated to efficiently find a local optimizer. The results show that the performance of SFP in classification of ultra high-dimensional gene expression data is competitive with state-of-the-art algorithms such as random forest and SVM. Our method has a major advantage over such methods in that it not only leads to a flexible model suitable for high-dimensional cases but also uses the loss function in the training phase without compromising computational efficiency.


Clustering with Temporal Constraints on Spatio-Temporal Data of Human Mobility

arXiv.org Machine Learning

Extracting significant places or places of interest (POIs) using individuals' spatio-temporal data is of fundamental importance for human mobility analysis. Classical clustering methods have been used in prior work for detecting POIs, but without considering temporal constraints. Usually, the involved parameters for clustering are difficult to determine, e.g., the optimal cluster number in hierarchical clustering. Currently, researchers either choose heuristic values or use spatial distance-based optimization to determine an appropriate parameter set. We argue that existing research does not optimally address temporal information and thus leaves much room for improvement. Considering temporal constraints in human mobility, we introduce an effective clustering approach, namely POI clustering with temporal constraints (PC-TC), to extract POIs from spatio-temporal data of human mobility. Following the nature of human mobility in modern society, our approach aims to extract both global POIs (e.g., workplace or university) and local POIs (e.g., library, lab, and canteen). Based on two publicly available datasets including 193 individuals, our evaluation results show that PC-TC has much potential for next place prediction in terms of granularity (i.e., the number of extracted POIs) and predictability.


On embeddings as an alternative paradigm for relational learning

arXiv.org Artificial Intelligence

Many real-world domains can be expressed as graphs and, more generally, as multi-relational knowledge graphs. Though reasoning and learning with knowledge graphs have traditionally been addressed by symbolic approaches, recent methods in (deep) representation learning have shown promising results for specialized tasks such as knowledge base completion. These approaches abandon the traditional symbolic paradigm by replacing symbols with vectors in Euclidean space. With few exceptions, symbolic and distributional approaches are explored in different communities and little is known about their respective strengths and weaknesses. In this work, we compare representation learning and relational learning on various relational classification and clustering tasks, and analyse the complexity of the rules used implicitly by these approaches. Preliminary results reveal possible indicators that could help in choosing one approach over the other for particular knowledge graphs.


Comparing Graph Clusterings: Set partition measures vs. Graph-aware measures

arXiv.org Machine Learning

An impressive number of graph clustering algorithms have been proposed, studied and compared over the past decades [4,10,17,19,21,23,25]. To identify better graph clustering techniques, one needs a way to score the techniques against one another. A typical method is to compare values of some similarity measure between ground truth partitions of given graphs and the partitions produced by the different algorithms on those graphs. However, the choice of the similarity measure used is crucial and has a huge impact on the conclusions made. In graph clustering comparison studies [8,13,18,28], set partition similarities are used as accuracy measures.
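As an illustration of a set partition similarity, the sketch below computes the Rand index between a ground truth partition and a partition produced by an algorithm. The Rand index is a standard measure of this kind, but choosing it here is an assumption of this example, since the excerpt does not name a specific measure. The function name `rand_index` is illustrative, and at least two nodes are assumed.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two partitions agree.

    A pair agrees when both partitions put the two nodes in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same nodes")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same grouping, different label names
crossed = [0, 1, 0, 1]      # splits both true clusters

same_score = rand_index(truth, relabelled)
crossed_score = rand_index(truth, crossed)
```

The measure looks only at which nodes are grouped together and never at the edges of the graph, which is what distinguishes set partition measures from graph-aware ones.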


Grapevine: A Wine Prediction Algorithm Using Multi-dimensional Clustering Methods

arXiv.org Machine Learning

Wine has incredible diversity; there exist over 10,000 different varieties of wine grapes worldwide, and each can be processed in a hundred thousand unique ways. Sommeliers, those who dedicate their lives to the art of wine tasting, work to craft flavor profiles for the wines they are given to analyze, using their extensive experience to provide nuanced evaluations of countless bottles of wine every year. But the majority of people have neither the time nor the money to try a variety of wines and develop their palate. Typically, the only claim one can make about a given glass of wine is whether or not it was enjoyable, and without the ability to identify one's taste preferences in wine, it is incredibly difficult to discover new wine, and nearly impossible to find wine that directly matches one's individual flavor profile. We hope to develop an algorithm to address both of these issues, becoming a personal sommelier for the user. Our algorithm takes a history of the wine a user has tasted as input, and returns a set of optimal wines for the user to try next, as well as a description of the flavor profile that inspired the recommendations. Thus, the algorithm could become an avenue for the user to confidently explore wine, and understand more quickly what they do and do not like in wine. Formally, we define our problem as an unsupervised learning problem.


Global Bigdata Conference

#artificialintelligence

On the other hand, K-Means has a couple of disadvantages. Firstly, you have to select how many groups/classes there are. This isn't always trivial, and ideally we'd want a clustering algorithm to figure that out for us, because the point of it is to gain some insight from the data. K-Means also starts with a random choice of cluster centers, and therefore it may yield different clustering results on different runs of the algorithm. Thus, the results may not be repeatable and may lack consistency.
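Both disadvantages show up in a minimal sketch of the algorithm. The version below works on one-dimensional points, takes the number of clusters `k` as a required input, and picks its initial centers at random. The function name `kmeans_1d`, the fixed iteration count, and the toy data are assumptions of this example and are not taken from any particular library.

```python
import random


def kmeans_1d(points, k, seed, iterations=20):
    """Plain K-Means on 1-D points; returns the final centers, sorted."""
    rng = random.Random(seed)
    # Random initialization: k distinct data points become the first centers.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return sorted(centers)


points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]

# The same seed reproduces a run; different seeds may end at different centers.
run_a = kmeans_1d(points, k=3, seed=1)
run_b = kmeans_1d(points, k=3, seed=1)
distinct_results = {tuple(kmeans_1d(points, k=3, seed=s)) for s in range(20)}
```

Fixing the seed makes a single run repeatable. If `distinct_results` holds more than one entry, the random initialization has led some runs to a different local optimum, which is why K-Means is commonly restarted several times with the best result kept.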