Clustering
Community structure: A comparative evaluation of community detection methods
Dao, Vinh-Loc, Bothorel, Cécile, Lenca, Philippe
Discovering community structure in complex networks is a mature field since a tremendous number of community detection methods have been introduced in the literature. Nevertheless, it is still very challenging for practioners to determine which method would be suitable to get insights into the structural information of the networks they study. Many recent efforts have been devoted to investigating various quality scores of the community structure, but the problem of distinguishing between different types of communities is still open. In this paper, we propose a comparative, extensive and empirical study to investigate what types of communities many state-of-the-art and well-known community detection methods are producing. Specifically, we provide comprehensive analyses on computation time, community size distribution, a comparative evaluation of methods according to their optimisation schemes as well as a comparison of their partioning strategy through validation metrics. We process our analyses on a very large corpus of hundreds of networks from five different network categories and propose ways to classify community detection methods, helping a potential user to navigate the complex landscape of community detection.
A comparison of cluster algorithms as applied to unsupervised surveys
Garwood, Kathleen Campbell, D., Ph., Dhobale, Arpit Arun
Often survey analysis collects data to try to identify response patterns leading to groupings of respondents with different characteristics as revealed by answers provided to survey questions. Without additional background information on respondents, it is often very difficult (and many times impossible) to verify the accuracy of groupings resulting from the analysis. This paper examines one such situation in which high school students in low-income neighbourhood schools in Bolivia responded to a standard periodic institutional survey and responses were analysed to better understand respondents' socioeconomic contexts. In this case study, the question to be answered was "can we identify the most impoverished students based on a 22 questions standard survey alone?". With no known dependent variable and an inability to objectively capture the socioeconomic condition of the students being surveyed, the task of coming to a conclusive answer becomes unfeasible as there is no way to validate at least some portion of the students identified as most impoverished.
Kernel Treelets
Xia, Hedi, Ceniceros, Hector D.
Treelets, introduced by Lee, Nadler, and Wasserman [1, 2], is a method to produce a multiscale, hierarchicaldecomposition of unordered data. The central premise of Treelets is to exploit sparsity and capture intrinsic localized structures with only a few features, represented interms of an orthonormal basis. The hierarchical tree constructed by the treelet algorithm provides a scale-based partition of the data that can be used for classification, specially for cluster analysis [3]. Cluster analysis, also called clustering, is concerned with finding a partition of a set such that its corresponding equivalence class captures similarity of its elements. The Treelet approach is an example of hierarchical clustering (HC) [4], which is a type of methods that provides a nested and multiscale clustering.
Deep Density-based Image Clustering
Ren, Yazhou, Wang, Ni, Li, Mingxia, Xu, Zenglin
Recently, deep clustering, which is able to perform feature learning that favors clustering tasks via deep neural networks, has achieved remarkable performance in image clustering applications. However, the existing deep clustering algorithms generally need the number of clusters in advance, which is usually unknown in real-world tasks. In addition, the initial cluster centers in the learned feature space are generated by $k$-means. This only works well on spherical clusters and probably leads to unstable clustering results. In this paper, we propose a two-stage deep density-based image clustering (DDC) framework to address these issues. The first stage is to train a deep convolutional autoencoder (CAE) to extract low-dimensional feature representations from high-dimensional image data, and then apply t-SNE to further reduce the data to a 2-dimensional space favoring density-based clustering algorithms. The second stage is to apply the developed density-based clustering technique on the 2-dimensional embedded data to automatically recognize an appropriate number of clusters with arbitrary shapes. Concretely, a number of local clusters are generated to capture the local structures of clusters, and then are merged via their density relationship to form the final clustering result. Experiments demonstrate that the proposed DDC achieves comparable or even better clustering performance than state-of-the-art deep clustering methods, even though the number of clusters is not given.
Ramp-based Twin Support Vector Clustering
Wang, Zhen, Chen, Xu, Li, Chun-Na, Shao, Yuan-Hai
Traditional plane-based clustering methods measure the cost of within-cluster and between-cluster by quadratic, linear or some other unbounded functions, which may amplify the impact of cost. This letter introduces a ramp cost function into the plane-based clustering to propose a new clustering method, called ramp-based twin support vector clustering (RampTWSVC). RampTWSVC is more robust because of its boundness, and thus it is more easier to find the intrinsic clusters than other plane-based clustering methods. The non-convex programming problem in RampTWSVC is solved efficiently through an alternating iteration algorithm, and its local solution can be obtained in a finite number of iterations theoretically. In addition, the nonlinear manifold-based formation of RampTWSVC is also proposed by kernel trick. Experimental results on several benchmark datasets show the better performance of our RampTWSVC compared with other plane-based clustering methods.
A matching based clustering algorithm for categorical data
Gevorgyan, Ruben A., Hakobyan, Yenok B.
Ruben A. Gevorgyan · Y enok B. Hakobyan Abstract Cluster analysis is one of the essential tasks in data mining and knowledge discovery. Each type of data poses unique challenges in achieving relatively efficient partitioning of the data into homogeneous groups. While the algorithms for numeric data are relatively well studied in the literature, there are still challenges to address in case of categorical data. The main issue is the unordered structure of categorical data, which makes the implementation of the standard concepts of clustering algorithms difficult. For instance, the assessment of distance between objects, the selection of representatives for categorical data is not as straightforward as for numeric data. Therefore, this paper presents a new framework for partitioning categorical data, which does not use the distance measure as a key concept. The Matching based clustering algorithm is designed based on the similarity matrix and a framework for updating the latter using the feature importance criteria. The experimental results show this algorithm can serve as an alternative to existing ones and can be an efficient knowledge discovery tool. Keywords categorical data · clustering algorithm · similarity matrix · feature importance Mathematics Subject Classification (2010) 62H30 · 62H17 · 62H20 Ruben A Gevorgyan Faculty of Economics and Management, Y erevan State University, Alex Manukyan 1, 0025 Y erevan, Republic of Armenia Email: rubengevorgyan@ysu.am Y enok B. Hakobyan Faculty of Economics and Management, Y erevan State University, Alex Manukyan 1, 0025 Y erevan, Republic of Armenia Email: e.hakobyan@ysu.am 1 Introduction Cluster analysis is one of the "super problems"s in data mining. Generally speaking, clustering is partitioning data points into intuitively similar groups (Saxena et al. 2017).
On balanced clustering with tree-like structures over clusters
The article addresses balanced clustering problems with an additional requirement as a tree-like structure over the obtained balanced clusters. This kind of clustering problems can be useful in some applications (e.g., network design, management and routing). Various types of the initial elements are considered. Four basic greedy-like solving strategies (design framework) are considered: balancing-spanning strategy, spanning-balancing strategy, direct strategy, and design of layered structures with balancing. An extended description of the spanning-balancing strategy is presented including four solving schemes and an illustrative numerical example.
Learning Graph Representation via Formal Concept Analysis
Yoneda, Yuka, Sugiyama, Mahito, Washio, Takashi
We present a novel method that can learn a graph representation from multivariate data. In our representation, each node represents a cluster of data points and each edge represents the subset-superset relationship between clusters, which can be mutually overlapped. The key to our method is to use formal concept analysis (FCA), which can extract hierarchical relationships between clusters based on the algebraic closedness property. We empirically show that our method can effectively extract hierarchical structures of clusters compared to the baseline method.
Change Point Estimation in a Dynamic Stochastic Block Model
Bhattacharjee, Monika, Banerjee, Moulinath, Michailidis, George
We consider the problem of estimating the location of a single change point in a dynamic stochastic block model. We propose two methods of estimating the change point, together with the model parameters. The first employs a least squares criterion function and takes into consideration the full structure of the stochastic block model and is evaluated at each point in time. Hence, as an intermediate step, it requires estimating the community structure based on a clustering algorithm at every time point. The second method comprises of the following two steps: in the first one, a least squares function is used and evaluated at each time point, but ignores the community structures and just considers a random graph generating mechanism exhibiting a change point. Once the change point is identified, in the second step, all network data before and after it are used together with a clustering algorithm to obtain the corresponding community structures and subsequently estimate the generating stochastic block model parameters. A comparison between these two methods is illustrated. Further, for both methods under their respective identifiability and certain additional regularity conditions, we establish rates of convergence and derive the asymptotic distributions of the change point estimators. The results are illustrated on synthetic data.
Scope of Research on Particle Swarm Optimization Based Data Clustering
Metre, Vishakha A, Deshmukh, Mr Pramod B
Optimization is nothing but a mathematical technique which finds maxima or minima of any function of concern in some realistic region. Different optimization techniques are proposed which are competing for the best solution. Particle Swarm Optimization (PSO) is a new, advanced, and most powerful optimization methodology that performs empirically well on several optimization problems. It is the extensively used Swarm Intelligence (SI) inspired optimization algorithm used for finding the global optimal solution in a multifaceted search region. Data clustering is one of the challenging real world applications that invite the eminent research works in variety of fields. Applicability of different PSO variants to data clustering is studied in the literature, and the analyzed research work shows that, PSO variants give poor results for multidimensional data. This paper describes the different challenges associated with multidimensional data clustering and scope of research on optimizing the clustering problems using PSO. We also propose a strategy to use hybrid PSO variant for clustering multidimensional numerical, text and image data.