Clustering
Latent Dirichlet Allocation
Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA will attempt to group words and documents into a predefined number of clusters (i.e. These topics can then be used to organize and search through documents. The most popular methods for estimating the LDA model is Gibbs sampling. Let's walk through one iteration of the algorithm.
Late Fusion Multi-view Clustering via Global and Local Alignment Maximization
Wang, Siwei, Liu, Xinwang, Zhu, En
Multi-view clustering (MVC) optimally integrates complementary information from different views to improve clustering performance. Although demonstrating promising performance in various applications, most of existing approaches directly fuse multiple pre-specified similarities to learn an optimal similarity matrix for clustering, which could cause over-complicated optimization and intensive computational cost. In this paper, we propose late fusion MVC via alignment maximization to address these issues. To do so, we first reveal the theoretical connection of existing k-means clustering and the alignment between base partitions and the consensus one. Based on this observation, we propose a simple but effective multi-view algorithm termed LF-MVC-GAM. It optimally fuses multiple source information in partition level from each individual view, and maximally aligns the consensus partition with these weighted base ones. Such an alignment is beneficial to integrate partition level information and significantly reduce the computational complexity by sufficiently simplifying the optimization procedure. We then design another variant, LF-MVC-LAM to further improve the clustering performance by preserving the local intrinsic structure among multiple partition spaces. After that, we develop two three-step iterative algorithms to solve the resultant optimization problems with theoretically guaranteed convergence. Further, we provide the generalization error bound analysis of the proposed algorithms. Extensive experiments on eighteen multi-view benchmark datasets demonstrate the effectiveness and efficiency of the proposed LF-MVC-GAM and LF-MVC-LAM, ranging from small to large-scale data items. The codes of the proposed algorithms are publicly available at https://github.com/wangsiwei2010/latefusionalignment.
Interpretable Time Series Clustering Using Local Explanations
Ozyegen, Ozan, Prayogo, Nicholas, Cevik, Mucahit, Basar, Ayse
This study focuses on exploring the use of local interpretability methods for explaining time series clustering models. Many of the state-of-the-art clustering models are not directly explainable. To provide explanations for these clustering algorithms, we train classification models to estimate the cluster labels. Then, we use interpretability methods to explain the decisions of the classification models. The explanations are used to obtain insights into the clustering models. We perform a detailed numerical study to test the proposed approach on multiple datasets, clustering models, and classification models. The analysis of the results shows that the proposed approach can be used to explain time series clustering models, specifically when the underlying classification model is accurate. Lastly, we provide a detailed analysis of the results, discussing how our approach can be used in a real-life scenario.
Evo* 2022 -- Late-Breaking Abstracts Volume
Mora, A. M., Esparcia-Alcรกzar, A. I.
This volume contains the Late-Breaking Abstracts accepted at Evo* 2022 Conference, held in Madrid (Spain), from 20 to 22 of April. They were also presented as short talks as well as at the conference's poster session. The works present ongoing research and preliminary results investigating on the application of different approaches of Evolutionary Computation and other Nature-Inspired techniques to different problems, most of them real world ones. These are very promising contributions, since they outline some of the incoming advances and applications in the area of nature-inspired methods, mainly Evolutionary Algorithms.
A Small Survey On Event Detection Using Twitter
This is evident from popular phenomena such as effects of fake news and online social movements. However the the data obtained from social media presents itself with large volume and velocity, accompanied by significant amount of irrelevant data pertaining to general discussions, personal messages and spam. Social media has been shown to be effective for detecting, forecasting and tracking real world events. The ability to detect real world events is crucial and has applications in disease surveillance, commerce, governance and other areas. Thus extraction of useful information and modelling the characteristics of social media to detect real world events is an important problem. 2 RESEARCH PROBLEM To outline the research problem we need to define events, which has multiple interpretations.
Bayesian nonparametric mixture inconsistency for the number of components: How worried should we be in practice?
Chaumeny, Yannis, Moris, Johan van der Molen, Davison, Anthony C., Kirk, Paul D. W.
We consider the Bayesian mixture of finite mixtures (MFMs) and Dirichlet process mixture (DPM) models for clustering. Recent asymptotic theory has established that DPMs overestimate the number of clusters for large samples and that estimators from both classes of models are inconsistent for the number of clusters under misspecification, but the implications for finite sample analyses are unclear. The final reported estimate after fitting these models is often a single representative clustering obtained using an MCMC summarisation technique, but it is unknown how well such a summary estimates the number of clusters. Here we investigate these practical considerations through simulations and an application to gene expression data, and find that (i) DPMs overestimate the number of clusters even in finite samples, but only to a limited degree that may be correctable using appropriate summaries, and (ii) misspecification can lead to considerable overestimation of the number of clusters in both DPMs and MFMs, but results are nevertheless often still interpretable. We provide recommendations on MCMC summarisation and suggest that although the more appealing asymptotic properties of MFMs provide strong motivation to prefer them, results obtained using MFMs and DPMs are often very similar in practice.
Stochastic Parallelizable Eigengap Dilation for Large Graph Clustering
van der Pol, Elise, Gemp, Ian, Bachrach, Yoram, Everett, Richard
Large graphs commonly appear in social networks, knowledge graphs, recommender systems, life sciences, and decision making problems. Summarizing large graphs by their high level properties is helpful in solving problems in these settings. In spectral clustering, we aim to identify clusters of nodes where most edges fall within clusters and only few edges fall between clusters. This task is important for many downstream applications and exploratory analysis. A core step of spectral clustering is performing an eigendecomposition of the corresponding graph Laplacian matrix (or equivalently, a singular value decomposition, SVD, of the incidence matrix). The convergence of iterative singular value decomposition approaches depends on the eigengaps of the spectrum of the given matrix, i.e., the difference between consecutive eigenvalues. For a graph Laplacian corresponding to a well-clustered graph, the eigenvalues will be non-negative but very small (much less than $1$) slowing convergence. This paper introduces a parallelizable approach to dilating the spectrum in order to accelerate SVD solvers and in turn, spectral clustering. This is accomplished via polynomial approximations to matrix operations that favorably transform the spectrum of a matrix without changing its eigenvectors. Experiments demonstrate that this approach significantly accelerates convergence, and we explain how this transformation can be parallelized and stochastically approximated to scale with available compute.
Introduction to K-means Clustering
This article will answer these questions. Apart from all this, we will also learn more about K-means clustering and its implementation by defining K-means fit function. Clustering is an unsupervised learning technique. It is used to group different data points based on similar features or characteristics. For example, A company wants to know to whom they should display a particular ad such the chances of clicking it increases.
Expanding the class of global objective functions for dissimilarity-based hierarchical clustering
Background In hierarchical clustering, one seeks a recursive partitioning of the data that captures clustering information at different levels of granularity. Classical work on the subject mostly takes an algorithmic perspective. In particular, various iterative clustering methods have been developed, including the well-known bottom-up dissimilarity-based approaches single linkage, average linkage, etc. (see, e.g., [Mur12, Chapter 25]). Recent work on dissimilarity-based hierarchical clustering has emphasized a different, optimization-based, perspective. This has led to the introduction of global objective functions for this classical problem [Das16].
Online Inference for Mixture Model of Streaming Graph Signals with Non-White Excitation
This paper considers a joint multi-graph inference and clustering problem for simultaneous inference of node centrality and association of graph signals with their graphs. We study a mixture model of filtered low pass graph signals with possibly non-white and low-rank excitation. While the mixture model is motivated from practical scenarios, it presents significant challenges to prior graph learning methods. As a remedy, we consider an inference problem focusing on the node centrality of graphs. We design an expectation-maximization (EM) algorithm with a unique low-rank plus sparse prior derived from low pass signal property. We propose a novel online EM algorithm for inference from streaming data. As an example, we extend the online algorithm to detect if the signals are generated from an abnormal graph. We show that the proposed algorithms converge to a stationary point of the maximum-a-posterior (MAP) problem. Numerical experiments support our analysis.