Clustering
Impacts of Dirty Data: and Experimental Evaluation
Qi, Zhixin, Wang, Hongzhi, Li, Jianzhong, Gao, Hong
Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied on the selection of the appropriate algorithm with the consideration of data quality and the determination of the data share to clean. However, rare research has focused on exploring such relationship. Motivated by this, this paper conducts an experimental comparison for the effects of missing, inconsistent and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.
Fast Subspace Clustering Based on the Kronecker Product
Zhou, Lei, Bai, Xiao, Liu, Xianglong, Zhou, Jun, Edwin, Hancock
Subspace clustering is a useful technique for many computer vision applications in which the intrinsic dimension of high-dimensional data is often smaller than the ambient dimension. Spectral clustering, as one of the main approaches to subspace clustering, often takes on a sparse representation or a low-rank representation to learn a block diagonal self-representation matrix for subspace generation. However, existing methods require solving a large scale convex optimization problem with a large set of data, with computational complexity reaches O(N^3) for N data points. Therefore, the efficiency and scalability of traditional spectral clustering methods can not be guaranteed for large scale datasets. In this paper, we propose a subspace clustering model based on the Kronecker product. Due to the property that the Kronecker product of a block diagonal matrix with any other matrix is still a block diagonal matrix, we can efficiently learn the representation matrix which is formed by the Kronecker product of k smaller matrices. By doing so, our model significantly reduces the computational complexity to O(kN^{3/k}). Furthermore, our model is general in nature, and can be adapted to different regularization based subspace clustering methods. Experimental results on two public datasets show that our model significantly improves the efficiency compared with several state-of-the-art methods. Moreover, we have conducted experiments on synthetic data to verify the scalability of our model for large scale datasets.
Analysis of spectral clustering algorithms for community detection: the general bipartite setting
We consider the analysis of spectral clustering algorithms for community detection under a stochastic block model (SBM). A general spectral clustering algorithm consists of three steps: (1) regularization of an appropriate adjacency or Laplacian matrix (2) a form of spectral truncation and (3) a k-means type algorithm in the reduced spectral domain. By varying each step, one can obtain different spectral algorithms. In light of the recent developments in refining consistency results for the spectral clustering, we identify the necessary bounds at each of these three steps, and then derive and compare consistency results for some existing spectral algorithms as well as a new variant that we propose. The focus of the paper is on providing a better understanding of the analysis of spectral methods for community detection, with an emphasis on the bipartite setting which has received less theoretical consideration. We show how the variations in the spectral truncation step reflects in the consistency results under a general SBM. We also investigate the necessary bounds for the k-means step in some detail, allowing one to replace this step with any algorithm (k-means type or otherwise) that guarantees the necessary bound. We discuss some of the neglected aspects of the bipartite setting, e.g., the role of the mismatch between the communities of the two sides on the performance of spectral methods. Finally, we show how the consistency results can be extended beyond SBMs to the problem of clustering inhomogeneous random graph models that can be approximated by SBMs in a certain sense.
A Comprehensive Introduction to Clustering Methods โ Shairoz Sohail โ Medium
When our data is relatively clean and low-dimensional, looking at a table of summary statistics or some scatter plots can usually reveal how good clustering would be on the data. Look for things like large'clumps' of points in scatter plots between features, large variances, large differences between median and mean, properties of data between quantiles etc.
Coordinating Measurements in Uncertain Participatory Sensing Settings
Zenonos, Alexandros, Stein, Sebastian, Jennings, Nicholas R.
Environmental monitoring allows authorities to understand the impact of potentially harmful phenomena, such as air pollution, excessive noise, and radiation. Recently, there has been considerable interest in participatory sensing as a paradigm for such large-scale data collection because it is cost-effective and able to capture more fine-grained data than traditional approaches that use stationary sensors scattered in cities. In this approach, ordinary citizens (non-expert contributors) collect environmental data using low-cost mobile devices. However, these participants are generally self-interested actors that have their own goals and make local decisions about when and where to take measurements. This can lead to highly inefficient outcomes, where observations are either taken redundantly or do not provide sufficient information about key areas of interest. To address these challenges, it is necessary to guide and to coordinate participants, so they take measurements when it is most informative. To this end, we develop a computationally-efficient coordination algorithm (adaptive Best-Match) that suggests to users when and where to take measurements. Our algorithm exploits probabilistic knowledge of human mobility patterns, but explicitly considers the uncertainty of these patterns and the potential unwillingness of people to take measurements when requested to do so. In particular, our algorithm uses a local search technique, clustering and random simulations to map participants to measurements that need to be taken in space and time. We empirically evaluate our algorithm on a real-world human mobility and air quality dataset and show that it outperforms the current state of the art by up to 24% in terms of utility gained.
Applicability and interpretation of the deterministic weighted cepstral distance
Lauwers, Oliver, De Moor, Bart
Quantifying similarity between data objects is an important part of modern data science. Deciding what similarity measure to use is very application dependent. In this paper, we combine insights from systems theory and machine learning, and investigate the weighted cepstral distance, which was previously defined for signals coming from ARMA models. We provide an extension of this distance to invertible deterministic linear time invariant single input single output models, and assess its applicability. We show that it can always be interpreted in terms of the poles and zeros of the underlying model, and that, in the case of stable, minimum-phase, or unstable, maximum-phase models, a geometrical interpretation in terms of subspace angles can be given. We then devise a method to assess stability and phase-type of the generating models, using only input/output signal information. In this way, we prove a connection between the extended weighted cepstral distance and a weighted cepstral model norm. In this way, we provide a purely data-driven way to assess different underlying dynamics of input/output signal pairs, without the need for any system identification step. This can be useful in machine learning tasks such as time series clustering. An iPython tutorial is published complementary to this paper, containing implementations of the various methods and algorithms presented here, as well as some numerical illustrations of the equivalences proven here.
Mixtures of Matrix Variate Bilinear Factor Analyzers
Gallaugher, Michael P. B., McNicholas, Paul D.
Dimensionality is an ever present concern with data becoming increasingly higher dimensional over the last few years. To combat this issue, dimension reduction techniques have become very important tools, especially in the area of clustering (unsupervised classification) as well as semi-supervised and supervised classification. For multivariate data, the mixture of factor analyzers model has proved to be very useful in this regard as the model performs clustering and dimension reduction simultaneously, details in Section 2. However, there is 1 relative paucity in the area of dimension reduction for use in model-based clustering for matrix variate data. Matrix variate distributions have been shown to be useful for modelling three way data such as images and multivariate longitudinal data; however, the methods presented in the literature do suffer from dimensionality concerns. In this paper, we present a mixture of matrix variate bilinear factor analyzers (MMVBFA) model for use in clustering for higher dimensional matrix data. The matrix variate bilinear factor analyzers model can be viewed as a generalization of bilinear principal component analysis (BPCA; Zhao et al. 2012), and contains BPCA as a special case. An alternating expectation conditional maximization (AECM) algorithm (Meng & van Dyk 1997) is used for parameter estimation. The proposed method is illustrated using both simulated and real datasets.
On the Existence of a Sample Mean in Dynamic Time Warping Spaces
Jain, Brijnesh J., Schultz, David
The concept of sample mean in dynamic time warping (DTW) spaces has been successfully applied to improve pattern recognition systems and generalize centroid-based clustering algorithms. Its existence has neither been proved nor challenged. This article presents sufficient conditions for existence of a sample mean in DTW spaces. The proposed result justifies prior work on approximate mean algorithms, sets the stage for constructing exact mean algorithms, and is a first step towards a statistical theory of DTW spaces.
The Power Mean Laplacian for Multilayer Graph Clustering
Mercado, Pedro, Gautier, Antoine, Tudisco, Francesco, Hein, Matthias
The Power Mean Laplacian for Multilayer Graph ClusteringPedro Mercado 1 Antoine Gautier 1 Francesco Tudisco 2 Matthias Hein 1 1 Department of Mathematics and Computer Science, Saarland University, Germany 2 Department of Mathematics and Statistics, University of Strathclyde, G11XH Glasgow, UK Abstract Multilayer graphs encode different kind of interactions between the same set of entities. When one wants to cluster such a multilayer graph, the natural question arises how one should merge the information from different layers. We introduce in this paper a one-parameter family of matrix power means for merging the Laplacians from different layers and analyze it in expectation in the stochastic block model. We show that this family allows to recover ground truth clusters under different settings and verify this in real world data. While computing the matrix power mean can be very expensive for large graphs, we introduce a numerical scheme to efficiently compute its eigenvectors for the ...
Model-Based Clustering and Classification of Functional Data
Chamroukhi, Faicel, Nguyen, Hien D.
The problem of complex data analysis is a central topic of modern statistical science and learning systems and is becoming of broader interest with the increasing prevalence of high-dimensional data. The challenge is to develop statistical models and autonomous algorithms that are able to acquire knowledge from raw data for exploratory analysis, which can be achieved through clustering techniques or to make predictions of future data via classification (i.e., discriminant analysis) techniques. Latent data models, including mixture model-based approaches are one of the most popular and successful approaches in both the unsupervised context (i.e., clustering) and the supervised one (i.e, classification or discrimination). Although traditionally tools of multivariate analysis, they are growing in popularity when considered in the framework of functional data analysis (FDA). FDA is the data analysis paradigm in which the individual data units are functions (e.g., curves, surfaces), rather than simple vectors. In many areas of application, the analyzed data are indeed often available in the form of discretized values of functions or curves (e.g., time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data). This functional aspect of the data adds additional difficulties compared to the case of a classical multivariate (non-functional) data analysis. We review and present approaches for model-based clustering and classification of functional data. We derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering and the classification of these high-dimensional data, including their heterogeneity, missing information, and dynamical hidden structure. The presented models and algorithms are illustrated on real-world functional data analysis problems from several application area.