Clustering
Big Data Meet Cyber-Physical Systems: A Panoramic Survey
Atat, Rachad, Liu, Lingjia, Wu, Jinsong, Li, Guangyu, Ye, Chunxuan, Yi, Yang
The world is witnessing an unprecedented growth of cyber-physical systems (CPS), which are foreseen to revolutionize our world {via} creating new services and applications in a variety of sectors such as environmental monitoring, mobile-health systems, intelligent transportation systems and so on. The {information and communication technology }(ICT) sector is experiencing a significant growth in { data} traffic, driven by the widespread usage of smartphones, tablets and video streaming, along with the significant growth of sensors deployments that are anticipated in the near future. {It} is expected to outstandingly increase the growth rate of raw sensed data. In this paper, we present the CPS taxonomy {via} providing a broad overview of data collection, storage, access, processing and analysis. Compared with other survey papers, this is the first panoramic survey on big data for CPS, where our objective is to provide a panoramic summary of different CPS aspects. Furthermore, CPS {require} cybersecurity to protect {them} against malicious attacks and unauthorized intrusion, which {become} a challenge with the enormous amount of data that is continuously being generated in the network. {Thus, we also} provide an overview of the different security solutions proposed for CPS big data storage, access and analytics. We also discuss big data meeting green challenges in the contexts of CPS.
Semi-crowdsourced Clustering with Deep Generative Models
Luo, Yucen, Tian, Tian, Shi, Jiaxin, Zhu, Jun, Zhang, Bo
We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a small subset of data, i.e., whether a sample pair is in the same cluster. We propose a new approach that includes a deep generative model (DGM) to characterize low-level features of the data, and a statistical relational model for noisy pairwise annotations on its subset. The two parts share the latent variables. To make the model automatically trade-off between its complexity and fitting data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a unified framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.
Probabilistic Multilevel Clustering via Composite Transportation Distance
Ho, Nhat, Huynh, Viet, Phung, Dinh, Jordan, Michael I.
Clustering is a classic and fundamental problem in machine learning. Popular clustering methods such as K-means and the EM algorithm have been the workhorses of exploratory data analysis. However, the underlying model for such methods is a simple flat partition or a mixture model, which do not capture multilevel structures (e.g., words are grouped into documents, documents are grouped into corpora) that arise in many applications in the physical, biological or cognitive sciences. The clustering of multilevel structured data calls for novel methodologies beyond classical clustering. One natural approach for capturing multilevel structures is to use a hierarchy in which data are clustered locally into groups, and those groups are partitioned in a "global clustering." Attempts to develop algorithms of this kind can be roughly classified into two categories. The first category makes use of probabilistic models, often based on Dirichlet process priors.
Time series clustering based on the characterisation of segment typologies
Guijo-Rubio, David, Durรกn-Rosal, Antonio Manuel, Gutiรฉrrez, Pedro Antonio, Troncoso, Alicia, Hervรกs-Martรญnez, Cรฉsar
Time series clustering is the process of grouping time series with respect to their similarity or characteristics. Previous approaches usually combine a specific distance measure for time series and a standard clustering method. However, these approaches do not take the similarity of the different subsequences of each time series into account, which can be used to better compare the time series objects of the dataset. In this paper, we propose a novel technique of time series clustering based on two clustering stages. In a first step, a least squares polynomial segmentation procedure is applied to each time series, which is based on a growing window technique that returns different-length segments. Then, all the segments are projected into same dimensional space, based on the coefficients of the model that approximates the segment and a set of statistical features. After mapping, a first hierarchical clustering phase is applied to all mapped segments, returning groups of segments for each time series. These clusters are used to represent all time series in the same dimensional space, after defining another specific mapping process. In a second and final clustering stage, all the time series objects are grouped. We consider internal clustering quality to automatically adjust the main parameter of the algorithm, which is an error threshold for the segmenta- tion. The results obtained on 84 datasets from the UCR Time Series Classification Archive have been compared against two state-of-the-art methods, showing that the performance of this methodology is very promising.
Fully Supervised Speaker Diarization
Zhang, Aonan, Wang, Quan, Zhu, Zhenyao, Paisley, John, Wang, Chong
In this paper, we propose a fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN). Given extracted speaker-discriminative embeddings (a.k.a. d-vectors) from input utterances, each individual speaker is modeled by a parameter-sharing RNN, while the RNN states for different speakers interleave in the time domain. This RNN is naturally integrated with a distance-dependent Chinese restaurant process (ddCRP) to accommodate an unknown number of speakers. Our system is fully supervised and is able to learn from examples where time-stamped speaker labels are annotated. We achieved a 7.6% diarization error rate on NIST SRE 2000 CALLHOME, which is better than the state-of-the-art method using spectral clustering. Moreover, our method decodes in an online fashion while most state-of-the-art systems rely on offline clustering.
Lossless (and Lossy) Compression of Random Forests
Painsky, Amichai, Rosset, Saharon
Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.
Geometry and clustering with metrics derived from separable Bregman divergences
Gomes-Gonรงalves, Erika, Gzyl, Henryk, Nielsen, Frank
Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances.
Spectral Embedding Norm: Looking Deep into the Spectrum of the Graph Laplacian
The extraction of clusters from a dataset which includes multiple clusters and another significant portion of "background" samples is a task of practical importance. The traditional spectral clustering algorithm, relying on the leading $K$ eigenvectors to detect the $K$ clusters, fails in such cases. This paper proposes the spectral embedding norm which sums the squared values of the first $I$ (normalized) eigenvectors, where $I$ can be larger than $K$. We prove that the quantity can be used to separate clusters from the background under generic conditions motivated by applications such as anomaly detection. The performance of the algorithm is not sensitive to the choice of $I$, and we present experiments on synthetic and real-world datasets.
Graph Laplacian mixture model
Maretic, Hermina Petric, Frossard, Pascal
Graph learning methods have recently been receiving increasing interest as means to infer structure in datasets. Most of the recent approaches focus on different relationships between a graph and data sample distributions, mostly in settings where all available relate to the same graph. This is, however, not always the case, as data is often available in mixed form, yielding the need for methods that are able to cope with mixture data and learn multiple graphs. We propose a novel generative model that explains a collection of distinct data naturally living on different graphs. We assume the mapping of data to graphs is not known and investigate the problem of jointly clustering a set of data and learning a graph for each of the clusters. Experiments in both synthetic and real-world datasets demonstrate promising performance both in terms of data clustering, as well as multiple graph inference from mixture data.
Iterative Initial Centroid Search via Sampling for k-Means Clustering
In this post, we will look at using an iterative approach to searching for a better set of initial centroids for k-means clustering, and will do so by performing this process on a sample of our full dataset. What do we mean by "better?" Since k-means clustering aims to converge on an optimal set of cluster centers (centroids) and cluster membership based on distance from these centroids via successive iterations, it is intuitive that the more optimal the positioning of these initial centroids, the fewer iterations of the k-means clustering algorithms will be required for convergence. Therefore, thinking about ways to find a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process. What we will do differently, specifically, is to draw a sample of data from our full dataset, and run short runs of of the k-means clustering algorithm on it (not to convergence), short runs which will include, out of necessity, the centroid initialization process.