Clustering
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
Schubert, Erich, Zimek, Arthur
This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms. We will first outline the motivation for this release, the plans for the future, and then give a brief overview over the new functionality in this version. We also include an appendix presenting an overview on the overall implemented functionality.
Spectral-Spatial Diffusion Geometry for Hyperspectral Image Clustering
Murphy, James M., Maggioni, Mauro
An unsupervised learning algorithm to cluster hyperspectral image (HSI) data is proposed that exploits spatially-regularized random walks. Markov diffusions are defined on the space of HSI spectra with transitions constrained to near spatial neighbors. The explicit incorporation of spatial regularity into the diffusion construction leads to smoother random processes that are more adapted for unsupervised machine learning than those based on spectra alone. The regularized diffusion process is subsequently used to embed the high-dimensional HSI into a lower dimensional space through diffusion distances. Cluster modes are computed using density estimation and diffusion distances, and all other points are labeled according to these modes. The proposed method has low computational complexity and performs competitively against state-of-the-art HSI clustering algorithms on real data. In particular, the proposed spatial regularization confers an empirical advantage over non-regularized methods.
Improving Deep Image Clustering With Spatial Transformer Layers
Souza, Thiago V. M., Zanchettin, Cleber
Deep image clustering is a recent research area, but with exciting published works [15]. The approaches use the most diverse architectures varying the structure of the deep networks, theclustering algorithms and the combination of both parts. Approachessuch as the Deep Clustering Network (DCN) [9] use a pretrained autoencoder combined with the k-means algorithm. Methods such as Joint Unsupervised Learning (JULE) [10] combines deep convolutional networks with hierarchical clustering. Deep Embbed Cluster (DEC) [11], also uses a pretrained autoencoder, then removes the decoder part and uses the encoder as a feature extractor to feed the clustering method. After that, the network is fine-tuned using the cluster assignment hardening loss. Meanwhile, the clusters are iteratively tuned by minimizing the KL-divergence between the distribution of soft labels and the auxiliary target distribution.
Bounded Fuzzy Possibilistic Method
This paper introduces Bounded Fuzzy Possibilistic Method (BFPM) by addressing several issues that previous clustering/classification methods have not considered. In fuzzy clustering, object's membership values should sum to 1. Hence, any object may obtain full membership in at most one cluster. Possibilistic clustering methods remove this restriction. However, BFPM differs from previous fuzzy and possibilistic clustering approaches by allowing the membership function to take larger values with respect to all clusters. Furthermore, in BFPM, a data object can have full membership in multiple clusters or even in all clusters. BFPM relaxes the boundary conditions (restrictions) in membership assignment. The proposed methodology satisfies the necessity of obtaining full memberships and overcomes the issues with conventional methods on dealing with overlapping. Analysing the objects' movements from their own cluster to another (mutation) is also proposed in this paper. BFPM has been applied in different domains in geometry, set theory, anomaly detection, risk management, diagnosis diseases, and other disciplines. Validity and comparison indexes have been also used to evaluate the accuracy of BFPM. BFPM has been evaluated in terms of accuracy, fuzzification constant (different norms), objects' movement analysis, and covering diversity. The promising results prove the importance of considering the proposed methodology in learning methods to track the behaviour of data objects, in addition to obtain accurate results.
Online Clustering by Penalized Weighted GMM
With the dawn of the Big Data era, data sets are growing rapidly. Data is streaming from everywhere - from cameras, mobile phones, cars, and other electronic devices. Clustering streaming data is a very challenging problem. Unlike the traditional clustering algorithms where the dataset can be stored and scanned multiple times, clustering streaming data has to satisfy constraints such as limit memory size, real-time response, unknown data statistics and an unknown number of clusters. In this paper, we present a novel online clustering algorithm which can be used to cluster streaming data without knowing the number of clusters a priori. Results on both synthetic and real datasets show that the proposed algorithm produces partitions which are close to what you could get if you clustered the whole data at one time.
Nearly Optimal Dynamic $k$-Means Clustering for High-Dimensional Data
Hu, Wei, Song, Zhao, Yang, Lin F., Zhong, Peilin
We consider the $k$-means clustering problem in the dynamic streaming setting, where points from a discrete Euclidean space $\{1, 2, \ldots, \Delta\}^d$ can be dynamically inserted to or deleted from the dataset. For this problem, we provide a one-pass coreset construction algorithm using space $\tilde{O}(k\cdot \mathrm{poly}(d, \log\Delta))$, where $k$ is the target number of centers. To our knowledge, this is the first dynamic geometric data stream algorithm for $k$-means using space polynomial in dimension and nearly optimal (linear) in $k$.
Automating Interpretability: Discovering and Testing Visual Concepts Learned by Neural Networks
Ghorbani, Amirata, Wexler, James, Kim, Been
Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Due to it's complexity, i For high-stakes domains such as medical, providing intuitive explanations that can be consumed by domain experts without ML expertise becomes crucial. To this demand, concept-based methods (e.g., TCAV) were introduced to provide explanations using user-chosen high-level concepts rather than individual input features. While these methods successfully leverage rich representations learned by the networks to reveal how human-defined concepts are related to the prediction, they require users to select concepts of their choice and collect labeled examples of those concepts. In this work, we introduce DTCAV (Discovery TCAV) a global concept-based interpretability method that can automatically discover concepts as image segments, along with each concept's estimated importance for a deep neural network's predictions. We validate that discovered concepts are as coherent to humans as hand-labeled concepts. We also show that the discovered concepts carry significant signal for prediction by analyzing a network's performance with stitched/added/deleted concepts. DTCAV results revealed a number of undesirable correlations (e.g., a basketball player's jersey was a more important concept for predicting the basketball class than the ball itself) and show the potential shallow reasoning of these networks.
An Automated Spectral Clustering for Multi-scale Data
Afzalan, Milad, Jazizadeh, Farrokh
Spectral clustering algorithms typically require a priori selection of input parameters such as the number of clusters, a scaling parameter for the affinity measure, or ranges of these values for parameter tuning. Despite efforts for automating the process of spectral clustering, the task of grouping data in multi-scale and higher dimensional spaces is yet to be explored. This study presents a spectral clustering heuristic algorithm that obviates the need for an input by estimating the parameters from the data itself. Specifically, it introduces the heuristic of iterative eigengap search with (1) global scaling and (2) local scaling. These approaches estimate the scaling parameter and implement iterative eigengap quantification along a search tree to reveal dissimilarities at different scales of a feature space and identify clusters. The performance of these approaches has been tested on various real-world datasets of power variation with multi-scale nature and gene expression. Our findings show that iterative eigengap search with a PCA-based global scaling scheme can discover different patterns with an accuracy of higher than 90% in most cases without asking for a priori input information.
Hyperbox based machine learning algorithms: A comprehensive survey
Khuat, Thanh Tung, Ruta, Dymitr, Gabrys, Bogdan
With the rapid development of digital information, the data volume generated by humans and machines is growing exponentially. Along with this trend, machine learning algorithms have been formed and evolved continuously to discover new information and knowledge from different data sources. Learning algorithms using hyperboxes as fundamental representational and building blocks are a branch of machine learning methods. These algorithms have enormous potential for high scalability and online adaptation of predictors built using hyperbox data representations to the dynamically changing environments and streaming data. This paper aims to give a comprehensive survey of literature on hyperbox-based machine learning models. In general, according to the architecture and characteristic features of the resulting models, the existing hyperbox-based learning algorithms may be grouped into three major categories: fuzzy min-max neural networks, hyperbox-based hybrid models, and other algorithms based on hyperbox representation. Within each of these groups, this paper shows a brief description of the structure of models, associated learning algorithms, and an analysis of their advantages and drawbacks. Main applications of these hyperbox-based models to the real-world problems are also described in this paper. Finally, we discuss some open problems and identify potential future research directions in this field.
Spaces of Clusterings
Rolle, Alexander, Scoccola, Luis
Often, a clustering algorithm, rather than producing a single clustering of a dataset, produces a set of clusterings. For example, one gets a set of clusterings by running a clustering algorithm with a range of parameters, or with many initializations. Given a set S of clusterings of a dataset X, one may want to know how many different kinds of clusterings the set S contains, ignoring small differences between elements of S. In effect, one may want to cluster S. This paper proposes two clustering algorithms, specifically for use on sets of clusterings of a fixed dataset. The starting point is the observation that sets of clusterings have geometric structure.Indeed, there are many ways, described in the literature, to define a metric on the set of all clusterings of a fixed dataset, and it is a natural idea to use such metrics to cluster a set of clusterings.