Clustering
Detection of Thin Boundaries between Different Types of Anomalies in Outlier Detection using Enhanced Neural Networks
Kiani, Rasoul, Keshavarzi, Amin, Bohlouli, Mahdi
Outlier detection has received special attention in various fields, mainly for those dealing with machine learning and artificial intelligence. As strong outliers, anomalies are divided into the point, contextual and collective outliers. The most important challenges in outlier detection include the thin boundary between the remote points and natural area, the tendency of new data and noise to mimic the real data, unlabelled datasets and different definitions for outliers in different applications. Considering the stated challenges, we defined new types of anomalies called Collective Normal Anomaly and Collective Point Anomaly in order to improve a much better detection of the thin boundary between different types of anomalies. Basic domain-independent methods are introduced to detect these defined anomalies in both unsupervised and supervised datasets. The Multi-Layer Perceptron Neural Network is enhanced using the Genetic Algorithm to detect newly defined anomalies with higher precision so as to ensure a test error less than that calculated for the conventional Multi-Layer Perceptron Neural Network. Experimental results on benchmark datasets indicated reduced error of anomaly detection process in comparison to baselines.
Towards Automatic Clustering Analysis using Traces of Information Gain: The InfoGuide Method
Rocha, Paulo, Pinheiro, Diego, Cadeiras, Martin, Bastos-Filho, Carmelo
Clustering analysis has become a ubiquitous information retrieval tool in a wide range of domains, but a more automatic framework is still lacking. Though internal metrics are the key players towards a successful retrieval of clusters, their effectiveness on real-world datasets remains not fully understood, mainly because of their unrealistic assumptions underlying datasets. We hypothesized that capturing {\it traces of information gain} between increasingly complex clustering retrievals---{\it InfoGuide}---enables an automatic clustering analysis with improved clustering retrievals. We validated the {\it InfoGuide} hypothesis by capturing the traces of information gain using the Kolmogorov-Smirnov statistic and comparing the clusters retrieved by {\it InfoGuide} against those retrieved by other commonly used internal metrics in artificially-generated, benchmarks, and real-world datasets. Our results suggested that {\it InfoGuide} can enable a more automatic clustering analysis and may be more suitable for retrieving clusters in real-world datasets displaying nontrivial statistical properties.
Coarse-Grain Cluster Analysis of Tensors With Application to Climate Biome Identification
DeSantis, Derek, Wolfram, Phillip J., Bennett, Katrina, Alexandrov, Boian
A tensor provides a concise way to codify the interdependence of complex data. Treating a tensor as a d-way array, each entry records the interaction between the different indices. Clustering provides a way to parse the complexity of the data into more readily understandable information. Clustering methods are heavily dependent on the algorithm of choice, as well as the chosen hyperparameters of the algorithm. However, their sensitivity to data scales is largely unknown. In this work, we apply the discrete wavelet transform to analyze the effects of coarse-graining on clustering tensor data. We are particularly interested in understanding how scale effects clustering of the Earth's climate system. The discrete wavelet transform allows classification of the Earth's climate across a multitude of spatial-temporal scales. The discrete wavelet transform is used to produce an ensemble of classification estimates, as opposed to a single classification. Using information theory, we discover a sub-collection of the ensemble that span the majority of the variance observed, allowing for efficient consensus clustering techniques that can be used to identify climate biomes.
Heterogeneous Transfer Learning in Ensemble Clustering
This work proposes an ensemble clustering method using transfer learning approach. We consider a clustering problem, in which in addition to data under consideration, "similar" labeled data are available. The datasets can be described with different features. The method is based on constructing meta-features which describe structural characteristics of data, and their transfer from source to target domain. An experimental study of the method using Monte Carlo modeling has confirmed its efficiency. In comparison with other similar methods, the proposed one is able to work under arbitrary feature descriptions of source and target domains; it has smaller complexity.
Randomized Spectral Clustering in Large-Scale Stochastic Block Models
Zhang, Hai, Guo, Xiao, Chang, Xiangyu
Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenge to it. In this paper, we study spectral clustering using randomized sketching algorithms from a statistical perspective, where we typically assume the network data are generated from a stochastic block model. To do this, we first use the recent developed sketching algorithms to derive two randomized spectral clustering algorithms, namely, the random projection-based and the random sampling-based spectral clustering. Then we study the theoretical bounds of the resulting algorithms in terms of the approximation error for the population adjacency matrix, the misclustering error, and the estimation error for the link probability matrix. It turns out that, under mild conditions, the randomized spectral clustering algorithms perform similarly to the original one. We also conduct numerical experiments to support the theoretical findings.
Multi-Level Representation Learning for Deep Subspace Clustering
Kheirandishfard, Mohsen, Zohrizadeh, Fariba, Kamangar, Farhad
This paper proposes a novel deep subspace clustering approach which uses convolutional autoencoders to transform input images into new representations lying on a union of linear subspaces. The first contribution of our work is to insert multiple fully-connected linear layers between the encoder layers and their corresponding decoder layers to promote learning more favorable representations for subspace clustering. These connection layers facilitate the feature learning procedure by combining low-level and high-level information for generating multiple sets of self-expressive and informative representations at different levels of the encoder. Moreover, we introduce a novel loss minimization problem which leverages an initial clustering of the samples to effectively fuse the multi-level representations and recover the underlying subspaces more accurately. The loss function is then minimized through an iterative scheme which alternatively updates the network parameters and produces new clusterings of the samples. Experiments on four real-world datasets demonstrate that our approach exhibits superior performance compared to the state-of-the-art methods on most of the subspace clustering problems.
Learning Options from Demonstration using Skill Segmentation
Cockcroft, Matthew, Mawjee, Shahil, James, Steven, Ranchod, Pravesh
We present a method for learning options from segmented demonstration trajectories. The trajectories are first segmented into skills using nonparametric Bayesian clustering and a reward function for each segment is then learned using inverse reinforcement learning. From this, a set of inferred trajectories for the demonstration are generated. Option initiation sets and termination conditions are learned from these trajectories using the one-class support vector machine clustering algorithm. We demonstrate our method in the four rooms domain, where an agent is able to autonomously discover usable options from human demonstration. Our results show that these inferred options can then be used to improve learning and planning.
A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints
ลmieja, Marek, Struski, ลukasz, Figueiredo, Mรกrio A. T.
A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints Marek Smieja a,, ลukasz Struski a, Mรกrio A. T. Figueiredo b a Faculty of Mathematics and Computer Science, Jagiellonian University, Krakรณw, Poland b Instituto de T elecomunicaรงรตes, Instituto Superior Tรฉcnico, Universidade de Lisboa, Lisbon, PortugalAbstract In this paper, we introduce a neural network framework for semi-supervised clustering (SSC) with pairwise (must-link or cannot-link) constraints. In contrast to existing approaches, we decompose SSC into two simpler classification tasks/stages: the first stage uses a pair of Siamese neural networks to label the unlabeled pairs of points as must-link or cannot-link; the second stage uses the fully pairwise-labeled dataset produced by the first stage in a supervised neural-network-based clustering method. The proposed approach, S 3 C 2 (Semi-Supervised Siamese C lassifiers for C lustering), is motivated by the observation that binary classification (such as assigning pairwise relations) is usually easier than multi-class clustering with partial supervision. On the other hand, being classification-based, our method solves only well-defined classification problems, rather than less well specified clustering tasks. Extensive experiments on various datasets demonstrate the high performance of the proposed method. Keywords: semi-supervised clustering, deep learning, neural networks, pairwise constraints 1. Introduction Clustering is an important unsupervised learning tool often used to analyze the structure of complex high-dimensional data. Semi-supervised clustering (SSC) methods tackle this issue by leveraging partial prior information about class labels, with the goal of obtaining partitions that are better aligned with true classes [1, 2, 3, 4, 5, 6]. One typical way of injecting class label information into clustering is in the form of pairwise constraints (typically, must-link and cannot-link constraints), or pairwise preferences (e.g., should-link and shouldn't-link), which indicate whether a given pair of points is believed to belong to the same or different classes. Most SSC approaches rely on adapting existing unsupervised clustering methods to handle partial (namely, pairwise) information [7, 8, 4, 5, 6, 9]. This requires transferring class-label knowledge into a clustering algorithm, which is often unnatural and puts a higher weight on clustering structure than on class labels.
A survey on Machine Learning-based Performance Improvement of Wireless Networks: PHY, MAC and Network layer
Kulin, Merima, Kazaz, Tarik, Moerman, Ingrid, de Poorter, Eli
This paper provides a systematic and comprehensive survey that reviews the latest research efforts focused on machine learning (ML) based performance improvement of wireless networks, while considering all layers of the protocol stack (PHY, MAC and network). First, the related work and paper contributions are discussed, followed by providing the necessary background on data-driven approaches and machine learning for non-machine learning experts to understand all discussed techniques. Then, a comprehensive review is presented on works employing ML-based approaches to optimize the wireless communication parameters settings to achieve improved network quality-of-service (QoS) and quality-of-experience (QoE). We first categorize these works into: radio analysis, MAC analysis and network prediction approaches, followed by subcategories within each. Finally, open challenges and broader perspectives are discussed.
Neighborhood Structure Assisted Non-negative Matrix Factorization and its Application in Unsupervised Point Anomaly Detection
Ahmed, Imtiaz, Hu, Xia Ben, Acharya, Mithun P., Ding, Yu
Dimensionality reduction is considered as an important step for ensuring competitive performance in unsupervised learning such as anomaly detection. Non-negative matrix factorization (NMF) is a popular and widely used method to accomplish this goal. But NMF, together with its recent, enhanced version, like graph regularized NMF or symmetric NMF, do not have the provision to include the neighborhood structure information and, as a result, may fail to provide satisfactory performance in presence of nonlinear manifold structure. To address that shortcoming, we propose to consider and incorporate the neighborhood structural similarity information within the NMF framework by modeling the data through a minimum spanning tree. What motivates our choice is the understanding that in the presence of complicated data structure, a minimum spanning tree can approximate the intrinsic distance between two data points better than a simple Euclidean distance does, and consequently, it constitutes a more reasonable basis for differentiating anomalies from the normal class data. We label the resulting method as the neighborhood structure assisted NMF. By comparing the formulation and properties of the neighborhood structure assisted NMF with other versions of NMF including graph regularized NMF and symmetric NMF, it is apparent that the inclusion of the neighborhood structure information using minimum spanning tree makes a key difference. We further devise both offline and online algorithmic versions of the proposed method. Empirical comparisons using twenty benchmark datasets as well as an industrial dataset extracted from a hydropower plant demonstrate the superiority of the neighborhood structure assisted NMF and support our claim of merit.