 Clustering


Hermitian matrices for clustering directed graphs: insights and applications

arXiv.org Machine Learning

Graph clustering is a basic technique in machine learning, and has widespread applications in different domains. While spectral techniques have been successfully applied for clustering undirected graphs, the performance of spectral clustering algorithms for directed graphs (digraphs) is not in general satisfactory: these algorithms usually require symmetrising the matrix representing a digraph, and typical objective functions for undirected graph clustering do not capture cluster-structures in which the information given by the direction of the edges is crucial. To overcome these downsides, we propose a spectral clustering algorithm based on a complex-valued matrix representation of digraphs. We analyse its theoretical performance on a Stochastic Block Model for digraphs in which the cluster-structure is given not only by variations in edge densities, but also by the direction of the edges. The significance of our work is highlighted on a data set pertaining to internal migration in the United States: while previous spectral clustering algorithms for digraphs can only reveal that people are more likely to move between counties that are geographically close, our approach is able to cluster together counties with a similar socio-economical profile even when they are geographically distant, and illustrates how people tend to move from rural to more urbanised areas.
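The idea of clustering a digraph through a complex-valued (Hermitian) matrix can be illustrated with a minimal sketch. It is not the authors' exact algorithm: the encoding of an edge u→v as the entry `i` (and `-i` in the transposed position), the use of the projection onto the top eigenvectors, and the small deterministic k-means helper are all assumptions made for illustration, as are the function names.

```python
import numpy as np

def hermitian_adjacency(n, edges):
    """Assumed encoding: H[u, v] = i and H[v, u] = -i for each edge u -> v."""
    H = np.zeros((n, n), dtype=complex)
    for u, v in edges:
        H[u, v] += 1j
        H[v, u] -= 1j
    return H

def spectral_embedding(H, num_vecs):
    """Embed nodes as rows of the projection onto the eigenvectors
    whose eigenvalues are largest in absolute value."""
    vals, vecs = np.linalg.eigh(H)           # H is Hermitian, so eigh applies
    top = np.argsort(-np.abs(vals))[:num_vecs]
    V = vecs[:, top]
    P = V @ V.conj().T                       # invariant to eigenvector phase
    return np.hstack([P.real, P.imag])       # real-valued features for k-means

def kmeans(X, k, iters=50):
    """Plain Lloyd iterations with farthest-point initialisation (illustrative)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy digraph: every edge points from nodes {0, 1, 2} to nodes {3, 4, 5}.
edges = [(u, v) for u in range(3) for v in range(3, 6)]
H = hermitian_adjacency(6, edges)
labels = kmeans(spectral_embedding(H, num_vecs=2), k=2)
print(labels)
```

In this toy graph there are no edges inside either group, so the two clusters are defined purely by edge direction, which is the kind of structure a symmetrised adjacency matrix would not distinguish from an undirected bipartite graph.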


Some Developments in Clustering Analysis on Stochastic Processes

arXiv.org Machine Learning

Qidi Peng, Nan Rao, Ran Zhao. Abstract: We review some developments on clustering stochastic processes and come to the conclusion that asymptotically consistent clustering algorithms can be obtained when the processes are ergodic and the dissimilarity measure satisfies the triangle inequality. Examples are provided when the processes are distribution ergodic, covariance ergodic and locally asymptotically self-similar, respectively. Keywords: stochastic process, unsupervised clustering, stationary ergodic processes, local asymptotic self-similarity. 1 Introduction. A stochastic process is an infinite sequence of random variables indexed by "time". The time indexes can be either discrete or continuous. Stochastic process type data have been broadly explored in biological and medical research (Damian et al., 2007; Zhao et al., 2014; Jääskinen et al., 2014; et al., 2018).


Simultaneous Clustering and Optimization for Evolving Datasets

arXiv.org Machine Learning

For any i such that 1 ≤ i ≤ 6, A_i represents an instance of the dataset, X_i represents the corresponding optimization variable, v_i represents a vertex of graph G, and e_ij represents the edge connecting v_i and v_j. [...] heuristic rules used in traditional clustering methods. A formulation of convex clustering was proposed in [13] by relaxing the formulation of k-means clustering. Subsequently, [15] and [16] provided several sufficient conditions for theoretically recovering the cluster membership. Other studies, e.g., [8], [17], focus on improving the efficiency of convex clustering. Although those previous studies greatly improved convex clustering for static datasets, they are unsuitable for handling evolving datasets due to their high computational cost. The method proposed in the paper reduces this computational cost and achieves a good tradeoff between efficiency and accuracy.
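For orientation, the convex clustering objective is commonly written as a data-fit term plus a fusion penalty over the edges of a graph. The sketch below evaluates that objective using the notation of the caption (instances A_i, variables X_i, edges e_ij). The unweighted Euclidean fusion penalty, the 1/2 scaling, the parameter `lam`, and the function name are assumptions; the excerpt does not state the exact formulation used in the paper.

```python
import numpy as np

def convex_clustering_objective(A, X, edges, lam):
    """0.5 * sum_i ||A_i - X_i||^2 + lam * sum_{(i,j) in edges} ||X_i - X_j||."""
    A = np.asarray(A, dtype=float)
    X = np.asarray(X, dtype=float)
    fit = 0.5 * np.sum((A - X) ** 2)                       # stay close to the data
    fusion = sum(np.linalg.norm(X[i] - X[j]) for i, j in edges)  # pull neighbours together
    return float(fit + lam * fusion)

# Two instances joined by one edge.
A = [[0.0, 0.0], [2.0, 0.0]]
separate = convex_clustering_objective(A, A, [(0, 1)], lam=0.5)
merged = convex_clustering_objective(A, [[1.0, 0.0], [1.0, 0.0]], [(0, 1)], lam=0.5)
print(separate, merged)
```

Instances whose optimal variables X_i coincide are read as belonging to the same cluster, so a larger `lam` merges more instances.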


Linear Dynamics: Clustering without identification

arXiv.org Machine Learning

Clustering time series is a delicate task; varying lengths and temporal offsets obscure direct comparisons. A natural strategy is to learn a parametric model for each time series and to cluster the model parameters rather than the sequences themselves. Linear dynamical systems are a fundamental and powerful parametric model class. However, identifying the parameters of a linear dynamical system is a venerable task, permitting provably efficient solutions only in special cases. In this work, we show that clustering the parameters of unknown linear dynamical systems is, in fact, easier than identifying them. We analyze a computationally efficient clustering algorithm that enjoys provable convergence guarantees under a natural separation assumption. Although easy to implement, our algorithm is general, handling multi-dimensional data with time offsets and partial sequences. Evaluating our algorithm on both synthetic data and real electrocardiogram (ECG) signals, we see significant improvements in clustering quality over existing baselines.


Large-Scale Sparse Subspace Clustering Using Landmarks

arXiv.org Machine Learning

Subspace clustering methods based on expressing each data point as a linear combination of all other points in a dataset are popular unsupervised learning techniques. However, existing methods incur high computational complexity on large-scale datasets as they require solving an expensive optimization problem and performing spectral clustering on large affinity matrices. This paper presents an efficient approach to subspace clustering by selecting a small subset of the input data called landmarks. The resulting subspace clustering method in the reduced domain runs in linear time with respect to the size of the original data. Numerical experiments on synthetic and real data demonstrate the effectiveness of our method.


CyberPoint · Blog · Learning in the Dark: Lessons Learned in Unsupervised Learning

#artificialintelligence

CyberPoint has seen great success in using supervised machine learning for malware detection. A while back, however, some colleagues and I set out to investigate whether we could make any interesting discoveries by applying unsupervised learning to CyberPoint's malware dataset. In supervised learning, one has a set of samples, each with an assigned label. In the field of malware analysis, a sample would typically be a file, and its label might be either benign or the malware family to which it belongs. The goal is: given a new sample, correctly predict its label.


A novel framework of the fuzzy c-means distances problem based weighted distance

arXiv.org Machine Learning

Andy Arief Setyawan (a), Ahmad Ilham (b). (a) Department of Information and Communication, Pemalang District Government, Pemalang, Indonesia; (b) Department of Informatics, Universitas Muhammadiyah Semarang, Semarang 50354, Indonesia. Abstract: Clustering is one of the major tasks in data mining and is widely applied in pattern recognition and image segmentation. Fuzzy C-means (FCM) is the most used clustering algorithm and has proven efficient, fast and easy to implement; however, FCM uses the Euclidean distance, which often leads to clustering errors, especially when handling multidimensional and noisy data. In the last few years, researchers have proposed many distance metrics to improve the performance of FCM algorithms, and the majority propose a weighted distance. In this paper, we propose the Canberra Weighted Distance to improve the performance of the FCM algorithm. Experimental results using the UCI data set show that the proposed method is superior to the original method and other clustering methods. Keywords: clustering, fuzzy c-means, euclidean distance, weighted distance, canberra distance. 1. Introduction. Cluster analysis or clustering is the process of partitioning a set of data objects into subsets or clusters, where the objects in a cluster are similar to one [...] This document is a collaborative effort by Intelligent Systems Research Group Indonesia and Informatics Department Universitas Muhammadiyah Semarang.
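To make the role of the distance concrete, here is a minimal sketch of the standard (unweighted) Canberra distance plugged into the usual fuzzy c-means membership update. The specific weighting used by the paper's Canberra Weighted Distance is not given in the excerpt, so none is applied here; the function names, the fuzzifier default `m=2.0`, and the `eps` guard are assumptions.

```python
import numpy as np

def canberra(x, y):
    """Canberra distance: sum_i |x_i - y_i| / (|x_i| + |y_i|), with 0/0 taken as 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(terms.sum())

def fcm_memberships(points, centers, m=2.0, eps=1e-12):
    """Standard FCM update u_ij proportional to d_ij^(-2/(m-1)), using Canberra as d."""
    d = np.array([[canberra(p, c) for c in centers] for p in points]) + eps
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)   # each row sums to 1

points = [[1.0, 2.0], [10.0, 20.0], [1.2, 2.1]]
centers = [[1.0, 2.0], [10.0, 20.0]]
U = fcm_memberships(points, centers)
print(U.round(3))
```

Because each Canberra term is normalised by the magnitudes of its coordinates, features on large scales do not dominate the distance the way they can under the Euclidean distance.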


A Novel Multiple Classifier Generation and Combination Framework Based on Fuzzy Clustering and Individualized Ensemble Construction

arXiv.org Machine Learning

Multiple classifier system (MCS) has become a successful alternative for improving classification performance. However, studies have shown inconsistent results for different MCSs, and it is often difficult to predict which MCS algorithm works the best on a particular problem. We believe that the two crucial steps of MCS, base classifier generation and multiple classifier combination, need to be designed coordinately to produce robust results. In this work, we show that for different testing instances, better classifiers may be trained from different subdomains of training instances including, for example, neighboring instances of the testing instance, or even instances far away from the testing instance. To utilize this intuition, we propose Individualized Classifier Ensemble (ICE). ICE groups training data into overlapping clusters, builds a classifier for each cluster, and then associates each training instance with the top-performing models while taking into account model types and frequency. In testing, ICE finds the k most similar training instances for a testing instance, then predicts the class label of the testing instance by averaging the predictions from models associated with these training instances. Evaluation results on 49 benchmarks show that ICE has a stable improvement on a significant proportion of datasets over existing MCS methods. ICE provides a novel choice of utilizing internal patterns among instances to improve classification, and can be easily combined with various classification models and applied to many application domains.
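The testing step described above can be sketched as follows. This is only an illustration of the stated idea, not the authors' implementation: Euclidean distance as the similarity measure, models represented as callables returning class-probability vectors, and the names `ice_predict`, `assoc`, and `models` are all assumptions. The training-time association of instances with models is taken as given.

```python
import numpy as np

def ice_predict(x, X_train, assoc, models, k=3):
    """Average the predictions of the models associated with the k nearest
    training instances and return the most probable class index."""
    x = np.asarray(x, dtype=float)
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - x, axis=1)     # assumed similarity: Euclidean
    nearest = np.argsort(dists)[:k]
    probs = [models[m](x) for i in nearest for m in assoc[i]]
    return int(np.argmax(np.mean(probs, axis=0)))

# Hypothetical toy setup: two fixed "models" and a hand-written association.
models = [
    lambda x: np.array([0.9, 0.1]),   # stands in for a classifier favouring class 0
    lambda x: np.array([0.2, 0.8]),   # stands in for a classifier favouring class 1
]
X_train = [[0.0], [0.1], [5.0], [5.1]]
assoc = [[0], [0], [1], [1]]          # training instance -> associated model indices

pred_low = ice_predict([0.05], X_train, assoc, models, k=2)
pred_high = ice_predict([5.0], X_train, assoc, models, k=2)
print(pred_low, pred_high)
```

A model associated with several of the neighbours is counted once per neighbour here, so it carries more weight in the average; whether ICE weights models this way is not stated in the abstract.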


A comparative study of general fuzzy min-max neural networks for pattern classification problems

arXiv.org Machine Learning

General fuzzy min-max (GFMM) neural network is a generalization of fuzzy neural networks formed by hyperbox fuzzy sets for classification and clustering problems. Two principal algorithms are deployed to train this type of neural network, i.e., incremental learning and agglomerative learning. This paper presents a comprehensive empirical study of performance influencing factors, advantages, and drawbacks of the general fuzzy min-max neural network on pattern classification problems. The subjects of this study include (1) the impact of maximum hyperbox size, (2) the influence of the similarity threshold and measures on the agglomerative learning algorithm, (3) the effect of data presentation order, (4) comparative performance evaluation of the GFMM with other types of fuzzy min-max neural networks and prevalent machine learning algorithms. The experimental results on benchmark datasets widely used in machine learning showed overall strong and weak points of the GFMM classifier. These outcomes also informed potential research directions for this class of machine learning algorithms in the future. Pattern classification, which belongs to the class of supervised learning, aims to discover information and knowledge under data through taking advantage of the power of learning algorithms [1]. It plays a crucial role in many real-world applications ranging from medical diagnostics [2] and electronic devices [3] to tourism [4] and energy [5]. Multidimensional hyperbox fuzzy sets can be used to deal with the pattern classification problems effectively by partitioning the pattern space and assigning a class label associated with a degree of certainty for each region. Each fuzzy min-max hyperbox is represented by minimum and maximum points along with a fuzzy membership function. The membership function is employed to compute the degree-of-fit of each input sample to a given hyperbox.
Meanwhile, the hyperbox is continuously adjusted during the training process to cover the input patterns. Simpson was the first to formulate a fuzzy min-max neural network (FMNN) using hyperbox representations and proposed the training algorithms for classification [6] and clustering [7] problems. Since then, many researchers have paid attention to enhancing the performance of the FMNN and addressing some of its major drawbacks. Recent surveys [8], [9] on the FMNN have divided modified variants into two groups, i.e., fuzzy min-max networks with and without contraction process. Representatives of improved models removing the contraction procedure from the training algorithms and replacing it with particular neurons for overlapping regions among hyperboxes comprise the inclusion/exclusion fuzzy hyperbox classifier [10], the fuzzy min-max neural network with compensatory neuron [11], the data-core-based FMM neural network [12], and the multilevel FMM neural network [13].
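The degree-of-fit of a sample to a hyperbox can be illustrated with a minimal sketch. The excerpt does not give the membership formula, so the ramp-based form below (membership 1 inside the box, decreasing linearly with distance outside it at a rate set by a sensitivity parameter `gamma`) is an assumption modelled on commonly cited GFMM definitions, restricted to point inputs. The names `ramp`, `hyperbox_membership`, `v` (minimum point) and `w` (maximum point) are illustrative.

```python
import numpy as np

def ramp(r, gamma):
    """Ramp function: 0 for r*gamma < 0, r*gamma on [0, 1], 1 above."""
    return np.clip(r * gamma, 0.0, 1.0)

def hyperbox_membership(x, v, w, gamma=1.0):
    """Degree-of-fit of point x to the hyperbox with min point v and max point w.
    Takes the worst (smallest) fit over all dimensions and both box faces."""
    x, v, w = (np.asarray(a, dtype=float) for a in (x, v, w))
    upper = 1.0 - ramp(x - w, gamma)   # penalty for exceeding the max point
    lower = 1.0 - ramp(v - x, gamma)   # penalty for falling below the min point
    return float(np.min(np.minimum(upper, lower)))

v, w = [0.2, 0.2], [0.4, 0.4]
inside = hyperbox_membership([0.3, 0.3], v, w, gamma=4.0)
outside = hyperbox_membership([0.5, 0.3], v, w, gamma=4.0)
print(inside, outside)
```

A larger `gamma` makes membership fall off faster outside the box, so it controls how sharply each hyperbox separates its region from the rest of the pattern space.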


A Temporal Clustering Algorithm for Achieving the trade-off between the User Experience and the Equipment Economy in the Context of IoT

arXiv.org Machine Learning

We present here the Temporal Clustering Algorithm (TCA), an incremental learning algorithm applicable to problems of anticipatory computing in the context of the Internet of Things. This algorithm was tested in a specific prediction scenario: the consumption of an electric water dispenser typically used in tropical countries, in which the ambient temperature is around 30 degrees Celsius. In this context, the user typically wants to drink iced water and therefore uses the cooler function of the dispenser. Real and synthetic water consumption data were used to test how much energy can be saved by predicting the pattern of use of the equipment. In addition, the algorithm uses a small constant amount of memory, which allows it to be implemented at the lowest cost on microcontrollers with a small amount of memory (less than 1 Kbyte) available on the market. The algorithm can also be configured according to user preference, prioritizing comfort, keeping the water at the desired temperature longer, or prioritizing energy savings. The main result is that the TCA achieved energy savings of up to 40% compared to the conventional mode of operation of the dispenser with an average success rate higher than 90% in its times of use.