 Clustering


Multi-View Fuzzy Clustering with The Alternative Learning between Shared Hidden Space and Partition

arXiv.org Artificial Intelligence

As multi-view data grows in the real world, multi-view clustering has become a prominent technique in data mining, pattern recognition, and machine learning. How to effectively exploit the relationship between different views using the characteristics of multi-view data has become a crucial challenge. To address this, a hidden space sharing multi-view fuzzy clustering (HSS-MVFC) method is proposed in the present study. This method is based on the classical fuzzy c-means clustering model and obtains associated information between different views by introducing a shared hidden space. In particular, the shared hidden space and the fuzzy partition are learned alternately and contribute to each other. Meanwhile, the proposed method uses a maximum entropy strategy to control the weights of different views while learning the shared hidden space. The experimental results show that the proposed multi-view clustering method performs better than many related clustering methods.
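For orientation, the classical fuzzy c-means model that the abstract builds on alternates between a membership update and a center update. The sketch below shows only that single-view baseline, assuming Euclidean distance and a fuzzifier m greater than 1; the function names are illustrative, and the shared hidden space and view weighting of HSS-MVFC are not reproduced here.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u[i][j] is the degree of point i in cluster j."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centers]
        if any(d == 0.0 for d in dists):
            # Point coincides with one or more centers: share membership among them.
            zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(zeros)
            memberships.append([z / total for z in zeros])
            continue
        row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]
        memberships.append(row)
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Standard FCM center update: mean of the points weighted by u[i][j] ** m."""
    n_clusters = len(memberships[0])
    dim = len(points[0])
    centers = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
        )
    return centers
```

Calling the two functions in a loop until the centers stop moving gives the usual alternating optimization; each membership row sums to 1 by construction.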


RWR-GAE: Random Walk Regularization for Graph Auto Encoders

arXiv.org Machine Learning

Node embeddings have become a ubiquitous technique for representing graph data in a low-dimensional space. Graph autoencoders, one of the widely adopted deep models, have been proposed to learn graph embeddings in an unsupervised way by minimizing the reconstruction error for the graph data. However, their reconstruction loss ignores the distribution of the latent representation, thus leading to inferior embeddings. To mitigate this problem, we propose a random-walk-based method to regularize the representations learnt by the encoder. We show that the proposed enhancement beats the existing state-of-the-art models by a large margin (up to 7.5%) on the node clustering task, and achieves state-of-the-art accuracy on the link prediction task for three standard datasets: Cora, Citeseer, and PubMed. Code available at https://github.com/MysteryVaibhav/DW-GAE.
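The abstract does not say how the random walks enter the loss. A common pattern in DeepWalk-style methods is to sample fixed-length walks and collect the node pairs that co-occur within a window; those pairs can then drive a regularization term on the embeddings. The sketch below illustrates only that sampling step, under those assumptions, with hypothetical function names and a plain adjacency-list dictionary.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Sample an unbiased random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def context_pairs(walk, window):
    """Return (node, context) pairs for nodes at most `window` steps apart in the walk."""
    pairs = []
    for i, node in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((node, walk[j]))
    return pairs
```

Passing a seeded `random.Random` instance as `rng` keeps the sampled walks reproducible.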


Multi-view Clustering with the Cooperation of Visible and Hidden Views

arXiv.org Artificial Intelligence

Multi-view data are becoming common in real-world modeling tasks, and many multi-view data clustering algorithms have thus been proposed. The existing algorithms usually focus on the cooperation of different views in the original space but neglect the influence of the hidden information among these different visible views, or they only consider the hidden information between the views. The algorithms are therefore not efficient, since the available information is not fully exploited, particularly the otherness information in different views and the consistency information between them. In practice, the otherness and consistency information in multi-view data are both very useful for effective clustering analyses. In this study, a Multi-View clustering algorithm developed with the Cooperation of Visible and Hidden views, i.e., MV-Co-VH, is proposed. The MV-Co-VH algorithm first projects the multiple views from different visible spaces to the common hidden space by using the non-negative matrix factorization (NMF) strategy to obtain the common hidden view data. Collaborative learning is then implemented in the clustering procedure based on the visible views and the shared hidden view. The results of extensive experiments on UCI multi-view datasets and real-world image multi-view datasets show that the clustering performance of the proposed algorithm is competitive with or even better than that of the existing algorithms.
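As background for the NMF step, the standard single-view factorization approximates a non-negative data matrix X by a product WH of non-negative factors. With the Frobenius-norm objective, the well-known multiplicative updates are as follows, where the symbols X, W, and H are notation assumed here rather than taken from the abstract, and the multi-view formulation used by MV-Co-VH may differ.

```latex
\min_{W \ge 0,\; H \ge 0} \; \lVert X - WH \rVert_F^2,
\qquad
H \leftarrow H \odot \frac{W^{\top} X}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{X H^{\top}}{W H H^{\top}}
```

Here $\odot$ and the fraction bars denote element-wise multiplication and division; the updates keep both factors non-negative when they are initialized that way.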


On Defending Against Label Flipping Attacks on Malware Detection Systems

arXiv.org Artificial Intelligence

Label manipulation attacks are a subclass of data poisoning attacks in adversarial machine learning used against different applications, such as malware detection. These types of attacks represent a serious threat to detection systems in environments with a high noise rate or uncertainty, such as complex networks and the Internet of Things (IoT). Recent work in the literature has suggested using the $K$-Nearest Neighboring (KNN) algorithm to defend against such attacks. However, such an approach can suffer from low or even wrong detection accuracy. In this paper, we design an architecture to tackle the Android malware detection problem in IoT systems. We develop an attack mechanism based on the Silhouette clustering method, modified for mobile Android platforms. We propose two Convolutional Neural Network (CNN)-type deep learning algorithms against this Silhouette Clustering-based Label Flipping Attack (SCLFA). We show the effectiveness of these two defense algorithms, Label-based Semi-supervised Defense (LSD) and Clustering-based Semi-supervised Defense (CSD), in correcting labels being attacked. We evaluate the performance of the proposed algorithms by varying the machine learning parameters on three Android datasets (Drebin, Contagio, and Genome) and three types of features (API, intent, and permission). Our evaluation shows that using random forest feature selection and varying ratios of features can result in an improvement of up to 19% in accuracy when compared with the state-of-the-art method in the literature.
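The abstract does not describe how SCLFA uses silhouette values to choose which labels to flip. For reference, the standard silhouette coefficient of a sample compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b). The sketch below computes it with Euclidean distance; the function name is illustrative, and this is not the paper's attack.

```python
def silhouette_scores(points, labels):
    """Return the silhouette coefficient (b - a) / max(a, b) for every point."""

    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        # Group distances from point i to every other point by that point's label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.get(labels[i])
        others = [sum(d) / len(d) for lab, d in by_cluster.items() if lab != labels[i]]
        if not own or not others:
            scores.append(0.0)  # singleton cluster or only one cluster: 0 by convention
            continue
        a = sum(own) / len(own)
        b = min(others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores
```

A score near 1 means the sample sits well inside its labeled cluster, while a negative score means it lies closer to another cluster than to the one its label assigns.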


A Critical Note on the Evaluation of Clustering Algorithms

arXiv.org Machine Learning

Experimental evaluation is a major research methodology for investigating clustering algorithms. For this purpose, a number of benchmark datasets have been widely used in the literature, and their quality plays an important role in the value of the research work. However, in most of the existing studies, little attention has been paid to the specific properties of the datasets, and they are often regarded as black-box problems. In our work, with the help of advanced visualization and dimension reduction techniques, we show that there are potential issues with some of the popular benchmark datasets used to evaluate clustering algorithms that may seriously compromise the research quality and may even produce completely misleading results. We suggest that significant efforts need to be devoted to improving the current practice of experimental evaluation of clustering algorithms by having a principled analysis of each benchmark dataset of interest.


Unexpected Effects of Online K-means Clustering

arXiv.org Machine Learning

In this paper we study k-means clustering in the online setting. In the offline setting the main parameters are the number of centers, k, and the size of the dataset, n. Performance guarantees are given as a function of these parameters. In the online setting new factors come into play: the ordering of the dataset and whether n is known in advance or not. One of the main results of this paper is the discovery that these new factors have dramatic effects on the quality of the clustering algorithms. For example, for constant k: (1) $\Omega(n)$ centers are needed if the order is arbitrary, (2) if the order is random and n is unknown in advance, the number of centers reduces to $\Theta(\log n)$, and (3) if n is known, then the number of centers reduces to a constant. For different values of the new factors, we show upper and lower bounds that are exactly the same up to a constant, thus achieving optimal bounds.


Deep Kernel Learning for Clustering

arXiv.org Machine Learning

We propose a deep learning approach for discovering kernels tailored to identifying clusters over sample data. Our neural network produces sample embeddings that are motivated by, and are at least as expressive as, spectral clustering. Our training objective, based on the Hilbert-Schmidt Independence Criterion, can be optimized via gradient adaptations on the Stiefel manifold, leading to significant acceleration over spectral methods relying on eigendecompositions. Finally, our trained embedding can be directly applied to out-of-sample data. We show experimentally that our approach outperforms several state-of-the-art deep clustering methods, as well as traditional approaches such as $k$-means and spectral clustering, over a broad array of real-life and synthetic datasets.
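For reference, the standard biased empirical estimator of the Hilbert-Schmidt Independence Criterion (HSIC) on n paired samples is given below. The kernel matrices K and L and the centering matrix H are the usual notation, assumed here since the abstract does not define them; how the paper turns this quantity into a clustering objective is not shown.

```latex
\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad
K_{ij} = k(x_i, x_j),\quad
L_{ij} = l(y_i, y_j),\quad
H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}
```

Larger values indicate stronger statistical dependence between the two sets of samples under the chosen kernels.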


Flood Prediction Using Machine Learning Models: Literature Review

arXiv.org Machine Learning

Floods are among the most destructive natural disasters and are highly complex to model. Research on the advancement of flood prediction models has contributed to risk reduction, policy suggestion, minimization of the loss of human life, and reduction of the property damage associated with floods. To mimic the complex mathematical expressions of the physical processes of floods, machine learning (ML) methods have, during the past two decades, contributed greatly to the advancement of prediction systems, providing better performance and cost-effective solutions. Due to the vast benefits and potential of ML, its popularity has dramatically increased among hydrologists. By introducing novel ML methods and hybridizing existing ones, researchers aim to discover more accurate and efficient prediction models. The main contribution of this paper is to demonstrate the state of the art of ML models in flood prediction and to give insight into the most suitable models. In this paper, the literature in which ML models were benchmarked through a qualitative analysis of robustness, accuracy, effectiveness, and speed is particularly investigated to provide an extensive overview of the various ML algorithms used in the field. The performance comparison of ML models presents an in-depth understanding of the different techniques within the framework of a comprehensive evaluation and discussion. As a result, this paper introduces the most promising prediction methods for both long-term and short-term floods. Furthermore, the major trends in improving the quality of the flood prediction models are investigated. Among them, hybridization, data decomposition, algorithm ensemble, and model optimization are reported as the most effective strategies for the improvement of ML methods.


Agglomerative Fast Super-Paramagnetic Clustering

arXiv.org Machine Learning

Concretely, the proposed algorithm does in fact recover the correct super-paramagnetic cluster configurations that are near the entropy maxima. Previous case studies include data clustering of stocks [15] and gene data [4], temporal states of financial markets [8], and state detection for adaptive machine learning in trading [5]. There is an endless variety of potential use cases for this type of fast big-data clustering technology. Building on prior work, we propose and demonstrate an alternative to fast Super-Paramagnetic Clustering (f-SPC) [15] using a modern and streamlined implementation of the "Merging Algorithm" first suggested by Giada [4], one that can recover the same cluster configurations for a variety of test cases, but with significantly reduced compute times. We again use the Noh Ansatz [11] and the Maximum Likelihood Estimation approach introduced by Giada and Marsili [4]. We call the new algorithm Agglomerative Super-Paramagnetic Clustering (ASPC); it has the benefit of being less computationally expensive than the PGAs implemented in [5, 6, 15].


Transferring knowledge from monitored to unmonitored areas for forecasting parking spaces

arXiv.org Artificial Intelligence

Smart cities around the world have begun monitoring parking areas in order to estimate available parking spots and help drivers looking for parking. The current results are indeed promising. However, existing approaches are limited by the high cost of the sensors that need to be installed throughout the city in order to achieve an accurate estimation. This work investigates extending the estimation of parking information from areas equipped with sensors to areas where they are missing. To this end, the similarity between city neighborhoods is determined based on background data, i.e., from geographic information systems. Using the derived similarity values, we analyze the adaptation of occupancy rates from monitored to unmonitored parking areas.