AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Detection of Thin Boundaries between Different Types of Anomalies in Outlier Detection using Enhanced Neural Networks

Kiani, Rasoul, Keshavarzi, Amin, Bohlouli, Mahdi

arXiv.org Machine LearningJan-24-2020

Outlier detection has received special attention in various fields, mainly for those dealing with machine learning and artificial intelligence. As strong outliers, anomalies are divided into the point, contextual and collective outliers. The most important challenges in outlier detection include the thin boundary between the remote points and natural area, the tendency of new data and noise to mimic the real data, unlabelled datasets and different definitions for outliers in different applications. Considering the stated challenges, we defined new types of anomalies called Collective Normal Anomaly and Collective Point Anomaly in order to improve a much better detection of the thin boundary between different types of anomalies. Basic domain-independent methods are introduced to detect these defined anomalies in both unsupervised and supervised datasets. The Multi-Layer Perceptron Neural Network is enhanced using the Genetic Algorithm to detect newly defined anomalies with higher precision so as to ensure a test error less than that calculated for the conventional Multi-Layer Perceptron Neural Network. Experimental results on benchmark datasets indicated reduced error of anomaly detection process in comparison to baselines.

anomaly, dataset, detection, (15 more...)

arXiv.org Machine Learning

2001.09209

Country:

North America > United States > Hawaii (0.04)
North America > Canada (0.04)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)
Asia > Middle East > Iran > Zanjan Province > Zanjan (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)
Law Enforcement & Public Safety (0.69)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Towards Automatic Clustering Analysis using Traces of Information Gain: The InfoGuide Method

Rocha, Paulo, Pinheiro, Diego, Cadeiras, Martin, Bastos-Filho, Carmelo

arXiv.org Machine LearningJan-23-2020

Clustering analysis has become a ubiquitous information retrieval tool in a wide range of domains, but a more automatic framework is still lacking. Though internal metrics are the key players towards a successful retrieval of clusters, their effectiveness on real-world datasets remains not fully understood, mainly because of their unrealistic assumptions underlying datasets. We hypothesized that capturing {\it traces of information gain} between increasingly complex clustering retrievals---{\it InfoGuide}---enables an automatic clustering analysis with improved clustering retrievals. We validated the {\it InfoGuide} hypothesis by capturing the traces of information gain using the Kolmogorov-Smirnov statistic and comparing the clusters retrieved by {\it InfoGuide} against those retrieved by other commonly used internal metrics in artificially-generated, benchmarks, and real-world datasets. Our results suggested that {\it InfoGuide} can enable a more automatic clustering analysis and may be more suitable for retrieving clusters in real-world datasets displaying nontrivial statistical properties.

algorithm, dataset, retrieval, (15 more...)

arXiv.org Machine Learning

2001.08677

Country:

South America > Brazil > Pernambuco (0.04)
North America > United States > New Jersey > Camden County > Jackson (0.04)
North America > United States > California > Yolo County > Davis (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.48)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.30)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Coarse-Grain Cluster Analysis of Tensors With Application to Climate Biome Identification

DeSantis, Derek, Wolfram, Phillip J., Bennett, Katrina, Alexandrov, Boian

arXiv.org Machine LearningJan-21-2020

A tensor provides a concise way to codify the interdependence of complex data. Treating a tensor as a d-way array, each entry records the interaction between the different indices. Clustering provides a way to parse the complexity of the data into more readily understandable information. Clustering methods are heavily dependent on the algorithm of choice, as well as the chosen hyperparameters of the algorithm. However, their sensitivity to data scales is largely unknown. In this work, we apply the discrete wavelet transform to analyze the effects of coarse-graining on clustering tensor data. We are particularly interested in understanding how scale effects clustering of the Earth's climate system. The discrete wavelet transform allows classification of the Earth's climate across a multitude of spatial-temporal scales. The discrete wavelet transform is used to produce an ensemble of classification estimates, as opposed to a single classification. Using information theory, we discover a sub-collection of the ensemble that span the majority of the variance observed, allowing for efficient consensus clustering techniques that can be used to identify climate biomes.

algorithm, ensemble, information, (16 more...)

arXiv.org Machine Learning

2001.07827

Country:

North America > United States > New Mexico > Los Alamos County > Los Alamos (0.05)
North America > United States > Rocky Mountains (0.04)
North America > Canada > Rocky Mountains (0.04)
(2 more...)

Genre: Workflow (0.68)

Technology:

Information Technology > Data Science > Data Quality > Data Transformation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Heterogeneous Transfer Learning in Ensemble Clustering

Berikov, Vladimir

arXiv.org Machine LearningJan-20-2020

This work proposes an ensemble clustering method using transfer learning approach. We consider a clustering problem, in which in addition to data under consideration, "similar" labeled data are available. The datasets can be described with different features. The method is based on constructing meta-features which describe structural characteristics of data, and their transfer from source to target domain. An experimental study of the method using Monte Carlo modeling has confirmed its efficiency. In comparison with other similar methods, the proposed one is able to work under arbitrary feature descriptions of source and target domains; it has smaller complexity.

algorithm, ensemble, partition, (14 more...)

arXiv.org Machine Learning

2001.07155

Country:

Europe > Russia (0.04)
Asia > Russia > Siberian Federal District > Novosibirsk Oblast > Novosibirsk (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Randomized Spectral Clustering in Large-Scale Stochastic Block Models

Zhang, Hai, Guo, Xiao, Chang, Xiangyu

arXiv.org Machine LearningJan-19-2020

Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenge to it. In this paper, we study spectral clustering using randomized sketching algorithms from a statistical perspective, where we typically assume the network data are generated from a stochastic block model. To do this, we first use the recent developed sketching algorithms to derive two randomized spectral clustering algorithms, namely, the random projection-based and the random sampling-based spectral clustering. Then we study the theoretical bounds of the resulting algorithms in terms of the approximation error for the population adjacency matrix, the misclustering error, and the estimation error for the link probability matrix. It turns out that, under mild conditions, the randomized spectral clustering algorithms perform similarly to the original one. We also conduct numerical experiments to support the theoretical findings.

adjacency matrix, matrix, spectral, (14 more...)

arXiv.org Machine Learning

2002.00839

Country:

North America > United States > Texas > Harris County > Houston (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.64)

Industry:

Government (0.46)
Information Technology (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Add feedback

Multi-Level Representation Learning for Deep Subspace Clustering

Kheirandishfard, Mohsen, Zohrizadeh, Fariba, Kamangar, Farhad

arXiv.org Machine LearningJan-19-2020

This paper proposes a novel deep subspace clustering approach which uses convolutional autoencoders to transform input images into new representations lying on a union of linear subspaces. The first contribution of our work is to insert multiple fully-connected linear layers between the encoder layers and their corresponding decoder layers to promote learning more favorable representations for subspace clustering. These connection layers facilitate the feature learning procedure by combining low-level and high-level information for generating multiple sets of self-expressive and informative representations at different levels of the encoder. Moreover, we introduce a novel loss minimization problem which leverages an initial clustering of the samples to effectively fuse the multi-level representations and recover the underlying subspaces more accurately. The loss function is then minimized through an iterative scheme which alternatively updates the network parameters and produces new clusterings of the samples. Experiments on four real-world datasets demonstrate that our approach exhibits superior performance compared to the state-of-the-art methods on most of the subspace clustering problems.

dataset, representation, subspace, (16 more...)

arXiv.org Machine Learning

2001.08533

Country:

North America > United States > Texas (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Learning Options from Demonstration using Skill Segmentation

Cockcroft, Matthew, Mawjee, Shahil, James, Steven, Ranchod, Pravesh

arXiv.org Machine LearningJan-19-2020

We present a method for learning options from segmented demonstration trajectories. The trajectories are first segmented into skills using nonparametric Bayesian clustering and a reward function for each segment is then learned using inverse reinforcement learning. From this, a set of inferred trajectories for the demonstration are generated. Option initiation sets and termination conditions are learned from these trajectories using the one-class support vector machine clustering algorithm. We demonstrate our method in the four rooms domain, where an agent is able to autonomously discover usable options from human demonstration. Our results show that these inferred options can then be used to improve learning and planning.

termination condition, termination state, trajectory, (14 more...)

arXiv.org Machine Learning

2001.06793

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Africa > South Africa > Gauteng > Johannesburg (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints

Śmieja, Marek, Struski, Łukasz, Figueiredo, Mário A. T.

arXiv.org Machine LearningJan-18-2020

A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints Marek Smieja a,, Łukasz Struski a, Mário A. T. Figueiredo b a Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland b Instituto de T elecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, PortugalAbstract In this paper, we introduce a neural network framework for semi-supervised clustering (SSC) with pairwise (must-link or cannot-link) constraints. In contrast to existing approaches, we decompose SSC into two simpler classification tasks/stages: the first stage uses a pair of Siamese neural networks to label the unlabeled pairs of points as must-link or cannot-link; the second stage uses the fully pairwise-labeled dataset produced by the first stage in a supervised neural-network-based clustering method. The proposed approach, S 3 C 2 (Semi-Supervised Siamese C lassifiers for C lustering), is motivated by the observation that binary classification (such as assigning pairwise relations) is usually easier than multi-class clustering with partial supervision. On the other hand, being classification-based, our method solves only well-defined classification problems, rather than less well specified clustering tasks. Extensive experiments on various datasets demonstrate the high performance of the proposed method. Keywords: semi-supervised clustering, deep learning, neural networks, pairwise constraints 1. Introduction Clustering is an important unsupervised learning tool often used to analyze the structure of complex high-dimensional data. Semi-supervised clustering (SSC) methods tackle this issue by leveraging partial prior information about class labels, with the goal of obtaining partitions that are better aligned with true classes [1, 2, 3, 4, 5, 6]. One typical way of injecting class label information into clustering is in the form of pairwise constraints (typically, must-link and cannot-link constraints), or pairwise preferences (e.g., should-link and shouldn't-link), which indicate whether a given pair of points is believed to belong to the same or different classes. Most SSC approaches rely on adapting existing unsupervised clustering methods to handle partial (namely, pairwise) information [7, 8, 4, 5, 6, 9]. This requires transferring class-label knowledge into a clustering algorithm, which is often unnatural and puts a higher weight on clustering structure than on class labels.

constraint, labnet, pairwise constraint, (14 more...)

arXiv.org Machine Learning

2001.0672

Country:

Europe > Portugal > Lisbon > Lisbon (0.54)
Europe > Poland > Lesser Poland Province > Kraków (0.24)
North America > United States > District of Columbia > Washington (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

A survey on Machine Learning-based Performance Improvement of Wireless Networks: PHY, MAC and Network layer

Kulin, Merima, Kazaz, Tarik, Moerman, Ingrid, de Poorter, Eli

arXiv.org Machine LearningJan-18-2020

This paper provides a systematic and comprehensive survey that reviews the latest research efforts focused on machine learning (ML) based performance improvement of wireless networks, while considering all layers of the protocol stack (PHY, MAC and network). First, the related work and paper contributions are discussed, followed by providing the necessary background on data-driven approaches and machine learning for non-machine learning experts to understand all discussed techniques. Then, a comprehensive review is presented on works employing ML-based approaches to optimize the wireless communication parameters settings to achieve improved network quality-of-service (QoS) and quality-of-experience (QoE). We first categorize these works into: radio analysis, MAC analysis and network prediction approaches, followed by subcategories within each. Finally, open challenges and broader perspectives are discussed.

algorithm, neural network, wireless network, (15 more...)

arXiv.org Machine Learning

2001.04561

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Europe > Netherlands > South Holland > Delft (0.04)
(8 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)
Instructional Material > Course Syllabus & Notes (0.67)

Industry:

Telecommunications > Networks (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Networks (1.00)
(5 more...)

Technology:

Information Technology > Communications > Networks > Sensor Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(6 more...)

Add feedback

Neighborhood Structure Assisted Non-negative Matrix Factorization and its Application in Unsupervised Point Anomaly Detection

Ahmed, Imtiaz, Hu, Xia Ben, Acharya, Mithun P., Ding, Yu

arXiv.org Machine LearningJan-17-2020

Dimensionality reduction is considered as an important step for ensuring competitive performance in unsupervised learning such as anomaly detection. Non-negative matrix factorization (NMF) is a popular and widely used method to accomplish this goal. But NMF, together with its recent, enhanced version, like graph regularized NMF or symmetric NMF, do not have the provision to include the neighborhood structure information and, as a result, may fail to provide satisfactory performance in presence of nonlinear manifold structure. To address that shortcoming, we propose to consider and incorporate the neighborhood structural similarity information within the NMF framework by modeling the data through a minimum spanning tree. What motivates our choice is the understanding that in the presence of complicated data structure, a minimum spanning tree can approximate the intrinsic distance between two data points better than a simple Euclidean distance does, and consequently, it constitutes a more reasonable basis for differentiating anomalies from the normal class data. We label the resulting method as the neighborhood structure assisted NMF. By comparing the formulation and properties of the neighborhood structure assisted NMF with other versions of NMF including graph regularized NMF and symmetric NMF, it is apparent that the inclusion of the neighborhood structure information using minimum spanning tree makes a key difference. We further devise both offline and online algorithmic versions of the proposed method. Empirical comparisons using twenty benchmark datasets as well as an industrial dataset extracted from a hydropower plant demonstrate the superiority of the neighborhood structure assisted NMF and support our claim of merit.

anomaly, anomaly detection, matrix, (12 more...)

arXiv.org Machine Learning

2001.06541

Country:

North America > United States > Texas > Brazos County > College Station (0.14)
North America > United States > North Carolina > Wake County > Raleigh (0.04)
North America > United States > Minnesota (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.46)

Industry: Energy > Power Industry (0.66)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback