AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Unsupervised learning and data clustering for the construction of Galaxy Catalogs in the Dark Energy Survey

Khan, Asad, Huerta, E. A., Wang, Sibo, Gruendl, Robert

arXiv.org Machine Learning, Dec-5-2018

Large-scale astronomical surveys continue to increase their depth and scale, providing new opportunities to observe large numbers of celestial objects with ever-increasing precision. At the same time, the sheer scale of ongoing and future surveys poses formidable challenges for classifying astronomical objects. Pioneering efforts on this front include the citizen science approach adopted by the Sloan Digital Sky Survey (SDSS). These SDSS datasets have been used recently to train neural network models to classify galaxies in the Dark Energy Survey (DES) that overlap the footprint of both surveys. While this represents a significant step toward classifying unlabeled images of astrophysical objects in DES, the key issue remains: the classification of unlabeled DES galaxies that have not been observed in previous surveys. To start addressing this timely and pressing matter, we demonstrate that knowledge from deep learning algorithms trained with real-object images can be transferred to classify elliptical and spiral galaxies that overlap both SDSS and DES surveys, achieving a state-of-the-art accuracy of 99.6%. More importantly, to initiate the characterization of unlabeled DES galaxies that have not been observed in previous surveys, we demonstrate that our neural network model can also be used for unsupervised clustering, grouping unlabeled DES galaxies into spiral and elliptical types. We showcase the application of this novel approach by classifying over ten thousand unlabeled DES galaxies into spiral and elliptical classes. We conclude by showing that unsupervised clustering can be combined with recursive training to start creating large-scale DES galaxy catalogs in preparation for the Large Synoptic Survey Telescope era.

artificial intelligence, galaxy, machine learning, (16 more...)

arXiv.org Machine Learning

1812.02183

Country:

North America > United States > Illinois (0.30)
Europe > United Kingdom > England (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Energy (1.00)
Government > Regional Government > North America Government > United States Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multiple Manifold Clustering Using Curvature Constrained Path

Babaeian, Amir

arXiv.org Machine Learning, Dec-4-2018

Multiple surface clustering is a challenging task, particularly when the surfaces intersect. Available methods such as Isomap fail to capture the true shape of the surface near the intersection and produce incorrect clusterings. The Isomap algorithm uses the shortest path between points. The main drawback of the shortest path algorithm is that it has no curvature constraint, so a path can connect points on different surfaces. In this paper, we tackle this problem by imposing a curvature constraint on the shortest path algorithm used in Isomap. The algorithm chooses several landmark nodes at random and then checks whether there is a curvature-constrained path between each landmark node and every other node in the neighbourhood graph. We build a binary feature vector for each point, where each entry represents the connectivity of that point to a particular landmark. The binary feature vectors can then be used as input to a conventional clustering algorithm such as hierarchical clustering. We apply our method to simulated and some real datasets and show that it performs comparably to the best methods, such as K-manifold and spectral multi-manifold clustering.
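The landmark idea in this abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes 2D points, a radius-based neighbourhood graph, fixed rather than random landmarks, and it approximates the curvature constraint by bounding the turning angle between consecutive edges of a path. The names `build_graph`, `turn_angle`, `reachable`, and `landmark_features` are hypothetical.

```python
import math
from collections import deque

def build_graph(points, radius):
    """Neighbourhood graph: connect points closer than `radius` (an assumption)."""
    nbrs = [[] for _ in points]
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs

def turn_angle(p, q, r):
    """Absolute angle (radians) between the edge p->q and the edge q->r."""
    ax, ay = q[0] - p[0], q[1] - p[1]
    bx, by = r[0] - q[0], r[1] - q[1]
    return abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))

def reachable(points, nbrs, start, max_turn):
    """Nodes reachable from `start` by paths whose every turn is <= max_turn.

    The search runs over (previous, current) edge states, because whether a
    step is allowed depends on the direction of arrival.
    """
    reached = {start}
    seen = set()
    queue = deque()
    for nb in nbrs[start]:  # the first step has no incoming direction
        seen.add((start, nb))
        reached.add(nb)
        queue.append((start, nb))
    while queue:
        prev, cur = queue.popleft()
        for nxt in nbrs[cur]:
            if (cur, nxt) in seen:
                continue
            if turn_angle(points[prev], points[cur], points[nxt]) <= max_turn:
                seen.add((cur, nxt))
                reached.add(nxt)
                queue.append((cur, nxt))
    return reached

def landmark_features(points, nbrs, landmarks, max_turn):
    """Binary vector per point: entry k is 1 if landmark k can reach the point."""
    sets = [reachable(points, nbrs, lm, max_turn) for lm in landmarks]
    return [[1 if i in s else 0 for s in sets] for i in range(len(points))]

# Two straight lines crossing at the origin (toy data, not from the paper).
line_a = [(float(x), 0.0) for x in range(-3, 4)]
line_b = [(0.0, float(y)) for y in range(-3, 4) if y != 0]
pts = line_a + line_b
graph = build_graph(pts, radius=1.1)
features = landmark_features(pts, graph, landmarks=[0, len(pts) - 1],
                             max_turn=math.radians(30))
```

In this toy setup, points on the two lines receive different binary vectors, and the shared crossing point is connected to both landmarks. The vectors could then be passed to any conventional clustering routine, as the abstract describes.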

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1812.02327

Country: North America > United States (0.94)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Node Embedding with Adaptive Similarities for Scalable Learning over Graphs

Berberidis, Dimitris, Giannakis, Georgios B.

arXiv.org Machine Learning, Dec-3-2018

Node embedding is the task of extracting informative and descriptive features over the nodes of a graph. The importance of node embeddings for graph analytics, as well as learning tasks such as node classification, link prediction and community detection, has led to increased interest in the problem and to a number of recent advances. Much like PCA in the feature domain, node embedding is an inherently unsupervised task; lacking metadata for validation, practical methods may require standardization and limited use of tunable hyperparameters. Finally, node embedding methods must remain scalable in the face of large-scale real-world graphs of ever-increasing size. In the present work, we propose an adaptive node embedding framework that adjusts the embedding process to a given underlying graph in a fully unsupervised manner. To achieve this, we adopt the notion of a tunable node similarity matrix that assigns weights to paths of different lengths. The design of the multilength similarities ensures that the resulting embeddings also inherit interpretable spectral properties. The proposed model is carefully studied, interpreted, and numerically evaluated using stochastic block models. Moreover, an algorithmic scheme is proposed for training the model parameters efficiently and in an unsupervised manner. We perform extensive node classification, link prediction, and clustering experiments on many real-world graphs from various domains, and compare with state-of-the-art scalable and unsupervised node embedding alternatives. The proposed method enjoys superior performance in many cases, while also yielding interpretable information on the underlying structure of the graph.
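One way to read "a tunable node similarity matrix that assigns weights to paths of different lengths" is sketched below. The exact parameterization is not given in this abstract, so the form here is an assumption: $A$ stands for a symmetric (possibly normalized) adjacency matrix, $K$ for a maximum path length, and $\theta_k$ for the tunable weights; all three symbols are introduced for illustration only.

```latex
% Assumed form of a multilength similarity (illustrative, not the paper's definition)
S(\theta) = \sum_{k=1}^{K} \theta_k A^{k}, \qquad \theta_k \ge 0 .

% If A is symmetric with eigendecomposition A = U \Lambda U^{\top}, then
S(\theta) = U \left( \sum_{k=1}^{K} \theta_k \Lambda^{k} \right) U^{\top} ,
% so S(\theta) shares the eigenvectors of A and only reweights its spectrum.
```

Under this assumed form, the weights change only the eigenvalues while the eigenvectors stay those of $A$, which is one sense in which embeddings built from such a matrix could inherit spectral properties of the graph.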

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

1811.10797

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(17 more...)

Genre:

Research Report (1.00)
Personal (0.67)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

From the User to the Medium: Neural Profiling Across Web Communities

Akbari, Mohammad, Relia, Kunal, Elghafari, Anas, Chunara, Rumi

arXiv.org Artificial Intelligence, Dec-3-2018

Online communities provide a unique way for individuals to access information from those in similar circumstances, which can be critical for health conditions that require daily and personalized management. As these groups and topics often arise organically, identifying the types of topics discussed is necessary to understand their needs. In addition, these communities and the people in them can be quite diverse, and existing community detection methods have not been extended towards evaluating these heterogeneities. Progress has been limited because community detection methodologies have not focused on semantic relations between textual features of user-generated content. Here we develop an approach, NeuroCom, that optimally finds dense groups of users as communities in a latent space inferred by a neural representation of users' published content. By embedding words and messages, we show that NeuroCom demonstrates improved clustering and identifies more nuanced discussion topics than other common unsupervised learning approaches.

data mining, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

1812.00912

Genre: Research Report (0.66)

Industry:

Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
Health & Medicine > Consumer Health (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Prototype-based Neural Network Layers: Incorporating Vector Quantization

Saralajew, Sascha, Holdijk, Lars, Rees, Maike, Villmann, Thomas

arXiv.org Artificial Intelligence, Dec-3-2018

Neural networks currently dominate the machine learning community, and they do so for good reasons. Their accuracy on complex tasks such as image classification is unrivaled at the moment, and with recent improvements they are reasonably easy to train. Nevertheless, neural networks lack robustness and interpretability. Prototype-based vector quantization methods, on the other hand, are known for being robust and interpretable. For this reason, we propose techniques and strategies to merge both approaches. This contribution will particularly highlight the similarities between them and outline how to construct a prototype-based classification layer for multilayer networks. Additionally, we provide an alternative, prototype-based approach to the classical convolution operation. Numerical results are not part of this report; instead, the focus lies on establishing a strong theoretical framework. By publishing our framework and the respective theoretical considerations and justifications before finalizing our numerical experiments, we hope to jump-start the incorporation of prototype-based learning in neural networks and vice versa.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1812.01214

Country:

North America > United States (1.00)
Europe (1.00)

Genre:

Overview (0.93)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.46)

Why Use K-Means for Time Series Data? (Part Two) - DZone Big Data

#artificialintelligence, Dec-2-2018, 06:05:49 GMT

In "Why Use K-Means for Time Series Data? (Part One)," I give an overview of how to use different statistical functions and K-Means Clustering for anomaly detection for time series data. I recommend checking that out if you're unfamiliar with either. I am borrowing the code and dataset for this portion from Amid Fish's tutorial. Please take a look at it, it's pretty awesome. In this example, I will show you how you can detect anomalies in EKG data via contextual anomaly detection with K-Means Clustering.
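A minimal sketch of how K-Means can flag anomalies in a periodic signal follows. It is not the code from the article or from the tutorial it borrows from. It assumes one common formulation: split a clean signal into fixed-size windows, cluster the windows, then score each window of a new signal by its squared distance to the nearest centroid. The function names, the non-overlapping windows, and the toy signal are all illustrative assumptions.

```python
def windows(signal, size):
    """Split a signal into consecutive, non-overlapping windows of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k windows."""
    centroids = [list(w) for w in data[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for w in data:
            nearest = min(range(k), key=lambda c: sq_dist(w, centroids[c]))
            groups[nearest].append(w)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def window_errors(signal, centroids, size):
    """Distance from each window to its nearest centroid; large means unusual."""
    return [min(sq_dist(w, c) for c in centroids) for w in windows(signal, size)]

# Toy periodic signal with two recurring shapes (a stand-in for real EKG data).
clean = ([0, 1, 0, -1] + [0, 2, 0, -2]) * 5
centroids = kmeans(windows(clean, 4), k=2)

# Inject an unusual segment into a copy and score every window.
test_signal = list(clean)
test_signal[12:16] = [5, 5, 5, 5]
errors = window_errors(test_signal, centroids, 4)
most_anomalous = errors.index(max(errors))
```

Fitting the centroids on clean data matters in this sketch: if the unusual segment were part of the training windows, it could become a centroid of its own and score as normal.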

artificial intelligence, data mining, machine learning, (15 more...)

#artificialintelligence

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.57)

Interpretable Clustering via Optimal Trees

Bertsimas, Dimitris, Orfanoudaki, Agni, Wiberg, Holly

arXiv.org Machine Learning, Dec-2-2018

State-of-the-art clustering algorithms use heuristics to partition the feature space and provide little insight into the rationale for cluster membership, limiting their interpretability. In healthcare applications, the latter poses a barrier to the adoption of these methods, since medical researchers are required to provide detailed explanations of their decisions in order to gain patient trust and limit liability. We present a new unsupervised learning algorithm that leverages Mixed Integer Optimization techniques to generate interpretable tree-based clustering models. Utilizing the flexible framework of Optimal Trees [1], our method approximates the globally optimal solution, leading to high-quality partitions of the feature space. Our algorithm can incorporate various internal validation metrics, naturally determines the optimal number of clusters, and is able to account for mixed numeric and categorical data. It achieves comparable or superior performance on both synthetic and real-world datasets when compared to K-Means while offering significantly higher interpretability.

artificial intelligence, machine learning, separation, (14 more...)

arXiv.org Machine Learning

1812.00539

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.15)

Genre: Research Report (0.83)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

AnyThreat: An Opportunistic Knowledge Discovery Approach to Insider Threat Detection

Haidar, Diana, Gaber, Mohamed Medhat, Kovalchuk, Yevgeniya

arXiv.org Machine Learning, Dec-1-2018

Insider threat detection is of increasing concern to academia, industry, and governments due to the growing number of malicious insider incidents. The existing approaches proposed for detecting insider threats still have a common shortcoming, which is the high number of false alarms (false positives). The challenge in these approaches is that it is essential to detect all anomalous behaviours which belong to a particular threat. To address this shortcoming, we propose an opportunistic knowledge discovery system, namely AnyThreat, which aims to detect any anomalous behaviour in all malicious insider threats. We design the AnyThreat system with four components. (1) A feature engineering component, which constructs community data sets from the activity logs of a group of users having the same role. (2) An oversampling component, where we propose a novel oversampling technique named Artificial Minority Oversampling and Trapper REmoval (AMOTRE). AMOTRE first removes the minority (anomalous) instances that closely resemble normal (majority) instances to reduce the number of false alarms, then synthetically oversamples the minority class by shielding the border of the majority class. (3) A class decomposition component, which clusters the instances of the majority class into subclasses to weaken the effect of the majority class without information loss. (4) A classification component, which applies a classification method to the subclasses to achieve a better separation between the majority class(es) and the minority class(es). AnyThreat is evaluated on synthetic data sets generated by Carnegie Mellon University. It detects approximately 87.5% of malicious insider threats and achieves a minimum of 3.36% false positives.

artificial intelligence, insider threat, machine learning, (15 more...)

arXiv.org Machine Learning

1812.00257

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)

Anomaly Detection for Network Connection Logs

Mehta, Swapneel, Kothuri, Prasanth, Garcia, Daniel Lanza

arXiv.org Machine Learning, Nov-30-2018

We leverage a streaming architecture based on ELK, Spark and Hadoop in order to collect, store, and analyse database connection logs in near real-time. The proposed system investigates outliers using unsupervised learning, applying widely adopted clustering and classification algorithms to log data and highlighting the subtle variances in each model by visualising outliers. Arriving at a novel solution to evaluate untagged, unfiltered connection logs, we propose an approach that can be extrapolated to a generalised system for analysing connection logs across a large infrastructure comprising thousands of individual nodes and generating hundreds of log lines per second.

I. INTRODUCTION

Anomaly detection has provided a classic problem statement across multifarious use-cases ranging from scientific observations to financial transactions. We define an anomaly as a single observation, or a set thereof, that fails to conform to a group of properties exhibited by larger collections of such observations.

data mining, detection, machine learning, (15 more...)

arXiv.org Machine Learning

1812.01941

Country:

North America > United States (0.47)
Europe (0.29)

Genre: Research Report (0.70)

Industry: Information Technology > Security & Privacy (0.71)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

Duan, Tiehang, Lou, Qi, Srihari, Sargur N., Xie, Xiaohui

arXiv.org Machine Learning, Nov-29-2018

Current state-of-the-art nonparametric Bayesian text clustering methods model documents through a multinomial distribution on bags of words. Although these methods can effectively utilize the word burstiness representation of documents and achieve decent performance, they do not explore the sequential information of text or relationships among synonyms. In this paper, documents are modeled jointly as bags of words, sequential features, and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to effectively exploit this joint document representation in text clustering. The sequential features are extracted by the encoder-decoder component. Word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major aspects: 1) improved performance across multiple diverse text datasets in terms of normalized mutual information (NMI); 2) more accurate inference of ground-truth cluster numbers, with a regularization effect on tiny outlier clusters.
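For readers unfamiliar with the evaluation metric named above, here is a small sketch of normalized mutual information between a predicted clustering and reference labels. Several normalizations are in use, and the abstract does not say which one the authors chose; this sketch assumes the arithmetic mean of the two entropies, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )

def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumed choice of normalizer)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both assignments put everything in one cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2)

# NMI ignores how clusters are named: these two assignments are the same partition.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# These two assignments are statistically independent of each other.
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on the partition and not on the label names, it suits clustering evaluation, where cluster indices are arbitrary.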

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Machine Learning

1811.125

Country:

North America > United States > New York (0.29)
North America > United States > California (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.83)
