AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
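
As a concrete illustration of grouping objects so that members of a cluster are closer to each other than to members of other clusters, here is a minimal sketch of one common method, k-means (Lloyd's algorithm). The definition above does not prescribe this method; the function name `kmeans`, the sample points, and the choice of initial centers are assumptions made for the example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center (Euclidean distance).
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centers

# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

The result depends on the initial centers and on the number of clusters, both of which are chosen by hand here.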

Forming IDEAS Interactive Data Exploration & Analysis System

Bridges, Robert A., Vincent, Maria A., Huffer, Kelly M. T., Goodall, John R., Jamieson, Jessie D., Burch, Zachary

arXiv.org Artificial Intelligence | May-24-2018

Modern cyber security operations collect an enormous amount of logging and alerting data. While analysts have the ability to query and compute simple statistics and plots from their data, current analytical tools are too simple to admit deep understanding. To detect advanced and novel attacks, analysts turn to manual investigations. While commonplace, current investigations are time-consuming, intuition-based, and proving insufficient. Our hypothesis is that arming the analyst with easy-to-use data science tools will increase their work efficiency, provide them with the ability to resolve hypotheses with scientific inquiry of their data, and support their decisions with evidence over intuition. To this end, we present our work to build IDEAS (Interactive Data Exploration and Analysis System). We present three real-world use-cases that drive the system design from the algorithmic capabilities to the user interface. Finally, a modular and scalable software architecture is discussed along with plans for our pilot deployment with a security operation command.

artificial intelligence, data mining, machine learning, (16 more...)

1805.09676

Country:

North America > United States > Nebraska > Lancaster County > Lincoln (0.14)
North America > Canada > Ontario > Toronto (0.14)
North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
(3 more...)

Genre: Research Report (0.83)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.97)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Searching for a Single Community in a Graph

Ray, Avik, Sanghavi, Sujay, Shakkottai, Sanjay

arXiv.org Machine Learning | May-24-2018

In standard graph clustering/community detection, one is interested in partitioning the graph into more densely connected subsets of nodes. In contrast, the "search" problem of this paper aims to find only the nodes in a "single" such community, the target, out of the many communities that may exist. To do so, we are given suitable side information about the target; for example, a very small number of nodes from the target are labeled as such. We consider a general yet simple notion of side information: all nodes are assumed to have random weights, with nodes in the target having higher weights on average. Given these weights and the graph, we develop a variant of the method of moments that identifies nodes in the target more reliably, and with lower computation, than generic community detection methods that do not use side information and partition the entire graph. Our empirical results show significant gains in runtime, and also gains in accuracy over other graph clustering algorithms.

algorithm, node, side information, (9 more...)

1806.07944

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Hierarchical Clustering with Structural Constraints

Chatziafratis, Vaggos, Niazadeh, Rad, Charikar, Moses

arXiv.org Artificial Intelligence | May-23-2018

Hierarchical clustering (HC) is a widely used data analysis tool, ubiquitous in information retrieval, data mining, and machine learning (see a survey by Berkhin [2006]). This clustering technique represents a given dataset as a binary tree; each leaf represents an individual data point and each internal node represents a cluster on the leaves of its descendants. HC has become the most popular method for gene expression data analysis Eisen et al. [1998], and also has been used in the analysis of social networks Leskovec et al. [2014], Mann et al. [2008], bioinformatics Diez et al. [2015], image and text classification Steinbach et al. [2000], and even in analysis of financial markets Tumminello et al. [2010]. It is attractive because it provides richer information at all levels of granularity simultaneously, compared to more traditional flat clustering approaches like k-means or k-median. Recently, Dasgupta [2016] formulated HC as a combinatorial optimization problem, giving a principled way to compare the performance of different HC algorithms. This optimization viewpoint has since received a lot of attention Roy and Pokutta [2016], Charikar and Chatziafratis [2017], Cohen-Addad et al. [2017], Moseley and Wang [2017], Cohen-Addad et al. [2018] that has led not only to the development of new algorithms but also to theoretical justifications for the observed success of popular HC algorithms (e.g.

artificial intelligence, constraint, machine learning, (18 more...)

1805.09476

Country: Asia > Afghanistan > Parwan Province > Charikar (0.25)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Hypergraph Spectral Clustering in the Weighted Stochastic Block Model

Ahn, Kwangjun, Lee, Kangwook, Suh, Changho

arXiv.org Machine Learning | May-23-2018

Spectral clustering is a celebrated algorithm that partitions objects based on pairwise similarity information. While this approach has been successfully applied to a variety of domains, it comes with limitations. The reason is that there are many other applications in which only multi-way similarity measures are available. This motivates us to explore the multi-way measurement setting. In this work, we develop two algorithms intended for such a setting: Hypergraph Spectral Clustering (HSC) and Hypergraph Spectral Clustering with Local Refinement (HSCLR). Our main contribution lies in the performance analysis of these poly-time algorithms under a random hypergraph model, which we name the weighted stochastic block model, in which objects and multi-way measures are modeled as nodes and weights of hyperedges, respectively. Denoting by $n$ the number of nodes, our analysis reveals the following: (1) HSC outputs a partition which is better than a random guess if the sum of edge weights (to be explained later) is $\Omega(n)$; (2) HSC outputs a partition which coincides with the hidden partition except for a vanishing fraction of nodes if the sum of edge weights is $\omega(n)$; and (3) HSCLR exactly recovers the hidden partition if the sum of edge weights is on the order of $n \log n$. Our results improve upon the state of the art recently established under the model, and they are the first to settle the order-wise optimal results for the binary edge weight case. Moreover, we show that our results lead to efficient sketching algorithms for subspace clustering, a computer vision application. Lastly, we show that HSCLR achieves the information-theoretic limits for a special yet practically relevant model, thereby showing no computational barrier for the case.

artificial intelligence, data mining, machine learning, (16 more...)

doi: 10.1109/JSTSP.2018.2837638

1805.08956

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Vision (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Clustering - What Both Theoreticians and Practitioners are Doing Wrong

Ben-David, Shai

arXiv.org Machine Learning | May-22-2018

Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, are very rudimentary. This note focuses on clustering. I claim that the most significant challenge for clustering is model selection. In contrast with other common computational tasks, for clustering, different algorithms often yield drastically different outcomes. Therefore, the choice of a clustering algorithm and its parameters (like the number of clusters) may play a crucial role in the usefulness of an output clustering solution. However, currently there exists no methodical guidance for clustering tool-selection for a given clustering task. Practitioners pick the algorithms they use without awareness of the implications of their choices, and the vast majority of theory-of-clustering papers focus on providing savings to the resources needed to solve optimization problems that arise from picking some concrete clustering objective. Such savings pale in comparison to the costs of mismatch between those objectives and the intended use of clustering results. I argue the severity of this problem and describe some recent proposals aiming to address this crucial lacuna.

algorithm, artificial intelligence, machine learning, (16 more...)

1805.08838

Country:

Europe (0.47)
North America > United States > Texas (0.14)
Asia > Middle East > Israel (0.14)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Quickshift++: Provably Good Initializations for Sample-Based Mean Shift

Jiang, Heinrich, Jang, Jennifer, Kpotufe, Samory

arXiv.org Machine Learning | May-21-2018

We provide initial seedings to the Quick Shift clustering algorithm, which approximate the locally high-density regions of the data. Such seedings act as more stable and expressive cluster-cores than the singleton modes found by Quick Shift. We establish statistical consistency guarantees for this modification. We then show strong clustering performance on real datasets as well as promising applications to image segmentation.
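
For orientation, the baseline that this abstract modifies can be sketched as follows. This is a minimal sketch of plain Quick Shift only, not of Quickshift++ and not the authors' code: each sample links to its nearest neighbor of strictly higher estimated density within a distance `tau`, and samples with no such neighbor become the singleton modes that root each cluster. The function name `quick_shift`, the Gaussian kernel, and the values of `bandwidth` and `tau` are assumptions made for the example.

```python
import math

def quick_shift(points, bandwidth, tau):
    """Return, for each point, the index of the mode (tree root) it drains to."""
    n = len(points)
    # Unnormalized Gaussian kernel density estimate at every sample.
    density = [
        sum(math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points)
        for p in points
    ]
    # Link each point to its nearest higher-density neighbor within tau.
    parent = list(range(n))
    for i in range(n):
        best = tau
        for j in range(n):
            d = math.dist(points[i], points[j])
            if density[j] > density[i] and d <= best:
                best, parent[i] = d, j

    def root(i):
        # Parents always have strictly higher density, so this loop terminates.
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]

# Hypothetical 1-D samples forming two dense groups.
samples = [(0.0,), (0.5,), (1.1,), (8.0,), (8.4,), (9.0,)]
modes = quick_shift(samples, bandwidth=1.0, tau=2.0)
print(modes)
```

Each cluster here is represented by a single sample index, which is the "singleton mode" behavior the abstract proposes to replace with larger cluster-cores.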

artificial intelligence, machine learning, quickshift, (17 more...)

1805.07909

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Piecewise Flat Embedding for Image Segmentation

Fang, Chaowei, Liao, Zicheng, Yu, Yizhou

arXiv.org Machine Learning | May-20-2018

We introduce a new multi-dimensional nonlinear embedding -- Piecewise Flat Embedding (PFE) -- for image segmentation. Based on the theory of sparse signal recovery, piecewise flat embedding with diverse channels attempts to recover a piecewise constant image representation with sparse region boundaries and sparse cluster value scattering. The resultant piecewise flat embedding exhibits interesting properties such as suppressing slowly varying signals, and offers an image representation with higher region identifiability which is desirable for image segmentation or high-level semantic analysis tasks. We formulate our embedding as a variant of the Laplacian Eigenmap embedding with an $L_{1,p} (0

artificial intelligence, machine learning, segmentation, (16 more...)

1802.03248

Country: Asia > China (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Unsupervised Learning with Python – Towards Data Science

#artificialintelligence | May-19-2018, 15:37:49 GMT

Unsupervised Learning is a class of Machine Learning techniques for finding patterns in data. The data given to an unsupervised algorithm are not labelled, which means only the input variables (X) are given, with no corresponding output variables. In unsupervised learning, the algorithms are left to themselves to discover interesting structures in the data. In supervised learning, the system tries to learn from the previous examples that are given. So if the dataset is labelled, it is a supervised problem; if the dataset is unlabelled, it is an unsupervised problem. The image to the left is an example of supervised learning; we use regression techniques to find the best fit line between the features.
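
The distinction can be made concrete with a small sketch, which is not taken from the article itself. The first function uses labelled pairs (X, y) to fit a least-squares line, the supervised case mentioned above. The second sees only X and splits it at the widest gap, a deliberately simple stand-in for a clustering algorithm. Both function names and all data values are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Supervised: ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def split_largest_gap(xs):
    """Unsupervised toy heuristic: cut 1-D data in two at the widest gap."""
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps))
    threshold = (ordered[cut] + ordered[cut + 1]) / 2
    return [0 if x < threshold else 1 for x in xs]

# Labelled data: outputs y are provided alongside inputs x.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Unlabelled data: only inputs are available, so structure must be discovered.
groups = split_largest_gap([1.0, 1.5, 2.0, 8.0, 8.5, 9.0])
print(slope, intercept, groups)
```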

artificial intelligence, dataset, machine learning, (14 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.37)

Estimation of Non-Normalized Mixture Models and Clustering Using Deep Representation

Matsuda, Takeru, Hyvarinen, Aapo

arXiv.org Machine Learning | May-19-2018

We develop a general method for estimating a finite mixture of non-normalized models. Here, a non-normalized model is defined to be a parametric distribution with an intractable normalization constant. Existing methods for estimating non-normalized models without computing the normalization constant are not applicable to mixture models because they contain more than one intractable normalization constant. The proposed method is derived by extending noise contrastive estimation (NCE), which estimates non-normalized models by discriminating between the observed data and some artificially generated noise. We also propose an extension of NCE with multiple noise distributions. Then, based on the observation that conventional classification learning with neural networks is implicitly assuming an exponential family as a generative model, we introduce a method for clustering unlabeled data by estimating a finite mixture of distributions in an exponential family. Estimation of this mixture model is attained by the proposed extensions of NCE where the training data of neural networks are used as noise. Thus, the proposed method provides a probabilistically principled clustering method that is able to utilize a deep representation. Application to image clustering using a deep neural network gives promising results.

mixture model, nce, noise distribution, (17 more...)

1805.07516

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Spectral feature scaling method for supervised dimensionality reduction

Matsuda, Momo, Morikuni, Keiichi, Sakurai, Tetsuya

arXiv.org Machine Learning | May-17-2018

Spectral dimensionality reduction methods enable linear separations of complex data with high-dimensional features in a reduced space. However, these methods do not always give the desired results due to irregularities or uncertainties of the data. Thus, we consider aggressively modifying the scales of the features to obtain the desired classification. Using prior knowledge on the labels of partial samples to specify the Fiedler vector, we formulate an eigenvalue problem of a linear matrix pencil whose eigenvector has the feature scaling factors. The resulting factors can modify the features of entire samples to form clusters in the reduced space, according to the known labels. In this study, we propose new dimensionality reduction methods supervised using the feature scaling associated with the spectral clustering. Numerical experiments show that the proposed methods outperform well-established supervised methods for toy problems with more samples than features, and are more robust regarding clustering than existing methods. Also, the proposed methods outperform existing methods regarding classification for real-world problems with more features than samples of gene expression profiles of cancer diseases. Furthermore, the feature scaling tends to improve the clustering and classification accuracies of existing unsupervised methods, as the proportion of training data increases.

accuracy, artificial intelligence, machine learning, (16 more...)

1805.07006

Country:

North America > Canada (0.04)
Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
