AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Modal-set estimation with an application to clustering

Jiang, Heinrich, Kpotufe, Samory

arXiv.org Machine LearningJun-13-2016

We present a first procedure that can estimate -- with statistical consistency guarantees -- any local-maxima of a density, under benign distributional conditions. The procedure estimates all such local maxima, or $\textit{modal-sets}$, of any bounded shape or dimension, including usual point-modes. In practice, modal-sets can arise as dense low-dimensional structures in noisy data, and more generally serve to better model the rich variety of locally-high-density structures in data. The procedure is then shown to be competitive on clustering applications, and moreover is quite stable to a wide range of settings of its tuning parameter.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

1606.04166

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Nonparametric Modeling of Dynamic Functional Connectivity in fMRI Data

Nielsen, Søren F. V., Madsen, Kristoffer H., Røge, Rasmus, Schmidt, Mikkel N., Mørup, Morten

arXiv.org Machine LearningJun-8-2016

Dynamic functional connectivity (FC) has in recent years become a topic of interest in the neuroimaging community. Several models and methods exist for both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), and the results point towards the conclusion that FC exhibits dynamic changes. The existing approaches modeling dynamic connectivity have primarily been based on time-windowing the data and k-means clustering. We propose a non-parametric generative model for dynamic FC in fMRI that does not rely on specifying window lengths and number of dynamic states. Rooted in Bayesian statistical modeling we use the predictive likelihood to investigate if the model can discriminate between a motor task and rest both within and across subjects. We further investigate what drives dynamic states using the model on the entire data collated across subjects and task/rest. We find that the number of states extracted are driven by subject variability and preprocessing differences while the individual states are almost purely defined by either task or rest. This questions how we in general interpret dynamic FC and points to the need for more research on what drives dynamic FC.

connectivity, covariance matrix, ihmm-wishart, (13 more...)

arXiv.org Machine Learning

1601.00496

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback

Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation

Srivastava, Akash, Zou, James, Sutton, Charles

arXiv.org Machine LearningJun-8-2016

A good clustering can help a data analyst to explore and understand a data set, but what constitutes a good clustering may depend on domain-specific and application-specific criteria. These criteria can be difficult to formalize, even when it is easy for an analyst to know a good clustering when she sees one. We present a new approach to interactive clustering for data exploration, called \ciif, based on a particularly simple feedback mechanism, in which an analyst can choose to reject individual clusters and request new ones. The new clusters should be different from previously rejected clusters while still fitting the data well. We formalize this interaction in a novel Bayesian prior elicitation framework. In each iteration, the prior is adapted to account for all the previous feedback, and a new clustering is then produced from the posterior distribution. To achieve the computational efficiency necessary for an interactive setting, we propose an incremental optimization method over data minibatches using Lagrangian relaxation. Experiments demonstrate that \ciif can produce accurate and diverse clusterings.

artificial intelligence, feedback iteration, machine learning, (15 more...)

arXiv.org Machine Learning

1602.06886

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

Add feedback

Phase Transitions for High Dimensional Clustering and Related Problems

Jin, Jiashun, Ke, Zheng Tracy, Wang, Wanjie

arXiv.org Machine LearningJun-8-2016

Consider a two-class clustering problem where we observe $X_i = \ell_i \mu + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $1 \leq i \leq n$. The feature vector $\mu\in R^p$ is unknown but is presumably sparse. The class labels $\ell_i\in\{-1, 1\}$ are also unknown and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation for the Region of Impossibility and Region of Possibility. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam's idea on comparison of experiments. We also extend the study on statistical limits for clustering to that for signal recovery and that for hypothesis testing. We compare the statistical limits for three problems and expose some interesting insight. We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold $t > 0$, IF-PCA clusters by applying classical PCA to all columns of $X$ with an $L^2$-norm larger than $t$. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering. We find an interesting phase transition for IF-PCA. Our results require delicate analysis, especially on post-selection Random Matrix Theory and on lower bound arguments.

artificial intelligence, machine learning, probability, (17 more...)

arXiv.org Machine Learning

1502.06952

Country: North America > United States > Pennsylvania (0.46)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology > Leukemia (0.46)
Health & Medicine > Therapeutic Area > Hematology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)

Add feedback

Understanding data mining clustering methods

#artificialintelligenceJun-7-2016, 13:17:04 GMT

When you go to the grocery store, you see that items of a similar nature are displayed nearby to each other. When you organize the clothes in your closet, you put similar items together (e.g. Every personal organizing tip on the web to save you from your clutter suggests some sort of grouping of similar items together. Even we don't notice it, we are involved in grouping similar objects together in every aspect of our life. This is called clustering in machine learning, so in this post I will provide an overview of data mining clustering methods. In machine learning or data mining, clustering assigns similar objects together in order to discover structures in data that doesn't have any labels.

artificial intelligence, data mining, machine learning, (15 more...)

#artificialintelligence

Genre: Overview (0.97)

Industry:

Media > Music (0.74)
Leisure & Entertainment (0.74)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Python: K Means Clustering Part 2

#artificialintelligenceJun-7-2016, 09:47:19 GMT

In part 2 we are going focus on checking our assumptions. So far we have learned how to perform a K Means Cluster. When running a K Means Cluster, you first have to choose how many clusters you want. But what is the optimal number of clusters? This is the "art" part of an algorithm like this.

artificial intelligence, machine learning, means clustering part 2, (2 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.73)

Add feedback

Robust Ensemble Clustering Using Probability Trajectories

Huang, Dong, Lai, Jian-Huang, Wang, Chang-Dong

arXiv.org Machine LearningJun-3-2016

Note that V Y L Link set of G w ij W eight between two nodes in G G K -elite neighbor graph (K -ENG) V Node set of G . Note that V Y L Link set of G w ij W eight between two nodes in G p ij (1-step) transition probability fromy i to y j P (1-step) transition probability matrix,P { p ij } N N p T ij T -step transition probability fromy i to y j P T T -step transition probability matrix,P T { p T ij } N N p T i: The i -th row ofP T, p T i: { p T i 1,···,p T i N} PT T i Probability trajectory of a random walker starting fromnode y i with lengthT PTS ij Probability trajectory based similarity betweeny i and y j R (0) Set of the initial regions for PTA,R (0) { R (0) 1,···,R (0) R (0) } S (0) Initial similarity matrix for PTA,S (0) { s (0) ij } R (0) R (0) R ( t) Set of thet -step regions for PTA, R ( t) { R ( t) 1,···,R ( t) R ( t) } S ( t) The t -step similarity matrix for PTA,S ( t) { s ( t) ij } R ( t) R ( t) G Microcluster-cluster bipartite graph (MCBG) N Number of nodes in G V Node set of G L Link set of G w ij W eight between two nodes in G A sparse graph termedK -elite neighbor graph (K -ENG) is then constructed with only a small number of probably reliable links. The ENS strategy is a crucial step in our approach. W e argue that using a small number of probably reliable links may lead to significantly better consensus results than using all graph links regardless of their reliability . The random walk process driven by a new transition probability matrix is performed on theK -ENG to explore the global structure information.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1109/TKDE.2015.2503753

1606.0116

Country: Asia > China > Guangdong Province (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis

Huang, Dong, Lai, Jian-Huang, Wang, Chang-Dong

arXiv.org Machine LearningJun-3-2016

The clustering ensemble technique aims to combine multiple clusterings into a probably better and more robust clustering and has been receiving an increasing attention in recent years. There are mainly two aspects of limitations in the existing clustering ensemble approaches. Firstly, many approaches lack the ability to weight the base clusterings without access to the original data and can be affected significantly by the low-quality, or even ill clusterings. Secondly, they generally focus on the instance level or cluster level in the ensemble system and fail to integrate multi-granularity cues into a unified model. To address these two limitations, this paper proposes to solve the clustering ensemble problem via crowd agreement estimation and multigranularity link analysis. We present the normalized crowd agreement index (NCAI) to evaluate the quality of base clusterings in an unsupervised manner and thus weight the base clusterings in accordance with their clustering validity. To explore the relationship between clusters, the source aware connected triple (SACT) similarity is introduced with regard to their common neighbors and the source reliability. Present address: School of Information Science and Technology, Sun Yat-sen University, Guangzhou Higher Education Mega Center, Panyu District, Guangzhou, Guangdong, 510006, P. R. China. The experiments are conducted on eight real-world datasets. The experimental results demonstrate the effectiveness and robustness of the proposed methods. Keywords: Clustering ensemble, Clustering aggregation, Weighted evidence accumulation clustering, Graph partitioning with multi-granularity link analysis 1. Introduction Data clustering is a fundamental and very challenging problem in data mining and machine learning.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1016/j.neucom.2014.05.094

1405.1297

Country: Asia > China > Guangdong Province > Guangzhou (0.44)

Genre: Research Report > New Finding (0.34)

Industry: Education (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Automatic Identification of Replicated Criminal Websites Using Combined Clustering Methods

@machinelearnbotJun-1-2016, 17:25:43 GMT

The following publication was presented at the 2014 IEEE International Workshop on Cyber Crime and received the Best Paper Award on 5/18/2014. The original IEEE LaTeX formatted PDF publication can also be downloaded from here: IWCC Combined Clustering. To be successful, cybercriminals must figure out how to scale their scams. They duplicate content on new websites, often staying one step ahead of defenders that shut down past schemes. For some scams, such as phishing and counterfeitgoods shops, the duplicated content remains nearly identical. In others, such as advanced-fee fraud and online Ponzi schemes, the criminal must alter content so that it appears different in order to evade detection by victims and law enforcement. Nevertheless, similarities often remain, in terms of the website structure or content, since making truly unique copies does not scale well. In this paper, we present a novel combined clustering method that links together replicated scam websites, even when the criminal has taken steps to hide connections. We evaluate its performance against two collected datasets of scam websites: fake-escrow services and high-yield investment programs (HYIPs). We find that our method more accurately groups similar websites together than does existing general-purpose consensus clustering methods.

artificial intelligence, machine learning, website, (15 more...)

@machinelearnbot

Country:

North America > United States (0.46)
Oceania > Australia (0.14)

Genre: Personal > Honors (0.54)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Short Communication on QUIST: A Quick Clustering Algorithm

Baddar, Sherenaz W. Al-Haj

arXiv.org Machine LearningJun-1-2016

Then, it starts splitting C into smaller sub-clusters, until either one of the following conditions holds, whichever happens first: 1. The instances within each sub-cluster, c, are too similar to be divided any further, or 2. The number of clusters hits an optional upper bound provided by the user. If such bound is not provided by the user, it is assume to be equal to the number of input instances, C Additionally, splitting a given cluster stops when it reaches a minimum cluster size per the user's choice. To decide whether or not instances within a cluster, c, are similar enough to stop splitting it, QUIST calculates the "spreadness" metric denoted by Ψ, such that the spreadness of c, denoted by Ψ

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1606.00398

Country: North America > United States (0.15)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback