AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Randomized Clustered Nystrom for Large-Scale Kernel Machines

Pourkamali-Anaraki, Farhad, Becker, Stephen

arXiv.org Machine Learning, Dec-19-2016

The Nystrom method has been popular for generating low-rank approximations of kernel matrices that arise in many machine learning problems. The approximation quality of the Nystrom method depends crucially on the number of selected landmark points and on the selection procedure. In this paper, we present a novel algorithm to compute the optimal Nystrom low-rank approximation when the number of landmark points exceeds the target rank. Moreover, we introduce a randomized algorithm for generating landmark points that is scalable to large data sets. The proposed method performs K-means clustering on low-dimensional random projections of a data set and thus leads to significant savings for high-dimensional data sets. Our theoretical results characterize the tradeoffs between the accuracy and efficiency of our proposed method. Extensive experiments demonstrate the competitive performance as well as the efficiency of our proposed method.
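The landmark-selection idea in this abstract (K-means on a random projection of the data) can be sketched in plain Python. This is only one plausible reading of the abstract, not the authors' implementation: the function names, the Gaussian projection, the `target_dim` and `seed` parameters, and the choice to average the original points of each cluster are all assumptions made for illustration. The Nystrom approximation step itself is not shown.

```python
import random


def random_projection(points, target_dim, rng):
    """Project each point onto `target_dim` random Gaussian directions."""
    dim = len(points[0])
    scale = 1.0 / target_dim ** 0.5
    directions = [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)]
                  for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in directions]
            for p in points]


def kmeans_labels(points, k, rng, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def landmarks_from_projection(points, k, target_dim, seed=0):
    """Cluster in the projected space, then average the ORIGINAL points
    of each cluster to obtain landmark points (an assumed design choice)."""
    rng = random.Random(seed)
    projected = random_projection(points, target_dim, rng)
    labels = kmeans_labels(projected, k, rng)
    landmarks = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:  # clusters that ended up empty yield no landmark
            landmarks.append([sum(col) / len(members) for col in zip(*members)])
    return landmarks
```

The distance computations inside K-means run on `target_dim` coordinates instead of the full dimension, which is where the savings for high-dimensional data would come from in this sketch.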

approximation, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1612.0647

Country: North America > United States > Colorado (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Hierarchical Partitioning of the Output Space in Multi-label Data

Papanikolaou, Yannis, Katakis, Ioannis, Tsoumakas, Grigorios

arXiv.org Machine Learning, Dec-19-2016

Hierarchy Of Multi-label classifiers (HOMER) is a multi-label learning algorithm that breaks the initial learning task into several easier sub-tasks by first constructing a hierarchy of labels from a given label set and then applying a given base multi-label classifier (MLC) to the resulting sub-problems. The primary goal is to effectively address class imbalance and scalability issues that often arise in real-world multi-label classification problems. In this work, we present the general setup for a HOMER model and a simple extension of the algorithm that is suited for MLCs that output rankings. Furthermore, we provide a detailed analysis of the properties of the algorithm, in terms of both effectiveness and computational complexity. A secondary contribution is the presentation of a balanced variant of the k-means algorithm, which serves in the first step of the label hierarchy construction. We conduct extensive experiments on six real-world datasets, empirically studying HOMER's parameters and providing examples of instantiations of the algorithm with different clustering approaches and MLCs. The empirical results demonstrate a significant improvement over the given base MLC.

artificial intelligence, inductive learning, machine learning, (18 more...)

arXiv.org Machine Learning

1612.06083

Country:

North America > United States (0.28)
Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)

Fuzzy Longest Common Subsequence Matching With FCM Using R

Ozkan, Ibrahim, Turksen, I. Burhan

arXiv.org Artificial Intelligence, Dec-19-2016

Capturing the interdependencies between real-valued time series can be achieved by finding common similar patterns. Abstracting the time series brings the process of finding similarities closer to the way humans do it. Abstraction by means of symbolic levels, followed by a search for common patterns, has therefore attracted researchers. One particular algorithm, Longest Common Subsequence, has been used successfully as a similarity measure between two sequences, including real-valued time series. In this paper, we propose Fuzzy Longest Common Subsequence matching for time series.
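For readers unfamiliar with the baseline, the classic Longest Common Subsequence measure can be sketched for real-valued series as follows. This is the standard dynamic-programming LCS with a crisp matching tolerance, not the fuzzy variant the paper proposes; the `tol` parameter and the normalization by the shorter series are assumptions made for illustration.

```python
def lcs_length(a, b, tol=0.0):
    """Length of the longest common subsequence of two numeric sequences.

    Two values match when they differ by at most `tol` (a hypothetical
    tolerance parameter, not taken from the paper).
    """
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if abs(a[i - 1] - b[j - 1]) <= tol:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def lcs_similarity(a, b, tol=0.0):
    """Normalize the LCS length by the shorter sequence length (0.0 to 1.0)."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b, tol) / shorter if shorter else 0.0
```

With `tol=0.0` only exactly equal values match, which is rarely useful for measured data; a fuzzy approach would presumably replace this hard threshold with a graded degree of match.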

artificial intelligence, machine learning, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

1508.03671

Country: North America > Canada > Ontario (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)

109 Commonly Asked Data Science Interview Questions

#artificialintelligence, Dec-18-2016, 14:35:33 GMT

What is the Central Limit Theorem and why is it important? How many sampling methods do you know? What is the difference between Type I vs Type II error? What do the terms P-value, coefficient, R-Squared value mean? What is the significance of each of these components? What are the assumptions required for linear regression? There are four major assumptions: 1. There is a linear relationship between the variables, meaning the model you are creating actually fits the data, 2. The errors or residuals of the data are normally distributed and independent from each other, 3. There is minimal multicollinearity between explanatory variables, and 4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable. What is an example of a dataset with a non-Gaussian distribution?
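As a small illustration of the regression terms in these questions (coefficient, R-squared, residuals), here is a minimal sketch of a one-variable least-squares fit in plain Python. The function names are hypothetical, and the sketch only computes the quantities; it does not test any of the assumptions listed above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed minus fitted values; plotting these against x is a common
    informal check for constant variance (homoscedasticity)."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals(xs, ys, slope, intercept))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot else float("nan")
```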

artificial intelligence, data mining, machine learning, (17 more...)

#artificialintelligence

Country: North America > United States (0.04)

Genre: Personal > Interview (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.33)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.30)

Comparison Between Global Vs Local Normalization of Tweets, and Various Distances

@machinelearnbot, Dec-18-2016, 02:05:04 GMT

From the text mining literature, it appears that practitioners tend to use Cosine Distance to compare two documents, and they have used it with great success. In our previous blog post we also used Cosine Distance and found that it worked extremely well, helping us and our clustering method gain insight into the UK Exit Referendum. Here, we decided to change our initial conditions and see whether we get different outcomes, i.e., we decided to try four other distances: Jaccard, Matching, Rogers-Tanimoto and Euclidean.
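The five distances named in this excerpt can be sketched for binary term-presence vectors as follows. The formulas are the usual textbook definitions of each dissimilarity; how the blog post itself encodes tweets or normalizes them is not stated in the excerpt, so the 0/1 vector representation here is an assumption.

```python
import math


def _counts(u, v):
    """Count agreement patterns between two equal-length 0/1 vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(u, v) if a and b)
    only_u = sum(1 for a, b in zip(u, v) if a and not b)
    only_v = sum(1 for a, b in zip(u, v) if not a and b)
    neither = len(u) - both - only_u - only_v
    return both, only_u, only_v, neither


def jaccard(u, v):
    """Mismatches divided by positions where at least one vector is 1."""
    both, only_u, only_v, _ = _counts(u, v)
    denom = both + only_u + only_v
    return (only_u + only_v) / denom if denom else 0.0


def matching(u, v):
    """Simple matching dissimilarity: fraction of positions that differ."""
    _, only_u, only_v, _ = _counts(u, v)
    return (only_u + only_v) / len(u) if u else 0.0


def rogers_tanimoto(u, v):
    """Like matching, but mismatches are weighted twice."""
    both, only_u, only_v, neither = _counts(u, v)
    mismatches = 2 * (only_u + only_v)
    denom = both + neither + mismatches
    return mismatches / denom if denom else 0.0


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """One minus the cosine of the angle between the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0
```

Note that Jaccard ignores positions where both vectors are 0, while Matching and Rogers-Tanimoto count them as agreement; for sparse text vectors this difference can change the clustering noticeably.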

artificial intelligence, machine learning, normalization, (15 more...)

@machinelearnbot

Country: Europe > United Kingdom (0.38)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.38)

Data Driven Resource Allocation for Distributed Learning

Dick, Travis, Li, Mu, Pillutla, Venkata Krishna, White, Colin, Balcan, Maria Florina, Smola, Alex

arXiv.org Machine Learning, Dec-15-2016

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, that classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data-dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving that existing scalable heuristics perform well in natural non-worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality-sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real-world image and advertising datasets. We also demonstrate that our technique scales strongly with the available computing power.

artificial intelligence, data mining, machine learning, (21 more...)

arXiv.org Machine Learning

1512.04848

Country: North America (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.67)

josephmisiti/awesome-machine-learning

#artificialintelligence, Dec-14-2016, 18:05:55 GMT

It emphasizes flexibility through the elegant use of object-oriented design patterns.

optimizing gpu-meta-programming code generating array, programming language, text processing, (22 more...)

#artificialintelligence

Country:

North America > United States > Illinois (0.04)
Oceania > Samoa (0.04)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
(3 more...)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education (1.00)
Information Technology (0.93)
Health & Medicine (0.93)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(6 more...)

Identifying the number of clusters: finally a solution

@machinelearnbot, Dec-14-2016, 17:26:14 GMT

It optimizes the number of clusters when the clustering method maximizes the variance among the clusters. If you are using, for example, K-means as the clustering algorithm, your method will fail for every number of clusters you try! As you can see, the right number of clusters doesn't exist for this problem when using "naive" k-means. BTW, for k-means and density-based clustering algorithms I've seen methods based on EM (expectation-maximization) and the Bayesian information criterion (BIC) that are a little more robust than this method. Could you share the table of the points, just to play with them a little? :)

artificial intelligence, identifying, machine learning

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Robust Local Scaling using Conditional Quantiles of Graph Similarities

Thiagarajan, Jayaraman J., Sattigeri, Prasanna, Ramamurthy, Karthikeyan Natesan, Kailkhura, Bhavya

arXiv.org Machine Learning, Dec-14-2016

Spectral analysis of neighborhood graphs is one of the most widely used techniques for exploratory data analysis, with applications ranging from machine learning to social sciences. In such applications, it is typical to first encode relationships between the data samples using an appropriate similarity function. Popular neighborhood construction techniques such as k-nearest neighbor (k-NN) graphs are known to be very sensitive to the choice of parameters, and more importantly susceptible to noise and varying densities. In this paper, we propose the use of quantile analysis to obtain local scale estimates for neighborhood graph construction. To this end, we build an auto-encoding neural network approach for inferring conditional quantiles of a similarity function, which are subsequently used to obtain robust estimates of the local scales. In addition to being highly resilient to noise or outlying data, the proposed approach does not require extensive parameter tuning unlike several existing methods. Using applications in spectral clustering and single-example label propagation, we show that the proposed neighborhood graphs outperform existing locally scaled graph construction approaches.

artificial intelligence, graph, machine learning, (18 more...)

arXiv.org Machine Learning

1612.04875

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area (0.95)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.55)

Fast Stability Scanning for Future Grid Scenario Analysis

Liu, Ruidong, Verbic, Gregor, Ma, Jin

arXiv.org Machine Learning, Dec-13-2016

Future grid scenario analysis requires a major departure from conventional power system planning, where only a handful of the most critical conditions are typically analyzed. Capturing the inter-seasonal variations in renewable generation of a future grid scenario necessitates the use of computationally intensive time-series analysis. In this paper, we propose a planning framework for fast stability scanning of future grid scenarios using a novel feature selection algorithm and a novel self-adaptive PSO-k-means clustering algorithm. To achieve the computational speed-up, the stability analysis is performed only on a small number of representative cluster centroids instead of on the full set of operating conditions. As a case study, we perform small-signal stability and steady-state voltage stability scanning of a simplified model of the Australian National Electricity Market with significant penetration of renewable generation. The simulation results show the effectiveness of the proposed approach. Compared to exhaustive time-series scanning, the proposed framework reduced the computational burden by up to ten times, with an acceptable level of accuracy.

artificial intelligence, machine learning, stability, (13 more...)

arXiv.org Machine Learning

1701.03436

Country: Oceania > Australia > New South Wales (0.14)

Genre: Research Report > New Finding (0.88)

Industry:

Energy > Renewable > Solar (1.00)
Energy > Power Industry > Utilities (0.68)
Energy > Renewable > Wind (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
