 Clustering


Randomized Clustered Nystrom for Large-Scale Kernel Machines

arXiv.org Machine Learning

The Nystrom method has been popular for generating low-rank approximations of kernel matrices that arise in many machine learning problems. The approximation quality of the Nystrom method depends crucially on the number of selected landmark points and the selection procedure. In this paper, we present a novel algorithm to compute the optimal Nystrom low-rank approximation when the number of landmark points exceeds the target rank. Moreover, we introduce a randomized algorithm for generating landmark points that is scalable to large-scale data sets. The proposed method performs K-means clustering on low-dimensional random projections of a data set and thus leads to significant savings for high-dimensional data sets. Our theoretical results characterize the tradeoffs between the accuracy and efficiency of our proposed method. Extensive experiments demonstrate the competitive performance as well as the efficiency of our proposed method.


Hierarchical Partitioning of the Output Space in Multi-label Data

arXiv.org Machine Learning

Hierarchy Of Multi-label classifiers (HOMER) is a multi-label learning algorithm that breaks the initial learning task into several easier sub-tasks by first constructing a hierarchy of labels from a given label set and then applying a given base multi-label classifier (MLC) to the resulting sub-problems. The primary goal is to effectively address class imbalance and scalability issues that often arise in real-world multi-label classification problems. In this work, we present the general setup for a HOMER model and a simple extension of the algorithm that is suited for MLCs that output rankings. Furthermore, we provide a detailed analysis of the properties of the algorithm, in terms of both effectiveness and computational complexity. A secondary contribution is the presentation of a balanced variant of the k-means algorithm, which serves in the first step of the label hierarchy construction. We conduct extensive experiments on six real-world datasets, studying HOMER's parameters empirically and providing examples of instantiations of the algorithm with different clustering approaches and MLCs. The empirical results demonstrate a significant improvement over the given base MLC.


Fuzzy Longest Common Subsequence Matching With FCM Using R

arXiv.org Artificial Intelligence

Interdependencies between real-valued time series can be captured by finding common similar patterns. Abstracting the time series brings the process of finding similarities closer to the way humans do it. Therefore, abstraction by means of symbolic levels and finding the common patterns attracts researchers. One particular algorithm, Longest Common Subsequence, has been used successfully as a similarity measure between two sequences, including real-valued time series. In this paper, we propose Fuzzy Longest Common Subsequence matching for time series.
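As a point of reference for the Longest Common Subsequence measure mentioned in this abstract, here is a minimal sketch of the classic dynamic-programming LCS applied to numeric sequences. It is not the fuzzy method the paper proposes: it uses a crisp tolerance, and the names `lcs_length`, `lcs_similarity`, and the parameter `eps` are illustrative assumptions.

```python
def lcs_length(a, b, eps=0.0):
    """Length of the longest common subsequence of two numeric sequences.

    Two values are treated as matching when they differ by at most `eps`
    (a crisp threshold, assumed here for illustration only).
    """
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                # Matching pair: extend the best subsequence of the shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # No match: carry over the better of the two neighbouring cells.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_similarity(a, b, eps=0.0):
    """LCS length normalised by the shorter sequence, giving a value in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b, eps) / min(len(a), len(b))
```

For example, `lcs_length([1, 2, 3, 4], [2, 4, 5])` finds the common subsequence `2, 4`. Normalising by the shorter length is one common convention; other normalisations are equally plausible.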


109 Commonly Asked Data Science Interview Questions

#artificialintelligence

What is the Central Limit Theorem and why is it important? How many sampling methods do you know? What is the difference between Type I vs Type II error? What do the terms P-value, coefficient, R-Squared value mean? What is the significance of each of these components? What are the assumptions required for linear regression? There are four major assumptions: 1. There is a linear relationship between the variables, meaning the model you are creating actually fits the data, 2. The errors or residuals of the data are normally distributed and independent from each other, 3. There is minimal multicollinearity between explanatory variables, and 4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable. What is an example of a dataset with a non-Gaussian distribution?


Comparison Between Global Vs Local Normalization of Tweets, and Various Distances

@machinelearnbot

From the text mining literature, it appears that practitioners tend to use Cosine Distance to compare two documents, and they have used it with great success. In our previous blog post, we also used Cosine Distance and found that it worked extremely well, helping us and our clustering method gain insight into the UK Exit Referendum. Here, we decided to change our initial conditions and see whether we get different outcomes, i.e. we decided to try four other distances: Jaccard, Matching, Rogers Tanimoto and Euclidean.
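To make the distances named above concrete, the following is a small sketch that computes them for two binary term-presence vectors (1 if a term occurs in the document, 0 otherwise). The binary encoding, the function names, and the handling of all-zero vectors are assumptions for illustration; the post does not say how the tweets were vectorised. The formulas follow the usual definitions for boolean vectors.

```python
import math


def binary_counts(u, v):
    """Count positions where both are true (tt), only u (tf), only v (ft), neither (ff)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    tt = tf = ft = ff = 0
    for a, b in zip(u, v):
        a, b = bool(a), bool(b)
        if a and b:
            tt += 1
        elif a:
            tf += 1
        elif b:
            ft += 1
        else:
            ff += 1
    return tt, tf, ft, ff


def jaccard(u, v):
    tt, tf, ft, _ = binary_counts(u, v)
    denom = tt + tf + ft
    return (tf + ft) / denom if denom else 0.0  # two all-zero vectors: assumed 0


def matching(u, v):
    tt, tf, ft, ff = binary_counts(u, v)
    n = tt + tf + ft + ff
    return (tf + ft) / n if n else 0.0  # fraction of disagreeing positions


def rogers_tanimoto(u, v):
    tt, tf, ft, ff = binary_counts(u, v)
    r = 2 * (tf + ft)
    denom = tt + ff + r
    return r / denom if denom else 0.0


def euclidean(u, v):
    _, tf, ft, _ = binary_counts(u, v)
    return math.sqrt(tf + ft)  # for 0/1 vectors, squared differences are 0 or 1


def cosine(u, v):
    tt, tf, ft, _ = binary_counts(u, v)
    denom = math.sqrt((tt + tf) * (tt + ft))
    return 1.0 - tt / denom if denom else 1.0  # zero vector: assumed maximal distance
```

With `u = [1, 1, 0, 0, 1]` and `v = [1, 0, 1, 0, 1]`, the two vectors share two terms and disagree on two positions, so `jaccard(u, v)` is 0.5 and `matching(u, v)` is 0.4. Jaccard and cosine ignore positions where both vectors are zero, while Matching and Rogers Tanimoto count those joint absences as agreement, which is one reason the measures can rank document pairs differently.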


Data Driven Resource Allocation for Distributed Learning

arXiv.org Machine Learning

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.



Identifying the number of clusters: finally a solution

@machinelearnbot

It optimizes the number of clusters when the clustering method maximizes the variance among the clusters. If you are using, for example, K-means as the clustering algorithm, your method will fail for every number of clusters you try! As you can see, there is no right number of clusters for this problem using the "naive" kmeans. BTW, for kmeans and density-based clustering algorithms, I've seen methods based on EM (expectation-maximization) and the Bayesian information criterion (BIC) that are a little more robust than this method. Could you share the table of the points...just to play a little bit with them :)


Robust Local Scaling using Conditional Quantiles of Graph Similarities

arXiv.org Machine Learning

Spectral analysis of neighborhood graphs is one of the most widely used techniques for exploratory data analysis, with applications ranging from machine learning to social sciences. In such applications, it is typical to first encode relationships between the data samples using an appropriate similarity function. Popular neighborhood construction techniques such as k-nearest neighbor (k-NN) graphs are known to be very sensitive to the choice of parameters, and more importantly susceptible to noise and varying densities. In this paper, we propose the use of quantile analysis to obtain local scale estimates for neighborhood graph construction. To this end, we build an auto-encoding neural network approach for inferring conditional quantiles of a similarity function, which are subsequently used to obtain robust estimates of the local scales. In addition to being highly resilient to noise or outlying data, the proposed approach does not require extensive parameter tuning unlike several existing methods. Using applications in spectral clustering and single-example label propagation, we show that the proposed neighborhood graphs outperform existing locally scaled graph construction approaches.


Fast Stability Scanning for Future Grid Scenario Analysis

arXiv.org Machine Learning

Future grid scenario analysis requires a major departure from conventional power system planning, where only a handful of the most critical conditions are typically analyzed. Capturing the inter-seasonal variations in renewable generation of a future grid scenario necessitates the use of computationally intensive time-series analysis. In this paper, we propose a planning framework for fast stability scanning of future grid scenarios using a novel feature selection algorithm and a novel self-adaptive PSO-k-means clustering algorithm. To achieve the computational speed-up, the stability analysis is performed only on a small number of representative cluster centroids instead of on the full set of operating conditions. As a case study, we perform small-signal stability and steady-state voltage stability scanning of a simplified model of the Australian National Electricity Market with significant penetration of renewable generation. The simulation results show the effectiveness of the proposed approach. Compared to exhaustive time-series scanning, the proposed framework reduced the computational burden by up to ten times, with an acceptable level of accuracy.