 Nearest Neighbor Methods


Building a Recommendation System for the Cooper Hewitt Design Museum

@machinelearnbot

The Cooper Hewitt Design Museum houses an impressive collection of designed objects that chronicle the history and significance of design in our evolving world. These objects range from unrealized works of architecture to handwoven textiles from Africa to graphic designed posters that reflect the culture and pulse of humanity of their time. The museum is housed in the former mansion of Andrew Carnegie. Upon its completion in 1901, the sixty-four room mansion was the first private residence in the United States to have a structural steel frame that allowed for more expansive spaces and a feeling of lightness. The Carnegie Mansion was also the first private residence to have a residential elevator, central heating, and a precursor to central AC.


Unsupervised clustering under the Union of Polyhedral Cones (UOPC) model

arXiv.org Machine Learning

In this paper, we consider clustering data that is assumed to come from one of finitely many pointed convex polyhedral cones. This model is referred to as the Union of Polyhedral Cones (UOPC) model. Similar to the Union of Subspaces (UOS) model, where each data point in a subspace is generated from an (unknown) basis, in the UOPC model each data point in a cone is assumed to be generated from a finite number of (unknown) extreme rays. To cluster data under this model, we consider several algorithms for computing the affinity between data points: (a) Sparse Subspace Clustering by Non-negative constraints Lasso (NCL), (b) Least squares approximation (LSA), and (c) the K-nearest neighbor (KNN) algorithm. Spectral Clustering (SC) is then applied to the resulting affinity matrix to cluster the data into different polyhedral cones. We show that, on average, KNN outperforms both NCL and LSA, and for this algorithm we provide deterministic conditions for correct clustering. For an affinity measure between the cones, it is shown that as long as the cones are not very coherent and the density of data within each cone exceeds a threshold, KNN leads to accurate clustering. Finally, simulation results on real datasets (MNIST and YaleFace) show that the proposed algorithm works well on real data, indicating the utility of the UOPC model and the proposed algorithm.
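
The KNN affinity step described in the abstract can be illustrated with a small sketch. This is not the paper's construction: it assumes cosine similarity between length-normalized points as the neighbor criterion (a natural choice because cones are invariant to positive scaling), and the function name `knn_affinity` is hypothetical. The resulting symmetric matrix is the kind of input a spectral clustering routine would take.

```python
import math

def knn_affinity(points, k=2):
    """Build a symmetric 0/1 affinity matrix linking each point to its k most similar points.

    Assumption (not from the paper): similarity is the cosine of the angle
    between points, computed after scaling each point to unit length.
    """
    # Scale every point to unit length so only its direction matters.
    unit = []
    for p in points:
        norm = math.sqrt(sum(x * x for x in p))
        unit.append([x / norm for x in p])

    n = len(unit)
    # Pairwise cosine similarities (inner products of unit vectors).
    sim = [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
           for i in range(n)]

    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # The k neighbours with the largest similarity to point i.
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            affinity[i][j] = affinity[j][i] = 1.0  # symmetrize the graph
    return affinity

# Hypothetical toy data: three points near the x-axis, three near the y-axis.
toy_points = [(1, 0.1), (2, 0.3), (3, 0.2), (0.1, 1), (0.3, 2), (0.2, 3)]
toy_affinity = knn_affinity(toy_points, k=2)
```

On this toy data the matrix is block diagonal: points link only to others with a similar direction, which is the structure spectral clustering needs to separate the two groups.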


k-Nearest Neighbors & Anomaly Detection Tutorial

#artificialintelligence

Have you ever wondered about the difference between red and white wine? Some assume that red wine is made from red grapes, and white wine is made from white grapes.


Data Science: Supervised Machine Learning in Python

@machinelearnbot

In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game Go using deep reinforcement learning. Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.


An Introduction to Machine Learning in Julia

#artificialintelligence

Machine learning is now pervasive in every field of inquiry and has led to breakthroughs in various fields, from medical diagnoses to online advertising. Practical machine learning is quite computationally intensive, whether it involves millions of repetitions of simple mathematical methods such as Euclidean distance or more intricate optimizers or backpropagation algorithms. Such computationally intensive techniques need a fast and expressive language: one that enables scientists to write simple, readable code that performs well. In this post, we introduce a simple machine learning algorithm called K Nearest Neighbors, and demonstrate certain Julia features that allow for its easy and efficient implementation. We will demonstrate that the code we write is inherently generic, and show the use of the same code to run on GPUs via the ArrayFire package.


Decision Trees and Political Party Classification

#artificialintelligence

Last time we investigated the k-nearest-neighbors algorithm and the underlying idea that one can learn a classification rule by copying the known classification of nearby data points. This required that we view our data as sitting inside a metric space; that is, we imposed a kind of geometric structure on our data. One glaring problem is that there may be no reasonable way to do this. While we mentioned scaling issues and provided a number of possible metrics in our primer, a more common problem is that the data simply isn't numeric. For instance, a poll of US citizens might ask the respondent to select which of a number of issues he cares most about. There could be 50 choices, and there is no reasonable way to assign these numerical values so that all are equidistant in the resulting metric space. Another issue is that the quality of the data could be bad. For instance, there may be missing values for some attributes (e.g., a respondent may neglect to answer one or more questions).


ŷhat Intuitive Classification using KNN and Python

#artificialintelligence

K-nearest neighbors, or KNN, is a supervised learning algorithm for either classification or regression. It's super intuitive and has been applied to many types of problems. To make a personalized offer to one customer, you might employ KNN to find similar customers and base your offer on their purchase behaviors. KNN has also been applied to medical diagnosis and credit scoring. This is a post about the K-nearest neighbors algorithm and Python.
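
As a rough illustration of the idea in this snippet, here is a minimal KNN classifier in plain Python. It is a sketch, not the code from the linked post: the function `knn_predict`, the Euclidean distance choice, and the toy customer data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (monthly visits, average basket size) per customer.
customers = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
segments = ["occasional", "occasional", "occasional", "frequent", "frequent", "frequent"]

new_customer_segment = knn_predict(customers, segments, (1.2, 1.4), k=3)
```

In practice the features would usually be rescaled first, since Euclidean distance is dominated by whichever feature has the largest numeric range.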


Using Z-values to efficiently compute k-nearest neighbors for Apache Flink – Insight Data

#artificialintelligence

In an earlier post, I described work that I had initially done as an Insight Data Engineering Fellow. That work, now merged into Flink's master branch, was to do an efficient exact k-nearest neighbors (KNN) query using quadtrees. I have since worked on an approximate version of the KNN algorithm, and I will discuss one method I used for the approximate version using Z-value based hashing. For large and high-dimensional data sets, an exact k-nearest neighbors query can become infeasible. There are many algorithms that reduce the dimensionality of the points by hashing them to lower dimensions.
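
To make the Z-value idea concrete, the sketch below computes a Z-value (Morton code) for a 2D integer point by interleaving the bits of its coordinates, then uses the resulting one-dimensional ordering to pick candidate neighbors. This is an illustration only and is not the Flink implementation; the names `z_value` and `approx_knn`, the `window` parameter, and the restriction to non-negative integer coordinates are assumptions.

```python
import bisect

def z_value(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code

def approx_knn(points, query, k=2, window=4):
    """Approximate the k nearest neighbours of `query`.

    Points are sorted by Z-value; only the `window` points on each side of the
    query's position in that ordering are compared by true squared distance.
    """
    ordered = sorted(points, key=lambda p: z_value(*p))
    keys = [z_value(*p) for p in ordered]
    pos = bisect.bisect_left(keys, z_value(*query))
    lo = max(0, pos - window)
    hi = min(len(ordered), pos + window)
    candidates = ordered[lo:hi]
    candidates.sort(key=lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2)
    return candidates[:k]
```

Points that are close in space tend to be close in Z-order, but not always, which is why the result is approximate: a true neighbor can fall outside the scanned window.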


Newbie's Guide to ML -- Part 3 – ML for Newbies

#artificialintelligence

In part 1 I gave a brief introduction to classification. Just to recap, classification is the problem of identifying which group a piece of data belongs to. It's an example of supervised learning because the classifier predicts the classes based on the training data fed to it. An example of classification is finding out whether an email is spam or not. More formally, classification is about finding a model that distinguishes one class of data from another so as to predict the class of data whose class is unknown.


Wasserstein Discriminant Analysis

arXiv.org Machine Learning

Wasserstein Discriminant Analysis (WDA) is a new supervised method that can improve classification of high-dimensional data by computing a suitable linear map onto a lower dimensional subspace. Following the blueprint of classical Linear Discriminant Analysis (LDA), WDA selects the projection matrix that maximizes the ratio of two quantities: the dispersion of projected points coming from different classes, divided by the dispersion of projected points coming from the same class. To quantify dispersion, WDA uses regularized Wasserstein distances, rather than cross-variance measures which have been usually considered, notably in LDA. Thanks to the underlying principles of optimal transport, WDA is able to capture both global (at distribution scale) and local (at sample scale) interactions between classes. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm; we show that the optimization of WDA can be tackled using automatic differentiation of Sinkhorn iterations. Numerical experiments show promising results both in terms of prediction and visualization on toy examples and real life datasets such as MNIST and on deep features obtained from a subset of the Caltech dataset.
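
The Sinkhorn matrix scaling step mentioned in the abstract can be sketched in a few lines. This is a generic illustration of entropy-regularized optimal transport between two small discrete distributions, not the WDA optimization itself; the function name `sinkhorn_plan`, the regularization value, and the fixed iteration count are assumptions.

```python
import math

def sinkhorn_plan(cost, a, b, reg=0.5, n_iters=200):
    """Return an entropy-regularized transport plan between histograms a and b.

    cost[i][j] is the cost of moving mass from bin i of `a` to bin j of `b`.
    The plan has the form diag(u) * K * diag(v) with K = exp(-cost / reg).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan's row sums match a.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Total cost of a plan: sum of plan[i][j] * cost[i][j]."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))
```

Because each iteration consists only of divisions, products, and sums, the whole loop is differentiable with respect to the cost matrix, which is the property the abstract relies on when it refers to automatic differentiation of Sinkhorn iterations.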