 Nearest Neighbor Methods


Flattening the Density Gradient for Eliminating Spatial Centrality to Reduce Hubness

AAAI Conferences

Spatial centrality, whereby samples closer to the center of a dataset tend to be closer to all other samples, is regarded as one source of hubness. Hubness is well known to degrade k-nearest-neighbor (k-NN) classification. Spatial centrality can be removed by centering, i.e., shifting the origin to the global center of the dataset, in cases where inner product similarity is used. However, when Euclidean distance is used, centering has no effect on spatial centrality because the distance between the samples is the same before and after centering. As described in this paper, we propose a solution for the hubness problem when Euclidean distance is considered. We provide a theoretical explanation to demonstrate how the solution eliminates spatial centrality and reduces hubness. We then present some discussion of the reason the proposed solution works, from a viewpoint of density gradient, which is regarded as the origin of spatial centrality and hubness. We demonstrate that the solution corresponds to flattening the density gradient. Using real-world datasets, we demonstrate that the proposed method improves k-NN classification performance and outperforms an existing hub-reduction method.
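The abstract's claim about centering can be checked numerically. The sketch below is only an illustration on made-up 2-D points, not the paper's proposed method: it shows that shifting the origin to the dataset mean leaves pairwise Euclidean distances unchanged while changing inner-product similarities. The data and the helper names (`center`, `euclidean`, `inner`) are assumptions introduced here.

```python
import math

def center(points):
    """Shift the origin to the global mean of the dataset."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return [[p[i] - mean[i] for i in range(dim)] for p in points]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy data; the mean is (3.5, 3.0).
points = [[1.0, 2.0], [3.0, 5.0], [6.0, 1.0], [4.0, 4.0]]
centered = center(points)

# Euclidean distance depends only on differences, so the shift cancels out.
dist_before = euclidean(points[0], points[1])
dist_after = euclidean(centered[0], centered[1])

# Inner products depend on the origin, so centering changes them.
ip_before = inner(points[0], points[1])
ip_after = inner(centered[0], centered[1])

print(dist_before, dist_after)  # equal
print(ip_before, ip_after)      # different
```

This is why a remedy that works for inner-product similarity does not carry over to the Euclidean case the paper targets.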


Variable dimension data? • /r/MachineLearning

@machinelearnbot

You could use k-nearest-neighbor interpolation to give the empty 0 values a "guess" based on what those values look like in the nearest neighbors. How well this works really depends on the properties of the data. If dimension k can be predicted by some association with a dimension j, and this relationship between k and j is fairly strong throughout the data, then it's worth trying. If it's all over the place, this hack won't help at all, and it might even produce very unreliable predictions.
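A minimal sketch of the idea in this comment, under a few assumptions that are not in the original post: missing entries are marked with `None` rather than 0, distance is computed only over the dimensions both rows have, and the guess is the plain mean of the k nearest rows' values. The function name `knn_impute` and the data are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor is only useful if it has the missing column.
                if m == i or other[j] is None:
                    continue
                shared = [d for d in range(len(row))
                          if row[d] is not None and other[d] is not None]
                if not shared:
                    continue
                # Root mean squared difference over the shared dimensions.
                dist = math.sqrt(
                    sum((row[d] - other[d]) ** 2 for d in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Toy data where column 1 is roughly twice column 0.
data = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [2.1, None],
    [10.0, 20.0],
]
result = knn_impute(data, k=2)
print(result[3])
```

Here the two nearest rows to `[2.1, None]` are `[2.0, 4.0]` and `[3.0, 6.0]`, so the guess is their mean in column 1. When the columns are unrelated, the same procedure returns an equally confident but meaningless number, which is the caveat the comment raises.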


Study Identifies Key Factors Associated With Dementia Pathogenesis

#artificialintelligence

Recent research has identified independent predictors of dementia to include age at diagnosis, transient ischemic attack and stroke status, and years of education, with vascular factors playing a greater role in disease pathogenesis than previously thought. The findings were presented at the 2016 annual meeting of the American Academy of Neurology (AAN). In the abstract, the researchers wrote that dementia encompasses a broad set of neurologic diseases, producing progressive declines in memory and/or thinking faculties, sometimes alongside personality and emotional disturbances. "Worldwide, approximately 35.6 million people have dementia, and this number is only expected to grow due to an aging population," they wrote. "Unfortunately, it is exceedingly difficult to predict who will develop dementia, let alone what type. This makes it difficult to mobilize various preventive strategies supported by mounting evidence."


Learning Vector Quantization for Machine Learning - Machine Learning Mastery

#artificialintelligence

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that lets you choose how many training instances to hang onto and learns exactly what those instances should look like. In this post you will discover the Learning Vector Quantization algorithm. This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems.
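As a rough illustration of the point about keeping only a chosen number of instances, here is a small sketch of the basic LVQ1 update rule: the closest codebook vector is pulled toward a training sample of the same class and pushed away from one of a different class. This is not the code from the post; the data, the linearly decaying learning rate, and the deterministic initialization are assumptions made for the example.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_lvq(samples, labels, codebooks, codebook_labels, lr=0.3, epochs=20):
    """Return updated copies of the codebook vectors (LVQ1-style training)."""
    codebooks = [list(c) for c in codebooks]
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)  # linearly decaying learning rate
        for x, y in zip(samples, labels):
            # Best matching unit: the codebook vector closest to the sample.
            best = min(range(len(codebooks)),
                       key=lambda i: distance(codebooks[i], x))
            # Attract on a class match, repel on a mismatch.
            sign = 1.0 if codebook_labels[best] == y else -1.0
            codebooks[best] = [c + sign * rate * (xi - c)
                               for c, xi in zip(codebooks[best], x)]
    return codebooks

def predict(codebooks, codebook_labels, x):
    best = min(range(len(codebooks)), key=lambda i: distance(codebooks[i], x))
    return codebook_labels[best]

# Six training samples are summarized by just two codebook vectors.
samples = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
           [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
init = [[1.0, 1.0], [5.0, 5.0]]
init_labels = ["a", "b"]

model = train_lvq(samples, labels, init, init_labels)
print(predict(model, init_labels, [0.9, 1.0]))
print(predict(model, init_labels, [5.1, 5.0]))
```

After training, prediction needs only the codebook vectors, not the original samples, which is the memory advantage over k-nearest neighbors described above.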


K-Nearest Neighbors for Machine Learning - Machine Learning Mastery

#artificialintelligence

In this post you will discover the k-Nearest Neighbors (KNN) algorithm for classification and regression. After reading this post you will know. This post was written for developers and assumes no background in statistics or mathematics. The focus is on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.
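For readers who want a concrete picture of how k-nearest neighbors serves both classification and regression, the following is a minimal sketch, not the post's own code. It assumes Euclidean distance, a majority vote for classification, and an unweighted mean for regression; the function names and data are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Majority class among the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Mean target value among the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# The whole training set is kept; there is no separate fitting step.
X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
classes = ["red", "red", "red", "blue", "blue", "blue"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X, classes, [1.5, 1.5]))
print(knn_regress(X, targets, [6.5, 6.5]))
```

The only difference between the two tasks is how the neighbors' outputs are combined: a vote for class labels, an average for numeric targets.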


Feature extraction using Latent Dirichlet Allocation and Neural Networks: A case study on movie synopses

arXiv.org Machine Learning

Feature extraction has gained increasing attention in the field of machine learning, because informative features are crucial for detecting patterns, extracting information, or predicting future observations from big data. The process of extracting features is closely linked to dimensionality reduction, as it implies transforming the data from a sparse high-dimensional space to higher-level meaningful abstractions. This dissertation employs Neural Networks for distributed paragraph representations, and Latent Dirichlet Allocation to capture higher-level features of paragraph vectors. Although Neural Networks for distributed paragraph representations are considered the state of the art for extracting paragraph vectors, we show that a quick topic analysis model such as Latent Dirichlet Allocation can provide meaningful features too. We evaluate the two methods on the CMU Movie Summary Corpus, a collection of 25,203 movie plot summaries extracted from Wikipedia. Finally, for both approaches, we use K-Nearest Neighbors to discover similar movies, and plot the projected representations using T-Distributed Stochastic Neighbor Embedding to depict the context similarities. These similarities, expressed as movie distances, can be used for movie recommendation. The movies recommended by this approach are compared with the recommendations from IMDB, which uses a collaborative filtering approach, to show that our two models could constitute either an alternative or a supplementary recommendation approach.


Top 10 Machine Learning Algorithms

#artificialintelligence

Many articles have been written about the top machine learning algorithms. Most of them seem to define top as oldest, and thus most used, ignoring modern, efficient algorithms fit for big data, such as indexation, attribution modeling, collaborative filtering, or recommendation engines used by companies such as Amazon, Google, or Facebook. This morning I received an advertisement for a (self-published) book called Master Machine Learning Algorithms, and I could not resist posting the author's list of top 10 machine learning algorithms. Some of these techniques, such as Naive Bayes (variables are almost never uncorrelated), Linear Discriminant Analysis (clusters are almost never separated by hyperplanes), or Linear Regression (numerous model assumptions - including linearity - are almost always violated in real data), have been so abused that I would hesitate to teach them. This is not a criticism of the book; most textbooks mention pretty much the same algorithms, and this one, like them, skips all graph-related algorithms. Even k Nearest Neighbors has modern, fast implementations not covered in traditional books - we are indeed working on this topic and expect to have an article published shortly about it.


Data Science and Machine Learning for Preventing Fraud in Mom and Pop Ecommerce Shops

@machinelearnbot

With the development and growth of ecommerce platforms like Shopify, the number of small- and medium-sized ecommerce businesses is growing at an impressive rate. But this growth also creates more opportunities for the online villains and fraudsters who are looking to make a quick buck. It used to be that only huge corporations had the resources they needed to detect fraud and protect themselves from its damages. But, in this era of big data and data science for all, even small mom and pop ecommerce shops have access to the tools they need to protect themselves from fraudsters. This article introduces some common sources of fraud problems in ecommerce, and how you can use data science technologies or techniques to protect your business (or soon-to-be business) from risk.


Performance From Various Predictive Models

@machinelearnbot

Guest blog post by Dalila Benachenhou, originally posted here. Dalila is a professor at George Washington University. In this article, benchmarks were computed on a specific data set, for Geico Calls Prediction, comparing Random Forests, Neural Networks, SVM, FDA, K Nearest Neighbors, C5.0 (Decision Trees), Logistic Regression, and CART. Introduction: In the first blog, we decided on the predictors. We knew that different predictive models have different assumptions about their predictors.


Unsupervised Transductive Domain Adaptation

arXiv.org Machine Learning

Supervised learning with large-scale labeled datasets and deep layered models has brought about a paradigm shift in diverse areas of learning and recognition. However, this approach still suffers from generalization issues in the presence of a domain shift between the training and the test data distributions. In this regard, unsupervised domain adaptation algorithms have been proposed to directly address the domain shift problem. In this paper, we approach the problem from a transductive perspective. We incorporate the domain shift and the transductive target inference into our framework by jointly solving for an asymmetric similarity metric and the optimal transductive target label assignment. We also show that our model can easily be extended for deep feature learning in order to learn features which are discriminative in the target domain. Our experiments show that the proposed method outperforms state-of-the-art algorithms by a large margin in both object recognition and digit classification experiments.