 Nearest Neighbor Methods


Tutorial To Implement k-Nearest Neighbors in Python From Scratch - Machine Learning Mastery

#artificialintelligence

The k-Nearest Neighbors algorithm (or kNN for short) is an easy algorithm to understand and to implement, and a powerful tool to have at your disposal. In this tutorial you will implement the k-Nearest Neighbors algorithm from scratch in Python (2.7). The implementation will be specific to classification problems and will be demonstrated using the Iris flowers classification problem. This tutorial is for you if you are a Python programmer, or a programmer who can pick up Python quickly, and you are interested in how to implement the k-Nearest Neighbors algorithm from scratch. The model for kNN is the entire training dataset.
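
As a rough illustration of the idea that the "model" is just the stored training data, here is a minimal sketch of kNN classification. It is not the tutorial's own code: the function names, the tiny two-class dataset (a stand-in for Iris), and the choice of Euclidean distance with majority voting are assumptions made for the example, and it is written for Python 3.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    # There is no training step: every prediction scans the stored data.
    distances = sorted(
        (euclidean_distance(row, query), label)
        for row, label in zip(train_X, train_y)
    )
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups labeled "a" and "b".
train_X = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(train_X, train_y, (5.0, 5.0), k=3)
print(prediction_near_a, prediction_near_b)
```

Because every query is compared against every stored row, prediction cost grows with the size of the training set, which is the usual trade-off for having no explicit training phase.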


Machine learning for financial prediction: experimentation with David Aronson's latest work – part 2

#artificialintelligence

My first post on using machine learning for financial prediction took an in-depth look at various feature selection methods as a data pre-processing step in the quest to mine financial data for profitable patterns. I looked at various methods to identify predictive features including Maximal Information Coefficient (MIC), Recursive Feature Elimination (RFE), algorithms with built-in feature selection, selection via exhaustive search of possible generalized linear models, and the Boruta feature selection algorithm. I personally found the Boruta algorithm to be the most intuitive and elegant approach, but regardless of the method chosen, the same features seemed to keep on turning up in the results. In this post, I will take this analysis further and use these features to build predictive models that could form the basis of autonomous trading systems. Firstly, I'll provide an overview of the algorithms that I have found to generally perform well on this type of machine learning problem as well as those algorithms recommended by David Aronson (2013) in Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments (SSML). I'll also discuss a framework for measuring the performance of various models to facilitate robust comparison and model selection. Finally, I will discuss methods for combining predictions to produce ensembles that perform better than any of the constituent models alone.


Scaling_synthesized_data

#artificialintelligence

In particular, I checked out the k-Nearest Neighbors (k-NN) and logistic regression algorithms and saw how scaling numerical data strongly influenced the performance of the former but not that of the latter, as measured, for example, by accuracy (see Glossary below or previous articles for definitions of scaling, k-NN and other relevant terms). The real take-home message here was that preprocessing doesn't occur in a vacuum, that is, you can preprocess the heck out of your data, but the proof is in the pudding: how well does your model then perform? Scaling numerical data (that is, multiplying all instances of a variable by a constant in order to change that variable's range) has two related purposes: (i) if your measurements are in meters and mine are in miles, then, if we both scale our data, they end up being the same, and (ii) if two variables have vastly different ranges, the one with the larger range may dominate your predictive model, even though it may be less important to your target variable than the variable with the smaller range. What we saw is that the problem identified in (ii) occurs with k-NN, which explicitly looks at how close data are to one another, but not with logistic regression, which, when being trained, will shrink the relevant coefficient to account for the lack of scaling. As the data we used in the previous articles was real-world data, all we could see was how the models performed before and after scaling.
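
To make point (ii) concrete, here is a small sketch of how an unscaled, wide-range variable can dominate a nearest-neighbor lookup. The data are invented for illustration: the first feature is assumed to be the one that matters for the label, the second is assumed to be an unrelated quantity measured on a much larger scale, and the helper names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(rows, labels, query):
    """Label of the single closest row (1-nearest neighbor)."""
    distances = [euclidean(row, query) for row in rows]
    return labels[distances.index(min(distances))]


def scale(row, ranges):
    """Multiply each variable by a constant (1 / its range)."""
    return tuple(value / r for value, r in zip(row, ranges))


# Assumed toy data: feature 0 spans about 1 unit, feature 1 spans hundreds.
rows = [(0.1, 900.0), (0.9, 100.0)]
labels = ["low", "high"]
query = (0.15, 400.0)  # close to "low" on feature 0

# Unscaled: the wide-range feature decides the outcome.
unscaled_prediction = nearest_label(rows, labels, query)

# Scaled: both features contribute on a comparable footing.
ranges = [max(col) - min(col) for col in zip(*rows)]
scaled_rows = [scale(row, ranges) for row in rows]
scaled_prediction = nearest_label(scaled_rows, labels, scale(query, ranges))

print(unscaled_prediction, scaled_prediction)
```

With these numbers the unscaled lookup picks the "high" row because a gap of 300 units on the second feature swamps everything else, while the scaled lookup picks "low". Note that the query is scaled with the same constants as the training rows.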


TAO: System for Table Detection and Extraction from PDF Documents

AAAI Conferences

Digital documents present knowledge in most areas of study, exchanging and communicating information in a portable way. To better use the knowledge embedded in an ever-growing information source, effective tools for automatic information extraction are needed. Tables are crucial information elements in documents of scientific nature. Most publications use tables to represent and report concrete findings of research. Current methods used to extract table data from PDF documents lack precision in detecting, extracting, and representing data from diverse layouts. We present the system TAble Organization (TAO) to automatically detect, extract and organize information from tables in PDF documents. TAO uses processing based on the k-nearest neighbor method and layout heuristics to detect tables within a document and to extract table information. This system generates an enriched representation of the data extracted from tables in the PDF documents. TAO's performance is comparable to other table extraction methods, but it overcomes some related work limitations and proves to be more robust in experiments with diverse document layouts.


Parallelizing Instance-Based Data Classifiers

AAAI Conferences

In the age of BigData, producing results quickly while operating over vast volumes of data has become a vital requirement for data mining and machine learning applications to a degree that traditional serial algorithms can no longer keep up with these constraints. This paper applies different forms of parallelization techniques to popular instance-based classifiers, namely a special form of naive Bayes and k-nearest neighbors, in an attempt to compare performance and make broad conclusions applicable to instance-based classifiers. Overall, our experimental results strongly indicate that parallelism over test instances provides the most speedup in most cases compared to other forms of parallelism.
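
The abstract does not show the paper's implementation, so the following is only a sketch of what "parallelism over test instances" can look like for k-nearest neighbors: each test instance is classified independently, so the test set can be mapped across workers. The data, function names, and use of Python's `concurrent.futures` are assumptions for illustration. Threads are used here only to show the decomposition; for CPU-bound pure-Python code, CPython's global interpreter lock limits thread speedup, and `ProcessPoolExecutor` offers the same `map` interface.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def classify(train, k, query):
    """Classify one test instance by majority vote among its k nearest rows."""
    def distance(item):
        features, _ = item
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(features, query)))

    neighbors = sorted(train, key=distance)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Hypothetical training data: (features, label) pairs.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
queries = [(0.05, 0.05), (1.0, 0.95), (0.15, 0.1), (0.95, 1.0)]

classify_one = partial(classify, train, 3)

# Serial baseline.
serial_predictions = [classify_one(q) for q in queries]

# Parallel over test instances: map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_predictions = list(pool.map(classify_one, queries))

print(parallel_predictions)
```

The training data is shared read-only and no worker depends on another's result, which is what makes this form of parallelism straightforward to apply.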


EigenTransitions with Hypothesis Testing: The Anatomy of Urban Mobility

AAAI Conferences

Identifying the patterns in urban mobility is important for a variety of tasks such as transportation planning, urban resource allocation, emergency planning, etc. This is evident from the large body of research on the topic, which has exploded with the vast amount of geo-tagged user-generated content from online social media. However, most of the existing work focuses on a specific setting, taking a statistical approach to describe and model the observed patterns. In contrast, in this work we introduce EigenTransitions, a spectrum-based, generic framework for analyzing spatio-temporal mobility datasets. EigenTransitions capture the anatomy of the aggregate and/or individuals' mobility as a compact set of latent mobility patterns. Using a large corpus of geo-tagged content collected from Twitter, we utilize EigenTransitions to analyze the structure of urban mobility. In particular, we identify the EigenTransitions of a flow network between urban areas and derive a hypothesis testing framework to evaluate urban mobility from both temporal and demographic perspectives. We further show how EigenTransitions not only identify latent mobility patterns, but also have the potential to support applications such as mobility prediction and inter-city comparisons. In particular, by identifying neighbors with similar latent mobility patterns and incorporating their historical transition behaviors, we propose an EigenTransitions-based k-nearest neighbor algorithm, which can significantly improve the performance of individual mobility prediction. The proposed method is especially effective in "cold-start" scenarios where traditional methods are known to perform poorly.


Recognizing Snacks using SimpleCV

#artificialintelligence

This article aims to provide the basic knowledge of how to recognize snacks by using Python and SimpleCV. Readers will gain practical programming knowledge via experimentation with the Python scripts included in the Snack Classifier open source project. To illustrate with a snacks recognition app, the Snack Watcher watches any snacks present on the snack table. For Snack Watcher to determine if there was an interesting event, it needs to process the image into a set of image "Blobs". For each "Blob", Snack Watcher compares the "Blob" with its previous state to determine if the "Blob" was added, removed or stationary.


K Nearest Neighbors Application - Practical Machine Learning Tutorial with Python p.14

#artificialintelligence

In the last part we introduced Classification, which is a supervised form of machine learning, and explained the K Nearest Neighbors algorithm intuition. In this tutorial, we're actually going to apply a simple example of the algorithm using Scikit-Learn, and then in the subsequent tutorials we'll build our own algorithm to learn more about how it works under the hood. To exemplify classification, we're going to use a Breast Cancer Dataset, which is a dataset donated to the University of California, Irvine (UCI) collection from the University of Wisconsin-Madison. UCI has a large Machine Learning Repository.


Clustering idea for very large datasets

@machinelearnbot

Let's say you have to cluster 10 million points, for instance keywords. So, in short, you can perform k-NN (k-nearest neighbors) clustering or some other types of clustering, which are typically O(n^2) or worse from a computational complexity point of view. Has anyone ever used a clustering method based on sampling? The idea is to start by sampling 1% (or less) of the 10,000,000 entries, and perform clustering on these pairs of keywords, to create a "seed" or "baseline" cluster structure. The next step is to browse sequentially through your 10,000,000 keywords, and for each keyword, find the closest cluster from the baseline cluster structure.
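
The two-step idea can be sketched as follows. This is only an illustration under several assumptions not in the post: 2-D numeric points stand in for keywords, a basic k-means with a farthest-point initialization stands in for whatever clustering is run on the sample, the dataset is tiny, and all names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_index(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def farthest_point_init(sample, k):
    """Start from one point, then repeatedly add the point farthest from all chosen."""
    centroids = [sample[0]]
    while len(centroids) < k:
        centroids.append(max(sample, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(sample, k, iterations=10):
    """Basic k-means on the sample only; returns the seed centroids."""
    centroids = farthest_point_init(sample, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[nearest_index(p, centroids)].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


rng = random.Random(0)
blob_a = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
blob_b = [(rng.uniform(9, 11), rng.uniform(9, 11)) for _ in range(1000)]
points = blob_a + blob_b

# Step 1: cluster a 1% sample to get the baseline structure.
sample = rng.sample(points, len(points) // 100)
seed_centroids = kmeans(sample, k=2)

# Step 2: one sequential pass assigning every point to its closest seed cluster.
assignments = [nearest_index(p, seed_centroids) for p in points]
print(len(sample), sorted(set(assignments)))
```

The expensive clustering only ever sees the sample, and the final pass costs one distance computation per point per seed cluster, so it grows linearly with the number of points rather than quadratically. The quality of the result depends on the sample being representative of the full dataset.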


DIY Recommendation Engines for Mom and Pop Ecommerce Shops

#artificialintelligence

Of course we have all heard about machine learning and recommendation engines in big business ecommerce. For quite some time, massive ecommerce businesses like Netflix, Amazon, and eBay have been leveraging the power of data science to improve customer service and boost sales. Where once this technology was cost-prohibitive to all but the major players, recently things have changed. Thanks to multi-channel ecommerce platforms like Shopify, and the developers who are building custom machine learning add-ons, now mom and pop online businesses get the chance to infuse their operations with the power of data science. In this article I introduce how machine learning algorithms work to produce recommendation systems for small business ecommerce.