 Nearest Neighbor Methods


Neural Neighborhood Encoding for Classification

arXiv.org Machine Learning

Inspired by the fruit-fly olfactory circuit, the Fly Bloom Filter [Dasgupta et al., 2018] is able to efficiently summarize the data with a single pass and has been used for novelty detection. We propose a new classifier (for binary and multi-class classification) that effectively encodes the different local neighborhoods for each class with a per-class Fly Bloom Filter. The inference on test data requires an efficient FlyHash [Dasgupta et al., 2017] operation followed by a high-dimensional, but sparse, dot product with the per-class Bloom Filters. The learning is trivially parallelizable. On the theoretical side, we establish conditions under which the prediction of our proposed classifier on any test example agrees with the prediction of the nearest neighbor classifier with high probability. We extensively evaluate our proposed scheme with over 50 data sets of varied data dimensionality to demonstrate that the predictive performance of our proposed neuroscience-inspired classifier is competitive with the nearest-neighbor classifiers and other single-pass classifiers.
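The abstract's pipeline (hash the input, then take a sparse dot product with one filter per class) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the binary (0/1) filters, the fixed fan-in of the random projection, and the rule of predicting the class whose filter gives the lowest score are all hypothetical choices.

```python
import random

def make_projection(dim, m, fan_in, seed=0):
    """Sparse binary projection: each of m output units reads fan_in input indices."""
    rng = random.Random(seed)
    return [rng.sample(range(dim), fan_in) for _ in range(m)]

def fly_hash(x, projection, k):
    """Return the indices of the k most active output units (winner-take-all)."""
    activations = [sum(x[i] for i in idx) for idx in projection]
    order = sorted(range(len(activations)), key=lambda j: (-activations[j], j))
    return set(order[:k])

def train(examples, projection, k):
    """Single pass: each class filter starts at all ones; bits hit by an example are cleared."""
    m = len(projection)
    filters = {}
    for x, label in examples:
        w = filters.setdefault(label, [1] * m)
        for j in fly_hash(x, projection, k):
            w[j] = 0
    return filters

def predict(x, filters, projection, k):
    """Score each class by the sparse dot product filter . hash(x); lowest score wins."""
    active = fly_hash(x, projection, k)
    scores = {label: sum(w[j] for j in active) for label, w in filters.items()}
    return min(scores, key=scores.get), scores

# Hypothetical toy data: two well-separated points, one per class.
examples = [([1.0, 0.9, 0.0, 0.1], "a"), ([0.0, 0.1, 1.0, 0.9], "b")]
projection = make_projection(dim=4, m=32, fan_in=2)
filters = train(examples, projection, k=4)
label, scores = predict(examples[0][0], filters, projection, k=4)
```

In this sketch a training example always scores 0 against its own class filter, because every bit it activates was cleared during training. Each class filter depends only on that class's examples, so the filters can be built independently, consistent with the abstract's remark that learning is trivially parallelizable.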


A Formally Robust Time Series Distance Metric

arXiv.org Machine Learning

Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily "bad" contamination and has a worst-case computational complexity of $\mathcal{O}(n\log n)$. We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.


Machine Learning Algorithms

#artificialintelligence

Arthur Samuel (1959): "Field of study that gives computers the ability to learn without being explicitly programmed". Tom Mitchell (1997): "A computer program is said to learn if its performance at a task T, as measured by a performance measure P, improves with experience E". Selecting the right machine-learning algorithm depends on several factors, including the size, quality and nature of the data. Choosing the right algorithm is a combination of business need, specification, experimentation and time available. Here we will explore different machine learning algorithms. In supervised learning, we provide a known dataset that includes inputs and desired outputs.


K-Nearest Neighbors Algorithm

#artificialintelligence

KNN is a non-parametric and lazy learning algorithm. Non-parametric means that no assumption is made about the underlying data distribution. In other words, the model structure is determined from the dataset. This is very helpful in practice, where most real-world datasets do not follow theoretical mathematical assumptions. KNN is one of the simplest and most traditional non-parametric techniques for classifying samples. Given an input vector, KNN calculates the distances between that vector and the labeled vectors, and then assigns each point that is not yet labeled to the class of its K nearest neighbors. Lazy means that the algorithm does not build a model from the training data points ahead of time. All training data is used in the testing phase.
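The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and a simple majority vote among the K nearest points; the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No model is built beforehand: every training point is compared at query time.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters in the plane.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
near_far_cluster = knn_predict(points, labels, (5.5, 5.5), k=3)
```

Because all the work happens inside `knn_predict`, the cost of each prediction grows with the size of the training set, which is the practical consequence of the algorithm being lazy.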


Heterogeneous Swarms for Maritime Dynamic Target Search and Tracking

arXiv.org Artificial Intelligence

Current strategies employed for maritime target search and tracking are primarily based on the use of agents following a predetermined path to perform a systematic sweep of a search area. Recently, dynamic Particle Swarm Optimization (PSO) algorithms have been used together with swarming multi-robot systems (MRS), giving search and tracking solutions the added properties of robustness, scalability, and flexibility. Swarming MRS also give the end-user the opportunity to incrementally upgrade the robotic system, inevitably leading to the use of heterogeneous swarming MRS. However, such systems have not been well studied and incorporating upgraded agents into a swarm may result in degraded mission performances. In this paper, we propose a PSO-based strategy using a topological k-nearest neighbor graph with tunable exploration and exploitation dynamics with an adaptive repulsion parameter. This strategy is implemented within a simulated swarm of 50 agents with varying proportions of fast agents tracking a target represented by a fictitious binary function. Through these simulations, we are able to demonstrate an increase in the swarm's collective response level and target tracking performance by substituting in a proportion of fast buoys.


Innovative Platform for Designing Hybrid Collaborative & Context-Aware Data Mining Scenarios

arXiv.org Artificial Intelligence

The process of knowledge discovery nowadays involves a large number of techniques. Context-Aware Data Mining (CADM) and Collaborative Data Mining (CDM) are some of the recent ones. The current research proposes a new hybrid and efficient tool to design prediction models called Scenarios Platform-Collaborative & Context-Aware Data Mining (SP-CCADM). Both CADM and CDM approaches are included in the new platform in a flexible manner; SP-CCADM allows the setting and testing of multiple configurable scenarios related to data mining at once. The introduced platform was successfully tested and validated on real-life scenarios, providing better results than each standalone technique (CADM and CDM). Nevertheless, SP-CCADM was validated with various machine learning algorithms: k-Nearest Neighbour (k-NN), Deep Learning (DL), Gradient Boosted Trees (GBT) and Decision Trees (DT). SP-CCADM makes a step forward when confronting complex data, properly approaching data contexts and collaboration between data. Numerical experiments and statistics illustrate in detail the potential of the proposed platform.


Machine Learning for a Better Developer Experience

#artificialintelligence

Imagine having to go through 2.5GB of log entries from a failed software build -- 3 million lines -- to search for a bug or a regression that happened on line 1M. However, one smart approach to make it tractable might be to diff the lines against a recent successful build, with the hope that the bug produces unusual lines in the logs. Standard md5 diff would run quickly but still produce at least hundreds of thousands of candidate lines to look through because it surfaces character-level differences between lines. Fuzzy diffing using k-nearest neighbors clustering from machine learning (the kind of thing logreduce does) produces around 40,000 candidate lines but takes an hour to complete. Our solution produces 20,000 candidate lines in 20 min of computing -- and thanks to the magic of open source, it's only about a hundred lines of Python code.
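To make the baseline concrete, here is a minimal sketch of what an exact, hash-based line diff might look like. It assumes that "md5 diff" means hashing every line of the successful build and reporting the lines of the failed build whose hash was never seen; the function name `novel_lines` and the sample log lines are hypothetical, and this is not the article's own solution.

```python
import hashlib

def line_digest(line):
    """MD5 digest of a single log line."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

def novel_lines(baseline_lines, target_lines):
    """Return the lines of the target log whose exact hash is absent from the baseline."""
    seen = {line_digest(line) for line in baseline_lines}
    return [line for line in target_lines if line_digest(line) not in seen]

# Hypothetical logs: a timestamp change alone is enough to flag a line.
good_build = ["12:00:01 build start", "12:00:05 compile ok", "12:00:09 tests ok"]
bad_build = ["12:00:01 build start", "12:00:06 compile ok", "12:00:09 tests failed"]
candidates = novel_lines(good_build, bad_build)
```

The second line differs from the baseline only in its timestamp, yet it is reported alongside the real failure. That is the weakness the paragraph describes: exact hashing is fast, but any character-level difference yields a candidate line.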


K-NN classification. Why it is interesting for beginners?

#artificialintelligence

According to experience, this is one of the most interesting and easy-to-use algorithms, and it makes classification very easy. What is the K-Nearest Neighbour (KNN) algorithm? Why do we need the KNN algorithm? When do we use the KNN algorithm? How do we select K for the KNN algorithm?


KNNImputer

#artificialintelligence

The idea in kNN methods is to identify 'k' samples in the dataset that are similar or close in the space. Then we use these 'k' samples to estimate the value of the missing data points. Each sample's missing values are imputed using the mean value of the 'k' neighbors found in the dataset. Let's look at an example to understand this. Consider the following observations in a two-dimensional space: (2,0), (2,2), (3,3).
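The imputation rule above can be sketched in plain Python. This is a simplified illustration, not the scikit-learn `KNNImputer` implementation. It assumes that distance is the Euclidean distance over the features both rows have observed, and that only rows with the needed feature present can serve as neighbors. The function name `knn_impute` and the fourth row `(2, None)`, added to the three observations so that there is something to impute, are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""
    def distance(a, b):
        # Compare only the coordinates observed in both rows.
        shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((u - v) ** 2 for u, v in shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors are the other rows that have feature j observed.
            candidates = sorted(
                (distance(row, other), n)
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )
            nearest = [rows[n][j] for _, n in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# The three observations from the text plus one hypothetical incomplete row.
data = [(2, 0), (2, 2), (3, 3), (2, None)]
completed = knn_impute(data, k=2)
```

With k=2, the incomplete row is closest to (2,0) and (2,2), which share its first coordinate, so the missing value becomes the mean of 0 and 2, which is 1.0.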


Every Machine Learning Algorithm Can Be Represented as a Neural Network

#artificialintelligence

It seems that all of the work in machine learning -- starting from early research in the 1950s -- culminated in the creation of the neural network. Successively, algorithm after new algorithm was proposed, from logistic regression to support vector machines, but the neural network is, very literally, the algorithm of algorithms and the pinnacle of machine learning. It's a universal generalization of what machine learning is, instead of one attempt at doing it. In this sense, it is more of a framework and a concept than simply an algorithm, and this is evident given the massive amount of freedom in constructing neural networks -- hidden layer & node counts, activation functions, optimizers, loss functions, network types (convolutional, recurrent, etc.), and specialized layers (batch norm, dropout, etc.), to name a few. From this perspective of neural networks being a concept rather than a rigid algorithm comes a very interesting corollary: any machine learning algorithm, be it decision trees or k-nearest neighbors, can be represented using a neural network.