AITopics | Nearest Neighbor Methods

Nearest Neighbor Methods

Machine Learning Resources for Spam Detection

@machinelearnbot | Mar-20-2016, 19:40:48 GMT

Spam is a kind of messaging where the cost of sending is usually negligible, while the receiver and the ISP pay the cost in terms of bandwidth usage. An example of a manual approach to detecting spam is knowledge engineering: if the subject line of an email contains the words 'Buy viagra', it is spam. These rules can be configured by the user or by the email provider, and if correctly thought out and executed, this technique can be used effectively to combat spam. This is a blog post about one such implementation. However, a manual, rule-based approach doesn't scale, because active human spammers circumvent any manual rules.
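
As a rough illustration of the knowledge-engineering approach described above, the following minimal sketch checks a subject line against hand-written patterns. The rule list and the function name `is_spam` are hypothetical and are not taken from the blog post; a real filter would inspect much more than the subject line.

```python
import re

# Hypothetical hand-written rules (illustrative only): each entry is a
# regular expression matched case-insensitively against the subject line.
SUBJECT_RULES = [r"buy\s+viagra", r"free\s+money"]


def is_spam(subject, rules=SUBJECT_RULES):
    """Return True if any rule pattern occurs somewhere in the subject."""
    return any(re.search(pattern, subject, flags=re.IGNORECASE)
               for pattern in rules)


if __name__ == "__main__":
    print(is_spam("Buy Viagra today"))    # matches the first rule
    print(is_spam("Quarterly report"))    # matches no rule
    print(is_spam("Buy v1agra today"))    # a trivial misspelling slips through
```

The last call shows why such rules do not scale: a one-character change in the subject is enough to evade the pattern, so the rule set has to be extended by hand each time.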

machine learning, spam, spam filtering, (11 more...)

Industry: Education (0.40)

Technology:

Information Technology > Security & Privacy > Spam Filtering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.40)

A Spectral Series Approach to High-Dimensional Nonparametric Regression

Lee, Ann B., Izbicki, Rafael

arXiv.org Machine Learning | Jan-31-2016

A key question in modern statistics is how to make fast and reliable inferences for complex, high-dimensional data. While there has been much interest in sparse techniques, current methods do not generalize well to data with nonlinear structure. In this work, we present an orthogonal series estimator for predictors that are complex aggregate objects, such as natural images, galaxy spectra, trajectories, and movies. Our series approach ties together ideas from kernel machine learning and Fourier methods. We expand the unknown regression on the data in terms of the eigenfunctions of a kernel-based operator, and we take advantage of orthogonality of the basis with respect to the underlying data distribution, P, to speed up computations and tuning of parameters. If the kernel is appropriately chosen, then the eigenfunctions adapt to the intrinsic geometry and dimension of the data. We provide theoretical guarantees for a radial kernel with varying bandwidth, and we relate smoothness of the regression function with respect to P to sparsity in the eigenbasis. Finally, using simulated and real-world data, we systematically compare the performance of the spectral series approach with classical kernel smoothing, k-nearest neighbors regression, kernel ridge regression, and state-of-the-art manifold and local regression methods.

artificial intelligence, lee and izbicki nonparametric regression, machine learning, (14 more...)

arXiv.org Machine Learning

doi: 10.1214/16-EJS1112

1602.00355

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)

k-Nearest Neighbour Classification of Datasets with a Family of Distances

Hatko, Stan

arXiv.org Machine Learning | Nov-28-2015

The $k$-nearest neighbour ($k$-NN) classifier is one of the oldest and most important supervised learning algorithms for classifying datasets. Traditionally the Euclidean norm is used as the distance for the $k$-NN classifier. In this thesis we investigate the use of alternative distances for the $k$-NN classifier. We start by introducing some background notions in statistical machine learning. We define the $k$-NN classifier and discuss Stone's theorem and the proof that $k$-NN is universally consistent on the normed space $R^d$. We then prove that $k$-NN is universally consistent if we take a sequence of random norms (that are independent of the sample and the query) from a family of norms that satisfies a particular boundedness condition. We extend this result by replacing norms with distances based on uniformly locally Lipschitz functions that satisfy certain conditions. We discuss the limitations of Stone's lemma and Stone's theorem, particularly with respect to quasinorms and adaptively choosing a distance for $k$-NN based on the labelled sample. We show the universal consistency of a two stage $k$-NN type classifier where we select the distance adaptively based on a split labelled sample and the query. We conclude by giving some examples of improvements of the accuracy of classifying various datasets using the above techniques.
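
To make the idea of swapping the distance in a $k$-NN classifier concrete, here is a minimal sketch with a pluggable distance function. It uses the Minkowski $p$-norm family as an example of alternative distances; this choice, the function names, and the toy data are assumptions for illustration and are not the construction studied in the thesis.

```python
from collections import Counter


def minkowski(x, y, p=2.0):
    """Distance induced by the p-norm; p=2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_predict(train, query, k=3, distance=minkowski):
    """Majority vote among the k training points closest to the query.

    train is a list of (point, label) pairs. Ties in the vote are broken
    by whichever tied label was encountered first among the neighbours.
    """
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
    # Euclidean distance (the traditional choice)
    print(knn_predict(train, (0.2, 0.1), k=3))
    # Same classifier with the 1-norm (Manhattan) distance instead
    manhattan = lambda x, y: minkowski(x, y, p=1.0)
    print(knn_predict(train, (5.5, 5.5), k=3, distance=manhattan))
```

Passing the distance as an argument keeps the voting rule fixed while the geometry changes, which is the setting in which the consistency questions in the abstract are posed.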

artificial intelligence, machine learning, sequence, (17 more...)

arXiv.org Machine Learning

1512.00001

Country:

North America > United States (0.28)
North America > Canada (0.27)
Europe > Austria (0.27)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.42)

The Ancient Art of the Numerati

#artificialintelligence | Nov-9-2015, 16:51:50 GMT

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. It is available as a free download under a Creative Commons license. You are free to share the book, translate it, or remix it. Before you is a tool for learning basic data mining techniques. Most data mining textbooks focus on providing a theoretical foundation for data mining and, as a result, may seem notoriously difficult to understand. Don't get me wrong, the information in those books is extremely important.

ancient art, data mining technique, numerati, (5 more...)

Technology:

Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.32)

Point Localization and Density Estimation from Ordinal kNN graphs using Synchronization

Cucuringu, Mihai, Woodworth, Joseph

arXiv.org Machine Learning | Nov-5-2015

We consider the problem of embedding unweighted, directed k-nearest neighbor graphs in low-dimensional Euclidean space. The k-nearest neighbors of each vertex provide ordinal information on the distances between points, but not the distances themselves. We use this ordinal information along with the low-dimensionality to recover the coordinates of the points up to arbitrary similarity transformations (rigid transformations and scaling). Furthermore, we also illustrate the possibility of robustly recovering the underlying density via the Total Variation Maximum Penalized Likelihood Estimation (TV-MPLE) method. We make existing approaches scalable by using an instance of a local-to-global algorithm based on group synchronization, recently proposed in the literature in the context of sensor network localization and structural biology, which we augment with a scaling synchronization step. We demonstrate the scalability of our approach on large graphs, and show how it compares to the Local Ordinal Embedding (LOE) algorithm, which was recently proposed for recovering the configuration of a cloud of points from pairwise ordinal comparisons between a sparse set of distances.
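
The input to this problem can be illustrated with a small sketch that builds an unweighted, directed k-nearest neighbor graph from known points. It only shows the forward direction (points to graph), not the embedding method of the paper; the function name `knn_graph` and the toy coordinates are hypothetical.

```python
import math


def knn_graph(points, k):
    """Map each index i to the sorted indices of its k nearest other points.

    The neighbour lists are sorted by index, not by distance, so the result
    carries no weights and no ordering among the k neighbours.
    """
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = sorted(others[:k])
    return graph


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (3, 0), (7, 0)]
    print(knn_graph(points, k=1))
    # Scaling and shifting the points (a similarity transformation)
    moved = [(2 * x + 5, 2 * y - 1) for x, y in points]
    print(knn_graph(moved, k=1) == knn_graph(points, k=1))
```

In the toy example, point 2 lists point 1 as its nearest neighbor while point 1 lists point 0, so the graph is genuinely directed. The graph is also unchanged by scaling and translation, which is consistent with the abstract's statement that coordinates can be recovered only up to similarity transformations.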

artificial intelligence, graph, machine learning, (17 more...)

arXiv.org Machine Learning

1504.00722

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.95)

Some Theory For Practical Classifier Validation

Bax, Eric, Le, Ya

arXiv.org Machine Learning | Oct-9-2015

We compare and contrast two approaches to validating a trained classifier while using all in-sample data for training. One is simultaneous validation over an organized set of hypotheses (SVOOSH), the well-known method that began with VC theory. The other is withhold and gap (WAG). WAG withholds a validation set, trains a holdout classifier on the remaining data, uses the validation data to validate that classifier, then adds the rate of disagreement between the holdout classifier and one trained using all in-sample data, which is an upper bound on the difference in error rates. We show that complex hypothesis classes and limited training data can make WAG a favorable alternative.
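
The arithmetic behind withhold and gap can be sketched as follows. If the full classifier errs on an input, then either the holdout classifier errs there too or the two classifiers disagree, so the full classifier's error rate is at most the holdout error rate plus the disagreement rate. The sketch below computes only these two empirical rates and their sum. It omits the statistical confidence terms a real bound needs, and it assumes the disagreement rate is measured on a separate sample of inputs; the function names are hypothetical.

```python
def error_rate(predict, labelled):
    """Fraction of (input, label) pairs that predict gets wrong."""
    return sum(predict(x) != y for x, y in labelled) / len(labelled)


def disagreement_rate(predict_a, predict_b, inputs):
    """Fraction of inputs on which the two classifiers differ."""
    return sum(predict_a(x) != predict_b(x) for x in inputs) / len(inputs)


def wag_estimate(holdout_predict, full_predict, validation, inputs):
    """Holdout validation error plus the holdout-vs-full disagreement rate.

    This is a point estimate only; no confidence terms are included.
    """
    return (error_rate(holdout_predict, validation)
            + disagreement_rate(holdout_predict, full_predict, inputs))


if __name__ == "__main__":
    holdout = lambda x: x > 0      # toy classifier trained without the validation set
    full = lambda x: x > 1         # toy classifier trained on all in-sample data
    validation = [(-1, False), (2, True), (0.5, False), (3, True)]
    inputs = [0.5, 1.0, 2.0, -3.0]
    print(wag_estimate(holdout, full, validation, inputs))
```

Labels are needed only for the first term. The disagreement term compares predictions, so it can be measured on unlabelled inputs.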

artificial intelligence, classifier, machine learning, (18 more...)

arXiv.org Machine Learning

1510.02676

Country:

North America > Canada (0.14)
Europe > United Kingdom > England (0.14)

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.31)

A nonlinear aggregation type classifier

Cholaquidis, Alejandro, Fraiman, Ricardo, Kalemkerian, Juan, Llop, Pamela

arXiv.org Machine Learning | Sep-8-2015

Supervised classification is still one of the hot topics for high dimensional and functional data due to the importance of their applications and the intrinsic difficulty in a general setup. In this context, there is a vast literature on classification methods, which include linear classification, k-nearest neighbors and kernel rules, classification based on partial least squares, reproducing kernels or depth measures. Complete surveys of the literature are the works by Baíllo et al. [1], Cuevas [13] and Delaigle and Hall [16]. In the book Contributions in infinite-dimensional statistics and related topics [7], there are also several recent advances in supervised and unsupervised classification. See for instance, Chapters 2, 5, 22 or 48, or directly, Chapter 1 of this issue (Bongiorno et al. [6]). In this context, there has very recently been great interest in developing aggregation methods. In particular, there is a large list of linear aggregation methods like boosting (Breiman [8], Breiman [9]), random forest (Breiman [10], Biau et al. [3], Biau [5]), among others. All these methods exhibit an important improvement when combining a subset of classifiers to produce a new one. Most of the contributions to the aggregation literature have been proposed for nonparametric regression, a problem closely related to classification rules, which can be obtained just by plugging in the estimate of the regression function into the Bayes rule (see for instance, Yang [19] and Bunea et al. [11]).

artificial intelligence, classifier, machine learning, (13 more...)

arXiv.org Machine Learning

1509.01604

Genre:

Research Report (0.82)
Overview (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.69)

Stabilized Nearest Neighbor Classifier and Its Statistical Properties

Sun, Wei, Qiao, Xingye, Cheng, Guang

arXiv.org Machine Learning | Aug-30-2015

The stability of statistical analysis is an important indicator for reproducibility, which is a main principle of the scientific method. It entails that similar statistical conclusions can be reached based on independent samples from the same underlying population. In this paper, we introduce a general measure of classification instability (CIS) to quantify the sampling variability of the prediction made by a classification method. Interestingly, the asymptotic CIS of any weighted nearest neighbor classifier turns out to be proportional to the Euclidean norm of its weight vector. Based on this concise form, we propose a stabilized nearest neighbor (SNN) classifier, which distinguishes itself from other nearest neighbor classifiers by taking the stability into consideration. In theory, we prove that SNN attains the minimax optimal convergence rate in risk, and a sharp convergence rate in CIS. The latter rate result is established for general plug-in classifiers under a low-noise condition. Extensive simulated and real examples demonstrate that SNN achieves a considerable improvement in CIS over existing nearest neighbor classifiers, with comparable classification accuracy. We implement the algorithm in a publicly available R package snn.
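
The quantity the abstract ties to instability, the Euclidean norm of the weight vector, is easy to compute for a given weighted nearest neighbor rule. The sketch below does so for ordinary k-NN, whose weights are 1/k on the k nearest points and 0 elsewhere, giving a norm of 1/sqrt(k). It does not implement SNN or the R package snn; the function names, the restriction to one-dimensional points, and the binary 0/1 labels are simplifying assumptions.

```python
import math


def weight_norm(weights):
    """Euclidean norm of a weight vector."""
    return math.sqrt(sum(w * w for w in weights))


def uniform_knn_weights(n, k):
    """Weights of plain k-NN on n training points: 1/k on the k nearest."""
    return [1.0 / k] * k + [0.0] * (n - k)


def weighted_nn_predict(train, query, weights):
    """Weighted nearest neighbor rule for 1-D points with labels 0 or 1.

    The i-th nearest training point receives weights[i]; the prediction is 1
    when the weighted vote reaches 1/2 (weights are assumed to sum to 1).
    """
    ordered = sorted(train, key=lambda item: abs(item[0] - query))
    score = sum(w * label for w, (_, label) in zip(weights, ordered))
    return 1 if score >= 0.5 else 0


if __name__ == "__main__":
    for k in (1, 4, 9):
        print(k, weight_norm(uniform_knn_weights(10, k)))
    train = [(0.0, 0), (0.1, 0), (0.2, 1), (1.0, 1), (1.1, 1)]
    print(weighted_nn_predict(train, 0.05, uniform_knn_weights(5, 3)))
```

For uniform weights the norm shrinks as k grows, so under the proportionality stated in the abstract a larger k corresponds to a smaller asymptotic CIS.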

artificial intelligence, classifier, machine learning, (16 more...)

arXiv.org Machine Learning

1405.6642

Country:

North America > United States > Indiana (0.28)
North America > United States > California (0.28)

Genre: Research Report > New Finding (0.45)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (1.00)

Solving the Partial Label Learning Problem: An Instance-Based Approach

Zhang, Min-Ling (Southeast University) | Yu, Fei (Southeast University)

AAAI Conferences | Jul-15-2015

In partial label learning, each training example is associated with a set of candidate labels, among which only one is valid. An intuitive strategy to learn from partial label examples is to treat all candidate labels equally and make predictions by averaging their modeling outputs. Nonetheless, this strategy may suffer from the problem that the modeling output from the valid label is overwhelmed by those from the false positive labels. In this paper, an instance-based approach named IPAL is proposed by directly disambiguating the candidate label set. Briefly, IPAL tries to identify the valid label of each partial label example via an iterative label propagation procedure, and then classifies the unseen instance based on minimum error reconstruction from its nearest neighbors. Extensive experiments show that IPAL compares favorably against the existing instance-based as well as other state-of-the-art partial label learning approaches.

algorithm, candidate label, partial label, (13 more...)

AAAI Conferences

Twenty-Fourth International Joint Conference on Artificial Intelligence

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
Asia > China > Beijing > Beijing (0.04)
(8 more...)

Industry: Education > Focused Education > Special Education (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.31)

Image Data Compression for Covariance and Histogram Descriptors

Kusner, Matt J., Kolkin, Nicholas I., Tyree, Stephen, Weinberger, Kilian Q.

arXiv.org Machine Learning | May-23-2015

Covariance and histogram image descriptors provide an effective way to capture information about images. Both excel when used in combination with special purpose distance metrics. For covariance descriptors these metrics measure the distance along the non-Euclidean Riemannian manifold of symmetric positive definite matrices. For histogram descriptors the Earth Mover's distance measures the optimal transport between two histograms. Although more precise, these distance metrics are very expensive to compute, making them impractical in many applications, even for data sets of only a few thousand examples. In this paper we present two methods to compress the size of covariance and histogram datasets with only marginal increases in test error for k-nearest neighbor classification. Specifically, we show that we can reduce data sets to 16% and in some cases as little as 2% of their original size, while approximately matching the test error of kNN classification on the full training set. In fact, because the compressed set is learned in a supervised fashion, it sometimes even outperforms the full data set, while requiring only a fraction of the space and drastically reducing test-time computation.

artificial intelligence, descriptor, machine learning, (18 more...)

arXiv.org Machine Learning

1412.174

Genre: Research Report (0.64)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.87)
