Zhang, Debing (Zhejiang University) | Yang, Genmao (Zhejiang University) | Hu, Yao (Zhejiang University) | Jin, Zhongming (Zhejiang University) | Cai, Deng (Zhejiang University) | He, Xiaofei (Zhejiang University)
Nowadays, Nearest Neighbor Search becomes more and more important when facing the challenge of big data. Traditionally, to solve this problem, researchers mainly focus on building effective data structures such as hierarchical k-means tree or using hashing methods to accelerate the query process. In this paper, we propose a novel unified approximate nearest neighbor search scheme to combine the advantages of both the effective data structure and the fast Hamming distance computation in hashing methods. In this way, the searching procedure can be further accelerated. Computational complexity analysis and extensive experiments have demonstrated the effectiveness of our proposed scheme.
Can we leverage learning techniques to build a fast nearest-neighbor (NN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures.
A host of different tasks--such as identifying the song in a database most similar to your favorite song, or the drug most likely to interact with a given molecule--have the same basic problem at their core: finding the point in a dataset that is closest to a given point. This "nearest neighbor" problem shows up all over the place in machine learning, pattern recognition, and data analysis, as well as many other fields. Yet the nearest neighbor problem is not really a single problem. Instead, it has as many different manifestations as there are different notions of what it means for data points to be similar. In recent decades, computer scientists have devised efficient nearest neighbor algorithms for a handful of different definitions of similarity: the ordinary Euclidean distance between points, and a few other distance measures.
Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of its training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree.).
Hi there, is everything cool? Edited Nearest Neighbors Rule for undersampling involves using K 3 nearest neighbors to the data points that are misclassified and that are then removed before a K 1 classification rule is applied. This approach of resampling and classification was first proposed by Dennis Wilson in his 1972 paper titled "Asymptotic Properties of Nearest Neighbor Rules Using Edited Data." When used as an undersampling procedure, the rule can be applied to each example in the majority class, allowing those examples that are misclassified as belonging to the minority class to be removed and those correctly classified to remain. Let's see how can we apply the ENN And just like CNN, the ENN gives the best results when combined with another oversampling method like SMOTE.