Nearest Neighbor Methods
SALL-E: Situated Agent for Language Learning
Perera, Ian (University of Rochester) | Allen, James F. (University of Rochester)
We describe ongoing research towards building a cognitively plausible system for near one-shot learning of the meanings of attribute words and object names, by grounding them in a sensory model. The system learns incrementally from human demonstrations recorded with the Microsoft Kinect, in which the demonstrator can use unrestricted natural language descriptions. We achieve near one-shot learning of simple objects and attributes by focusing solely on examples where the learning agent is confident, ignoring the rest of the data. We evaluate the system's learning ability by having it generate descriptions of presented objects, including objects it has never seen before, and comparing the system response against collected human descriptions of the same objects. We propose that our method of retrieving object examples with a k-nearest neighbor classifier using Mahalanobis distance corresponds to a cognitively plausible representation of objects. Our initial results show promise for achieving rapid, near one-shot, incremental learning of word meanings.
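The abstract above does not include code; as a minimal sketch of the retrieval step it describes, the following Python snippet performs k-nearest-neighbor lookup under a Mahalanobis distance. The toy feature vectors, labels, and the use of the training-set covariance are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mahalanobis_knn(query, examples, labels, k=3):
    """Retrieve the k training examples closest to `query` under the
    Mahalanobis distance induced by the training-set covariance."""
    cov = np.cov(examples, rowvar=False)
    vi = np.linalg.pinv(cov)                  # inverse covariance (pseudo-inverse for stability)
    diffs = examples - query                  # (n, d) differences to the query
    d2 = np.einsum('nd,de,ne->n', diffs, vi, diffs)  # squared Mahalanobis distances
    nearest = np.argsort(d2)[:k]
    return labels[nearest], d2[nearest]

# Toy usage: color/size features for a few labeled objects (illustrative only).
X = np.array([[0.9, 0.1, 2.0], [0.8, 0.2, 2.2], [0.1, 0.9, 5.0], [0.2, 0.8, 4.8]])
y = np.array(['red ball', 'red ball', 'blue box', 'blue box'])
print(mahalanobis_knn(np.array([0.85, 0.15, 2.1]), X, y, k=2))
```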
Efficient Estimation of the Number of Neighbours in Probabilistic K-Nearest Neighbour Classification
Probabilistic k-nearest neighbour (PKNN) classification has been introduced to improve the performance of the original k-nearest neighbour (KNN) classification algorithm by explicitly modelling uncertainty in the classification of each feature vector. However, an issue common to both KNN and PKNN is the selection of the optimal number of neighbours, $k$. The contribution of this paper is to incorporate the uncertainty in $k$ into the decision making, and in so doing use Bayesian model averaging to provide improved classification. Indeed, the problem of assessing the uncertainty in $k$ can be viewed as one of statistical model selection, which is one of the most important technical issues in the statistics and machine learning domain. In this paper, a new functional approximation algorithm is proposed to reconstruct the density of the model (order) without relying on time-consuming Monte Carlo simulations. In addition, this algorithm avoids cross validation by adopting a Bayesian framework. The algorithm yielded very good performance on several real experimental datasets.
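A minimal sketch of the model-averaging idea described above: k-NN class probabilities are averaged over several values of k, weighted by a stand-in for the posterior over k. The uniform pseudo-posterior and the Euclidean distance are assumptions for illustration; the paper's functional approximation of the posterior is not reproduced here.

```python
import numpy as np
from collections import Counter

def bma_knn_predict(query, X, y, k_values, k_weights):
    """Average k-NN class-probability estimates over several values of k.
    `k_weights` stands in for the posterior p(k | data); here it must simply
    sum to one (the paper derives it without Monte Carlo; we just assume it)."""
    dists = np.linalg.norm(X - query, axis=1)
    order = np.argsort(dists)
    classes = np.unique(y)
    probs = np.zeros(len(classes))
    for k, w in zip(k_values, k_weights):
        votes = Counter(y[order[:k]])
        probs += w * np.array([votes.get(c, 0) / k for c in classes])
    return classes[np.argmax(probs)], dict(zip(classes, probs))

# Illustrative call with a uniform pseudo-posterior over k in {1, 3, 5}.
X = np.random.RandomState(0).randn(30, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 'A', 'B')
print(bma_knn_predict(np.array([0.5, 0.5]), X, y, [1, 3, 5], [1/3, 1/3, 1/3]))
```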
K-Nearest Neighbour algorithm coupled with logistic regression in medical case-based reasoning systems. Application to prediction of access to the renal transplant waiting list in Brittany
Campillo-Gimenez, Boris, Jouini, Wassim, Bayat, Sahar, Cuggia, Marc
Introduction. Case Based Reasoning (CBR) is an emerging decision making paradigm in medical research where new cases are solved relying on previously solved similar cases. Usually, a database of solved cases is provided, and every case is described through a set of attributes (inputs) and a label (output). Extracting useful information from this database can help the CBR system provide more reliable results on the yet to be solved cases. Objective. For that purpose we suggest a general framework where a CBR system, viz. the K-Nearest Neighbor (K-NN) algorithm, is combined with various information obtained from a Logistic Regression (LR) model. Methods. LR is applied, on the case database, to assign weights to the attributes as well as the solved cases. Thus, five possible decision making systems based on K-NN and/or LR were identified: a standalone K-NN, a standalone LR, and three soft K-NN algorithms that rely on weights derived from the results of the LR. The evaluation of the described approaches is performed in the field of access to the renal transplant waiting list. Results and conclusion. The results show that our suggested approach, where the K-NN algorithm relies on both weighted attributes and cases, can efficiently deal with non-relevant attributes, whereas the four other approaches suffer from this kind of noisy setup. The robustness of this approach suggests interesting perspectives for medical problem solving tools using the CBR methodology.
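One plausible reading of the weighting scheme, sketched in Python with scikit-learn: absolute logistic-regression coefficients weight the attributes in the distance, and the LR confidence on each solved case weights its vote. The data, the choice of k, and the exact form of the weights are assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_weighted_knn(query, X, y, k=5):
    """k-NN vote where attributes are weighted by |LR coefficients| and each
    retrieved case is weighted by how confidently LR classifies it
    (one plausible reading of the weighting scheme, not the paper's exact one)."""
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    attr_w = np.abs(lr.coef_[0])                            # attribute relevance weights
    case_w = lr.predict_proba(X).max(axis=1)                # confidence weight per solved case
    d = np.sqrt(((X - query) ** 2 * attr_w).sum(axis=1))    # attribute-weighted Euclidean distance
    nearest = np.argsort(d)[:k]
    score = {}
    for i in nearest:
        score[y[i]] = score.get(y[i], 0.0) + case_w[i]
    return max(score, key=score.get)

# Illustrative data: the second attribute is pure noise and receives a small weight.
rng = np.random.RandomState(1)
X = np.c_[rng.randn(100), rng.randn(100)]
y = (X[:, 0] > 0).astype(int)
print(lr_weighted_knn(np.array([0.8, -2.0]), X, y, k=7))
```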
An improvement to k-nearest neighbor classifier
Sarma, T. Hitendra, Viswanath, P., Reddy, D. Sai Koti, Raghava, S. Sri
The k-nearest neighbor classifier (k-NNC) is simple to use and requires little design time (essentially only the choice of k), hence it is well suited to dynamically varying data-sets. There exist some fundamental improvements over the basic k-NNC, such as the weighted k-nearest neighbors classifier (where weights are assigned to nearest neighbors by linear interpolation) and the use of an artificially generated training set called a bootstrapped training set. These improvements are orthogonal to space reduction and classification time reduction techniques, hence can be coupled with any of them. This paper proposes another improvement to the basic k-NNC in which the weights given to nearest neighbors follow a Gaussian distribution (instead of the linear interpolation used in the weighted k-NNC), and which is likewise independent of any space reduction and classification time reduction technique. We formally show that the proposed method is closely related to non-parametric density estimation using a Gaussian kernel. We experimentally demonstrate on various standard data-sets that the proposed method outperforms the existing ones in most cases.
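A minimal sketch of the Gaussian weighting described above: each of the k nearest neighbors votes with weight exp(-d^2 / (2*sigma^2)). The bandwidth sigma and the toy data are assumed for illustration.

```python
import numpy as np

def gaussian_weighted_knn(query, X, y, k=5, sigma=1.0):
    """k-NN where each neighbor votes with weight exp(-d^2 / (2*sigma^2)),
    mirroring the Gaussian-kernel weighting described above (sigma is an
    assumed bandwidth, not a value from the paper)."""
    d = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(d)[:k]
    w = np.exp(-d[nearest] ** 2 / (2.0 * sigma ** 2))
    votes = {}
    for label, weight in zip(y[nearest], w):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Toy check: points of class 1 sit closer to the query and dominate the vote.
X = np.array([[0.1, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [0.0, 0.2]])
y = np.array([1, 1, 0, 0, 1])
print(gaussian_weighted_knn(np.array([0.0, 0.0]), X, y, k=3))
```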
Combining Feature and Prototype Pruning by Uncertainty Minimization
We focus in this paper on dataset reduction techniques for use in k-nearest neighbor classification. In such a context, feature and prototype selections have always been treated independently by the standard storage reduction algorithms. While treating them separately is theoretically justified by the fact that each subproblem is NP-hard, we argue in this paper that a joint storage reduction is in fact more intuitive and can in practice provide better results than two independent processes. Moreover, it avoids a lot of distance calculations by progressively removing useless instances during the feature pruning. While standard selection algorithms often optimize the accuracy to discriminate the set of solutions, we use in this paper a criterion based on an uncertainty measure within a nearest-neighbor graph. This choice comes from recent results which have shown that accuracy is not always the most suitable criterion to optimize. In our approach, a feature or an instance is removed if its deletion improves the information of the graph. Numerous experiments are presented in this paper, and a statistical analysis shows the relevance of our approach and its tolerance in the presence of noise.
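A rough sketch of the uncertainty-driven pruning idea, under the assumption that the graph uncertainty can be approximated by the mean entropy of class labels in each point's k-neighborhood; the paper's exact criterion and its joint feature/prototype procedure are not reproduced here.

```python
import numpy as np
from scipy.stats import entropy

def graph_uncertainty(X, y, k=3):
    """Mean entropy of the class labels found among each point's k nearest
    neighbors; a simple stand-in for the graph-based uncertainty criterion."""
    classes = np.unique(y)
    ent = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                              # exclude the point itself
        nn = y[np.argsort(d)[:k]]
        p = np.array([(nn == c).mean() for c in classes])
        ent.append(entropy(p + 1e-12))
    return float(np.mean(ent))

def prune_features(X, y, k=3):
    """Greedy backward elimination: drop a feature while doing so does not
    increase the neighborhood uncertainty (an illustrative sketch, not the
    paper's joint feature/prototype algorithm)."""
    keep = list(range(X.shape[1]))
    best = graph_uncertainty(X[:, keep], y, k)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for j in list(keep):
            trial = [f for f in keep if f != j]
            u = graph_uncertainty(X[:, trial], y, k)
            if u <= best:
                keep, best, improved = trial, u, True
                break
    return keep

# Illustrative run: the last column carries the class, the first two are noise.
rng = np.random.RandomState(0)
y = np.arange(60) % 2
X = np.c_[rng.randn(60), rng.randn(60), y]
print(prune_features(X, y))
```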
Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments
Popescul, Alexandrin, Ungar, Lyle H., Pennock, David M., Lawrence, Steve
Recommender systems leverage product and community information to target products to consumers. Researchers have developed collaborative recommenders, content-based recommenders, and (largely ad-hoc) hybrid systems. We propose a unified probabilistic framework for merging collaborative and content-based recommendations. We extend Hofmann's [1999] aspect model to incorporate three-way co-occurrence data among users, items, and item content. The relative influence of collaboration data versus content data is not imposed as an exogenous parameter, but rather emerges naturally from the given data sources. Global probabilistic models coupled with standard Expectation Maximization (EM) learning algorithms tend to drastically overfit in sparse-data situations, as is typical in recommendation applications. We show that secondary content information can often be used to overcome sparsity. Experiments on data from the ResearchIndex library of Computer Science publications show that appropriate mixture models incorporating secondary data produce significantly better quality recommenders than k-nearest neighbors (k-NN). Global probabilistic models also allow more general inferences than local methods like k-NN.
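A bare-bones sketch of the kind of model the abstract describes: EM for a three-way aspect model over (user, item, word) co-occurrence counts. The toy triples, the number of latent classes, and the absence of smoothing or tempering are all simplifying assumptions.

```python
import numpy as np

def em_aspect_model(triples, n_users, n_items, n_words, n_z=2, iters=50, seed=0):
    """EM for a three-way aspect model p(u,i,w) = sum_z p(z)p(u|z)p(i|z)p(w|z),
    fit to (user, item, word, count) co-occurrence triples. A bare-bones sketch
    of the model family described above, without the smoothing or tempering a
    real recommender would need."""
    rng = np.random.RandomState(seed)
    p_z = np.full(n_z, 1.0 / n_z)
    p_u = rng.dirichlet(np.ones(n_users), n_z)   # p(u|z), one row per latent class
    p_i = rng.dirichlet(np.ones(n_items), n_z)
    p_w = rng.dirichlet(np.ones(n_words), n_z)
    for _ in range(iters):
        nz = np.zeros(n_z); nu = np.zeros((n_z, n_users))
        ni = np.zeros((n_z, n_items)); nw = np.zeros((n_z, n_words))
        for u, i, w, c in triples:
            q = p_z * p_u[:, u] * p_i[:, i] * p_w[:, w]   # E-step: p(z|u,i,w) up to a constant
            q = c * q / q.sum()
            nz += q; nu[:, u] += q; ni[:, i] += q; nw[:, w] += q
        p_z = nz / nz.sum()                               # M-step: renormalize expected counts
        p_u = nu / nu.sum(axis=1, keepdims=True)
        p_i = ni / ni.sum(axis=1, keepdims=True)
        p_w = nw / nw.sum(axis=1, keepdims=True)
    return p_z, p_u, p_i, p_w

# Tiny illustrative corpus: (user, item, word, count) triples.
triples = [(0, 0, 0, 3), (0, 1, 1, 2), (1, 0, 0, 1), (1, 2, 2, 4)]
p_z, p_u, p_i, p_w = em_aspect_model(triples, n_users=2, n_items=3, n_words=3)
print(p_z)
```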
Distance Metric Learning for Kernel Machines
Xu, Zhixiang, Weinberger, Kilian Q., Chapelle, Olivier
Recent work in metric learning has significantly improved the state-of-the-art in k-nearest neighbor classification. Support vector machines (SVM), particularly with RBF kernels, are amongst the most popular classification algorithms that use distance metrics to compare examples. This paper provides an empirical analysis of the efficacy of three of the most popular Mahalanobis metric learning algorithms as pre-processing for SVM training. We show that none of these algorithms generates metrics that lead to particularly satisfying improvements for SVM-RBF classification. As a remedy we introduce support vector metric learning (SVML), a novel algorithm that seamlessly combines the learning of a Mahalanobis metric with the training of the RBF-SVM parameters. We demonstrate the capabilities of SVML on nine benchmark data sets of varying sizes and difficulties. In our study, SVML outperforms all alternative state-of-the-art metric learning algorithms in terms of accuracy and establishes itself as a serious alternative to the standard Euclidean metric with model selection by cross validation.
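The core idea, evaluating the RBF kernel under a Mahalanobis metric, can be sketched by transforming the inputs with a matrix L and fitting a standard RBF-SVM, since k(x, x') = exp(-gamma * ||Lx - Lx'||^2). Here L is a simple whitening transform chosen for illustration, whereas SVML learns L jointly with the SVM parameters, which this sketch does not attempt.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stand in for the paper's benchmarks.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L = cov^(-1/2): a simple Mahalanobis (whitening) transform, not a learned metric.
cov = np.cov(X_tr, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
L = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T

# Fitting an RBF-SVM on L-transformed data is equivalent to using the
# Mahalanobis-metric RBF kernel on the original data.
svm = SVC(kernel='rbf', gamma='scale').fit(X_tr @ L, y_tr)
print('accuracy with Mahalanobis-transformed RBF-SVM:', svm.score(X_te @ L, y_te))
```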
Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification
Noh, Yung-kyun, Park, Frank, Lee, Daniel D.
We show that conventional k-nearest neighbor classification can be viewed as a special case of the diffusion decision model in the asymptotic regime. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in k-nearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria.
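A hedged sketch of the sequential-acquisition idea: neighbors are examined in order of increasing distance and acquisition stops once the accumulated vote evidence crosses a threshold, loosely in the spirit of an SPRT stopping rule. The evidence weight and threshold are illustrative choices, not any of the paper's five criteria.

```python
import numpy as np

def sprt_knn_predict(query, X, y, max_k=25, threshold=3.0):
    """Acquire nearest neighbors one at a time and stop as soon as the
    accumulated vote evidence crosses a threshold, in the spirit of an SPRT
    stopping rule (threshold and evidence weight are illustrative assumptions)."""
    order = np.argsort(np.linalg.norm(X - query, axis=1))
    evidence = 0.0
    for t, idx in enumerate(order[:max_k], start=1):
        evidence += 1.0 if y[idx] == 1 else -1.0      # +1 / -1 vote from each new neighbor
        if abs(evidence) >= threshold:                 # enough evidence: stop early
            return int(evidence > 0), t
    return int(evidence > 0), t                        # fall back to the max_k decision

# Illustrative run: a query deep inside class 1 needs only a few neighbors.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])
y = np.array([0] * 50 + [1] * 50)
print(sprt_knn_predict(np.array([2.5, 2.5]), X, y))
```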
Comparing K-Nearest Neighbors and the Potential Energy Method in Classification Problems: A Case Study Using the KNN Applet by E.M. Mirkes and Real-Life Benchmark Data Sets
The k-nearest neighbors (KNN) method is used in many supervised learning classification problems. The Potential Energy (PE) method has also been developed for classification problems, based on a physical metaphor. The potentials used in the experiments are the Yukawa potential and the Gaussian potential. In this paper, I use both the applet and a MATLAB program with real-life benchmark data to analyze the performance of the KNN and PE methods in classification problems. The results show that, in general, the KNN and PE methods have similar performance. In particular, PE with the Yukawa potential performs worse than KNN when the data density is higher in the distribution of the database. When the Gaussian potential is applied, the results from PE and KNN behave similarly. The indicators used are correlation coefficients and information gain.
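A minimal sketch of the potential-energy classifier with the two potentials named above: the query is assigned to the class whose training points exert the largest total attraction. The parameter values and the decision rule are assumptions for illustration.

```python
import numpy as np

def potential_energy_classify(query, X, y, potential='gaussian', alpha=1.0, sigma=1.0):
    """Classify by summing a potential exerted on the query by every training
    point of each class and picking the most attractive class (parameter values
    and the 'largest total attraction wins' rule are illustrative assumptions)."""
    r = np.linalg.norm(X - query, axis=1)
    r = np.maximum(r, 1e-9)                       # avoid the Yukawa singularity at r = 0
    if potential == 'yukawa':
        v = np.exp(-alpha * r) / r                # Yukawa (screened Coulomb) potential
    else:
        v = np.exp(-r ** 2 / (2.0 * sigma ** 2))  # Gaussian potential
    return max(np.unique(y), key=lambda c: v[y == c].sum())

# Toy usage: the query sits inside the cluster of class 'a'.
X = np.array([[0.0, 0.0], [0.3, 0.2], [3.0, 3.0], [3.2, 2.9]])
y = np.array(['a', 'a', 'b', 'b'])
print(potential_energy_classify(np.array([0.1, 0.1]), X, y, potential='yukawa'))
```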
Universally Consistent Latent Position Estimation and Vertex Classification for Random Dot Product Graphs
Sussman, Daniel L., Tang, Minh, Priebe, Carey E.
In this work we show that, using the eigen-decomposition of the adjacency matrix, we can consistently estimate latent positions for random dot product graphs provided the latent positions are i.i.d. from some distribution. If class labels are observed for a number of vertices tending to infinity, then we show that the remaining vertices can be classified with error converging to Bayes optimal using the $k$-nearest-neighbors classification rule. We evaluate the proposed methods on simulated data and a graph derived from Wikipedia.
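A short sketch of the pipeline described above, assuming a simple two-block graph: embed the vertices with the scaled top eigenvectors of the adjacency matrix, then classify the unlabeled vertices with k-nearest neighbors. Block sizes, edge probabilities, the embedding dimension d, and k are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def adjacency_spectral_embedding(A, d):
    """Embed vertices using the top-d eigenpairs of the adjacency matrix,
    scaling eigenvectors by sqrt(|eigenvalue|); the usual estimator of latent
    positions for a random dot product graph."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(evals))[::-1][:d]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

# Two-block toy graph (illustrative block sizes and edge probabilities).
rng = np.random.RandomState(0)
n = 100
labels = np.array([0] * (n // 2) + [1] * (n // 2))
P = np.where(labels[:, None] == labels[None, :], 0.5, 0.1)
A = (rng.rand(n, n) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                    # symmetric, hollow adjacency matrix

Z = adjacency_spectral_embedding(A, d=2)
train = rng.rand(n) < 0.7                         # vertices with observed labels
knn = KNeighborsClassifier(n_neighbors=5).fit(Z[train], labels[train])
print('held-out accuracy:', knn.score(Z[~train], labels[~train]))
```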