AITopics | Nearest Neighbor Methods

Nearest Neighbor Methods

Mitigating the Curse of Dimensionality for Exact kNN Retrieval

Schuh, Michael A. (Montana State University) | Wylie, Tim (University of Alberta) | Angryk, Rafal A. (Georgia State University)

AAAI Conferences, May-7-2014

Efficient data indexing and exact k-nearest-neighbor (kNN) retrieval are still challenging tasks in high-dimensional spaces. This work highlights the difficulties of indexing in high-dimensional and tightly-clustered dataspaces by exploring several important tunable parameters for optimizing kNN query performance using the iDistance and iDStar algorithms. We experiment on real and synthetic datasets of varying size, cluster density, and dimensionality, and compare performance primarily through filter-and-refine efficiency and execution time. Results show great variability over parameter values and provide new insights and justifications in support of prior best-use practices. Local segmentation with iDStar consistently outperforms iDistance in any clustered space below 256 dimensions, setting a new benchmark for efficient and exact kNN retrieval in high-dimensional spaces. We propose several directions of future work to further increase performance in high-dimensional real-world settings.

dimensionality, exact knn retrieval, mitigating

AAAI Conferences

The Twenty-Seventh International FLAIRS Conference

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Learning in High Dimensional Spaces (0.40)

Toward Building Automatic Affect Recognition Machine Using Acoustics Features

Marpaung, Andreas H. (University of Central Florida) | Gonzalez, Avelino (University of Central Florida)

AAAI Conferences, May-7-2014

Research in the field of Affective Computing on affect recognition through speech has used a “fishing expedition” approach. Although some frameworks could achieve certain success rates, many of these approaches missed the theory behind the underlying voice and speech production mechanism. In this work, we found some correlation among the acoustic parameters (paralinguistic/non-verbal speech content) in the physiological mechanism of voice production. Furthermore, we also found some correlation when analyzing their relationships statistically. Aligned with this finding, we implemented our framework using the K-Nearest Neighbors (KNN) algorithm. Although our work is still in its infancy, we believe this context-free approach will bring us forward toward creating an intelligent agent with affect recognition ability. This paper describes the problem, our approach and our results.

acoustic feature, recognition machine

AAAI Conferences

The Twenty-Seventh International FLAIRS Conference

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.53)

Fast Exact Search in Hamming Space with Multi-Index Hashing

Norouzi, Mohammad, Punjani, Ali, Fleet, David J.

arXiv.org Artificial Intelligence, Apr-24-2014

There is growing interest in representing image data and feature descriptors using compact binary codes for fast near neighbor search. Although binary codes are motivated by their use as direct indices (addresses) into a hash table, codes longer than 32 bits are not being used as such, as it was thought to be ineffective. We introduce a rigorous way to build multiple hash tables on binary code substrings that enables exact k-nearest neighbor search in Hamming space. The approach is storage efficient and straightforward to implement. Theoretical analysis shows that the algorithm exhibits sub-linear run-time behavior for uniformly distributed codes. Empirical results show dramatic speedups over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits.
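
The idea of indexing binary code substrings can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `MultiIndex`, `split`, `neighbors_within`, and `r_near` are hypothetical, the search is a fixed-radius query rather than full k-nearest-neighbor search, and every substring is probed with the same radius `r // m`. It relies on a pigeonhole argument: if two codes differ in at most r bits and each code is cut into m disjoint substrings, at least one pair of substrings differs in at most floor(r/m) bits.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two codes stored as Python ints."""
    return bin(a ^ b).count("1")


def split(code, bits, m):
    """Cut a `bits`-bit code into m disjoint, equal-width substrings."""
    width = bits // m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]


def neighbors_within(sub, width, radius):
    """Yield every width-bit value within `radius` bit flips of `sub`."""
    for r in range(radius + 1):
        for positions in combinations(range(width), r):
            flipped = sub
            for p in positions:
                flipped ^= 1 << p
            yield flipped


class MultiIndex:
    """One hash table per substring position (hypothetical sketch)."""

    def __init__(self, codes, bits, m):
        if bits % m != 0:
            raise ValueError("bits must be divisible by m in this sketch")
        self.codes, self.bits, self.m = list(codes), bits, m
        self.tables = [defaultdict(list) for _ in range(m)]
        for idx, code in enumerate(self.codes):
            for t, sub in enumerate(split(code, bits, m)):
                self.tables[t][sub].append(idx)

    def r_near(self, query, r):
        """Return indices of all codes within Hamming distance r of query."""
        width = self.bits // self.m
        sub_radius = r // self.m  # pigeonhole bound on one substring
        candidates = set()
        for t, sub in enumerate(split(query, self.bits, self.m)):
            for probe in neighbors_within(sub, width, sub_radius):
                candidates.update(self.tables[t].get(probe, ()))
        # Refine: keep only candidates that pass the full-code check.
        return sorted(i for i in candidates
                      if hamming(self.codes[i], query) <= r)
```

Because every true neighbor must collide with the query in at least one table, the candidate set never misses a result, and the final full-distance check removes false positives. A k-nearest-neighbor query could be built on top of this by growing r until k results are found.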

information retrieval, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

1307.2982

Country: North America (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.66)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)

Ensemble Committees for Stock Return Classification and Prediction

Brofos, James

arXiv.org Machine Learning, Apr-5-2014

This paper considers a portfolio trading strategy formulated by algorithms in the field of machine learning. The profitability of the strategy is measured by the algorithm's capability to consistently and accurately identify stock indices with positive or negative returns, and to generate a preferred portfolio allocation on the basis of a learned model. Stocks are characterized by time series data sets consisting of technical variables that reflect market conditions in a previous time interval, which are utilized to produce binary classification decisions in subsequent intervals. The learned model is constructed as a committee of random forest classifiers, a nonlinear support vector machine classifier, a relevance vector machine classifier, and a constituent ensemble of k-nearest neighbors classifiers. This selection of algorithms is appealing for two reasons: first, there is strikingly little research in economic time-series forecasting that employs learners beyond neural networks and clustering algorithms, and this construction offers a viable alternative; second, this selection incorporates an array of techniques that have both theoretically optimal classification properties and high empirical success rates in areas outside of finance, in addition to offering a mixture of parametric and nonparametric models. The ensemble committee is augmented by a boosting meta-algorithm and feature selection is performed by a supervised Relief algorithm. The Global Industry Classification Standard (GICS) is used to explore the ensemble model's efficacy within the context of various fields of investment including Energy, Materials, Financials, and Information Technology. Data from 2006 to 2012, inclusive, are considered, which are chosen for providing a range of market circumstances for evaluating the model. The model is observed to achieve an accuracy of approximately 70% when predicting stock price returns three months in advance.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

1404.1492

Genre: Research Report (0.50)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.70)

Embedding Graphs under Centrality Constraints for Network Visualization

Baingana, Brian, Giannakis, Georgios B.

arXiv.org Machine Learning, Jan-17-2014

In this case, the vertex dissimilarity structure is preserved through pairwise distance metrics between vertices. Principal component analysis (PCA) of the graph adjacency matrix is advocated in [3], leading to a spectral embedding whose vertices correspond to entries of the leading component vectors. The structure preserving embedding algorithm [4] solves a semidefinite program with linear topology constraints so that a nearest neighbor algorithm can recover the graph edges from the embedding. Visual analytics approaches developed in [7] and [12] emphasize community structures with applications to community browsing in graphs. Concentric graph layouts developed in [39] and [30] capture notions of node hierarchy by placing the highest ranked nodes at the center of the embedding. Although the graph embedding problem has been studied for years, development of fast and optimal visualization algorithms with hierarchical constraints is challenging and existing methods typically resort to heuristic approaches. The growing interest in analysis of very large networks has prioritized the need for effectively capturing hierarchy over aesthetic appeal in visualization. For instance, a hierarchy-aware visual analysis of a global computer network is naturally more useful to security experts trying to protect the most critical nodes from a viral infection. Layouts of metro-transit networks that clearly show terminals routing the bulk of traffic convey a better picture about the most critical nodes in the event of a terrorist attack.

constraint, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

1401.4408

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre: Research Report (0.64)

Industry:

Law Enforcement & Public Safety > Terrorism (0.54)
Information Technology (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.48)

Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Zhang, Yu

Neural Information Processing Systems, Dec-31-2013

All the existing multi-task local learning methods are defined on a homogeneous neighborhood, which consists of data points from only one task. In this paper, different from existing methods, we propose local learning methods for multi-task classification and regression problems based on a heterogeneous neighborhood, which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend kernel regression to the multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods.
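
A weighted vote over neighbors drawn from all tasks might look like the sketch below. It only illustrates the decision rule, under several assumptions: labels are binary in {-1, +1}, the task-to-task weight matrix `weights` is supplied by hand rather than learned with the regularized objective the abstract mentions, and the function name `multitask_knn_predict` and the data layout are hypothetical.

```python
import math


def multitask_knn_predict(query, query_task, data, weights, k):
    """Predict a +1/-1 label from the k nearest points across all tasks.

    data    : list of (features, task_id, label) with label in {-1, +1}
    weights : weights[s][t] is how much a neighbor from task t counts
              when classifying a point from task s (assumed given here)
    """
    # Heterogeneous neighborhood: neighbors are picked regardless of task.
    nearest = sorted(data, key=lambda d: math.dist(d[0], query))[:k]
    # Each neighbor's vote is scaled by the task-pair weight.
    vote = sum(weights[query_task][task] * label
               for _, task, label in nearest)
    return 1 if vote >= 0 else -1
```

With `weights = [[1.0, 0.2], [0.2, 1.0]]`, a single same-task neighbor can outvote two neighbors from the other task, which is the behavior task-specific weighting is meant to allow.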

artificial intelligence, coordinate descent method, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
North America > Canada > British Columbia (0.14)

Industry: Education (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.88)

Rapid Distance-Based Outlier Detection via Sampling

Sugiyama, Mahito, Borgwardt, Karsten

Neural Information Processing Systems, Dec-31-2013

Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.
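
The abstract does not spell out the sampling scheme, so the following is only one plausible reading, offered as a sketch: draw a small random sample once, then score every point by its distance to the nearest sampled point. The function name `sampling_outlier_scores`, the Euclidean metric, and the choice to skip a point's own entry in the sample are all assumptions made for illustration.

```python
import math
import random


def sampling_outlier_scores(points, sample_size, seed=0):
    """Score each point by its distance to the nearest sampled point.

    Higher scores suggest outliers. Cost is O(n * sample_size) distance
    computations instead of the O(n^2) of an exhaustive k-NN scan.
    """
    if not 2 <= sample_size <= len(points):
        raise ValueError("sample_size must be between 2 and len(points)")
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(points)), sample_size)
    scores = []
    for i, p in enumerate(points):
        # Assumption: a sampled point does not compare against itself.
        scores.append(min(math.dist(p, points[j])
                          for j in sample_idx if j != i))
    return scores
```

Ranking points by descending score then gives a candidate outlier list; the sample size trades accuracy against speed.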

artificial intelligence, data mining, machine learning, (16 more...)

Neural Information Processing Systems

Country: Europe > Germany (0.28)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.55)

Density estimation from unweighted k-nearest neighbor graphs: a roadmap

Luxburg, Ulrike Von, Alamgir, Morteza

Neural Information Processing Systems, Dec-31-2013

Consider an unweighted k-nearest neighbor graph on n points that have been sampled i.i.d. from some unknown density p on R^d. We prove how one can estimate the density p just from the unweighted adjacency matrix of the graph, without knowing the points themselves or their distance or similarity scores. The key insights are that local differences in link numbers can be used to estimate some local function of p, and that integrating this function along shortest paths leads to an estimate of the underlying density.

artificial intelligence, graph, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (1.00)

Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

Sharpnack, James, Krishnamurthy, Akshay, Singh, Aarti

arXiv.org Machine Learning, Dec-11-2013

The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovasz extended scan statistic (LESS) that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, k-nearest neighbor graphs, and epsilon-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1312.3291

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Epidemiology (0.54)
Energy > Power Industry (0.34)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Convergence of Nearest Neighbor Pattern Classification with Selective Sampling

Joseph, Shaun N., Bakr, Seif Omar Abu, Lugo, Gabriel

arXiv.org Machine Learning, Sep-6-2013

In the panoply of pattern classification techniques, few enjoy the intuitive appeal and simplicity of the nearest neighbor rule: given a set of samples in some metric domain space whose value under some function is known, we estimate the function anywhere in the domain by giving the value of the nearest sample per the metric. More generally, one may use the modal value of the m nearest samples, where m is a fixed positive integer (although m=1 is known to be admissible in the sense that no larger value is asymptotically superior in terms of prediction error). The nearest neighbor rule is nonparametric and extremely general, requiring in principle only that the domain be a metric space. The classic paper on the technique, proving convergence under independent, identically-distributed (iid) sampling, is due to Cover and Hart (1967). Because taking samples is costly, there has been much research in recent years on selective sampling, in which each sample is selected from a pool of candidates ranked by a heuristic; the heuristic tries to guess which candidate would be the most "informative" sample. Lindenbaum et al. (2004) apply selective sampling to the nearest neighbor rule, but their approach sacrifices the austere generality of Cover and Hart; furthermore, their heuristic algorithm is complex and computationally expensive. Here we report recent results that enable selective sampling in the original Cover-Hart setting. Our results pose three selection heuristics and prove that their nearest neighbor rule predictions converge to the true pattern. Two of the algorithms are computationally cheap, with complexity growing linearly in the number of samples. We believe that these results constitute an important advance in the art.
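
The nearest neighbor rule described above, including the modal value of the m nearest samples, can be sketched in a few lines. This is a brute-force illustration only and does not implement any of the selective sampling heuristics; the function name `nn_rule` and the default Euclidean metric are assumptions, and any metric on the domain could be passed instead.

```python
import math
from collections import Counter


def nn_rule(samples, query, m=1, metric=math.dist):
    """Estimate the value at `query` from labeled samples.

    samples : list of (point, value) pairs with known values
    m       : number of nearest samples to consult (m=1 is the plain rule)
    """
    # Sort by distance to the query and keep the m closest samples.
    nearest = sorted(samples, key=lambda s: metric(s[0], query))[:m]
    # Modal value; ties go to the value seen first, i.e. the nearer sample.
    return Counter(value for _, value in nearest).most_common(1)[0][0]
```

For example, with samples `[((0, 0), "a"), ((1, 0), "a"), ((5, 5), "b")]`, a query at `(4, 4)` is assigned `"b"` when m=1 and `"a"` when m=3, since two of the three consulted samples carry `"a"`.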

nearest neighbor rule, neighbor, voronoi neighbor, (13 more...)

arXiv.org Machine Learning

1309.1761

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Massachusetts > Worcester County > Fitchburg (0.04)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.48)
