Nearest Neighbor Methods
Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
Sharpnack, James, Krishnamurthy, Akshay, Singh, Aarti
The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovasz extended scan statistic (LESS) that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, k-nearest neighbor graphs, and epsilon-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds.
Convergence of Nearest Neighbor Pattern Classification with Selective Sampling
Joseph, Shaun N., Bakr, Seif Omar Abu, Lugo, Gabriel
In the panoply of pattern classification techniques, few enjoy the intuitive appeal and simplicity of the nearest neighbor rule: given a set of samples in some metric domain space whose value under some function is known, we estimate the function anywhere in the domain by giving the value of the nearest sample per the metric. More generally, one may use the modal value of the m nearest samples, where m is a fixed positive integer (although m=1 is known to be admissible in the sense that no larger value is asymptotically superior in terms of prediction error). The nearest neighbor rule is nonparametric and extremely general, requiring in principle only that the domain be a metric space. The classic paper on the technique, proving convergence under independent, identically-distributed (iid) sampling, is due to Cover and Hart (1967). Because taking samples is costly, there has been much research in recent years on selective sampling, in which each sample is selected from a pool of candidates ranked by a heuristic; the heuristic tries to guess which candidate would be the most "informative" sample. Lindenbaum et al. (2004) apply selective sampling to the nearest neighbor rule, but their approach sacrifices the austere generality of Cover and Hart; furthermore, their heuristic algorithm is complex and computationally expensive. Here we report recent results that enable selective sampling in the original Cover-Hart setting. Our results pose three selection heuristics and prove that their nearest neighbor rule predictions converge to the true pattern. Two of the algorithms are computationally cheap, with complexity growing linearly in the number of samples. We believe that these results constitute an important advance in the art.
A KNN Based Kalman Filter Gaussian Process Regression
Wang, Yali (Laval University) | Chaib-draa, Brahim (Laval University)
The standard Gaussian process (GP) regression is often intractable when a data set is large or spatially nonstationary. In this paper, we address these challenging data properties by designing a novel K nearest neighbor based Kalman filter Gaussian process (KNN-KFGP) regression. Based on a state space model established by the KNN driven data grouping, our KNN-KFGP recursively filters out the latent function values in a computationally efficient and accurate Kalman filtering framework. Moreover, KNN allows each test point to find its strongly correlated local training subset, so our KNN-KFGP provides a suitable way to deal with spatial nonstationary problems. We evaluate the performance of our KNN-KFGP on several synthetic and real data sets to show its validity.
Unlearning from Demonstration
Sullivan, Keith (George Mason University) | ElMolla, Ahmed (George Mason University) | Squires, Bill (George Mason University) | Luke, Sean (George Mason University)
When doing learning from demonstration, it is often the case that the demonstrator provides corrective examples to fix errant behavior by the agent or robot. We present a set of algorithms which use this corrective data to identify and remove noisy examples in datasets which caused errant classifications, and ultimately errant behavior. The objective is to actually modify the source datasets rather than solely rely on the noise-insensitivity of the classification algorithm. This is particularly useful in the sparse datasets often found in learning from demonstration experiments. Our approach tries to distinguish between noisy misclassification and mere undersampling of the learning space. If errors are a result of misclassification, we potentially remove the responsible points and update the classifier. We demonstrate our method on UCI Machine Learning datasets at different levels of sparsity and noise, using decision trees, K-Nearest-Neighbor, and support vector machines.
Supervised Metric Learning with Generalization Guarantees
The crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions, an area of research known as metric learning. When data consist of feature vectors, a large body of work has focused on learning a Mahalanobis distance. Less work has been devoted to metric learning from structured objects (such as strings or trees), most of it focusing on optimizing a notion of edit distance. We identify two important limitations of current metric learning approaches. First, they allow to improve the performance of local algorithms such as k-nearest neighbors, but metric learning for global algorithms (such as linear classifiers) has not been studied so far. Second, the question of the generalization ability of metric learning methods has been largely ignored. In this thesis, we propose theoretical and algorithmic contributions that address these limitations. Our first contribution is the derivation of a new kernel function built from learned edit probabilities. Our second contribution is a novel framework for learning string and tree edit similarities inspired by the recent theory of (e,g,t)-good similarity functions. Using uniform stability arguments, we establish theoretical guarantees for the learned similarity that give a bound on the generalization error of a linear classifier built from that similarity. In our third contribution, we extend these ideas to metric learning from feature vectors by proposing a bilinear similarity learning method that efficiently optimizes the (e,g,t)-goodness. Generalization guarantees are derived for our approach, highlighting that our method minimizes a tighter bound on the generalization error of the classifier. Our last contribution is a framework for establishing generalization bounds for a large class of existing metric learning algorithms based on a notion of algorithmic robustness.
Learning an Integrated Distance Metric for Comparing Structure of Complex Networks
Aliakbary, Sadegh, Motallebi, Sadegh, Habibi, Jafar, Movaghar, Ali
Graph comparison plays a major role in many network applications. We often need a similarity metric for comparing networks according to their structural properties. Various network features - such as degree distribution and clustering coefficient - provide measurements for comparing networks from different points of view, but a global and integrated distance metric is still missing. In this paper, we employ distance metric learning algorithms in order to construct an integrated distance metric for comparing structural properties of complex networks. According to natural witnesses of network similarities (such as network categories) the distance metric is learned by the means of a dataset of some labeled real networks. For evaluating our proposed method which is called NetDistance, we applied it as the distance metric in K-nearest-neighbors classification. Empirical results show that NetDistance outperforms previous methods, at least 20 percent, with respect to precision.
Model Predictive Control with Uncertainty in Human Driven Systems
Styler, Alexander David (Carnegie Mellon University) | Nourbakhsh, Illah Reza (Carnegie Mellon University)
Human driven systems present a unique optimization challenge for robot control. Generally, operators of these systems behave rationally given environmental factors and desired goals. However, information available to subsystem controllers is often incomplete, and the operator becomes more difficult to model without this input information. In this work we present a data-driven, nonparametric model to capture both expectation and uncertainty of the upcoming duty for a subsystem controller. This model is a modified k-nearest neighbor regressor used to generate weighted samples from a distribution of upcoming duty, which are then exploited to generate an optimal control. We test the model on a simulated heterogeneous energy pack manager in an Electric Vehicle operated by a human driver. For this domain, upcoming load on the energy pack strongly affects the optimal use and charging strategy of the pack. Given incomplete information, there is a natural uncertainty in upcoming duty due to traffic, destination, signage, and other factors. We test against a dataset of real driving data gathered from volunteers, and compare the results other models and the optimal upper bound.
Empirical Comparison of Multi-Label Classification Algorithms
Tawiah, Clifford (University of Central Arkansas) | Sheng, Victor (University of Central Arkansas)
Multi-label classifications exist in many real world applications. This paper empirically studies the performance of a variety of multi-label classification algorithms. Some of them are developed based on problem transformation. Some of them are developed based on adaption. Our experimental results show that the adaptive Multi-Label K-Nearest Neighbor performs the best, followed by Random k-Label Set, followed by Classifier Chain and Binary Relevance. Adaboost.MH performs the worst, followed by Pruned Problem Transformation. Our experimental results also provide us the confidence of the correlations among multi-labels. These insights shed light for future research directions on multi-label classifications.
SALL-E: Situated Agent for Language Learning
Perera, Ian (University of Rochester) | Allen, James F. (University of Rochester)
We describe ongoing research towards building a cognitively plausible system for near one-shot learning of the meanings of attribute words and object names, by grounding them in a sensory model. The system learns incrementally from human demonstrations recorded with the Microsoft Kinect, in which the demonstrator can use unrestricted natural language descriptions. We achieve near-one shot learning of simple objects and attributes by focusing solely on examples where the learning agent is confident, ignoring the rest of the data. We evaluate the system's learning ability by having it generate descriptions of presented objects, including objects it has never seen before, and comparing the system response against collected human descriptions of the same objects. We propose that our method of retrieving object examples with a k-nearest neighbor classifier using Mahalanobis distance corresponds to a cognitively plausible representation of objects. Our initial results show promise for achieving rapid, near one-shot, incremental learning of word meanings.
Walking on Minimax Paths for k-NN Search
Kim, Kye-Hyeon (Pohang University of Science and Technology) | Choi, Seungjin (Pohang University of Science and Technology)
Link-based dissimilarity measures, such as shortest path or Euclidean commute time distance, base their distance on paths between nodes of a weighted graph. These measures are known to be better suited to data manifold with nonconvex-shaped clusters, compared to Euclidean distance, so that k -nearest neighbor (NN) search is improved in such metric spaces. In this paper we present a new link-based dissimilarity measure based on minimax paths between nodes. Two main benefits of minimax path-based dissimilarity measure are: (1) only a subset of paths is considered to make it scalable, while Euclidean commute time distance considers all possible paths; (2) it better captures nonconvex-shaped cluster structure, compared to shortest path distance. We define the total cost assigned to a path between nodes as L p norm of intermediate costs of edges involving the path, showing that minimax path emerges from our L p norm over paths framework. We also define minimax distance as the intermediate cost of the longest edge on the minimax path, then present a greedy algorithm to compute k smallest minimax distances between a query and N data points in O(log N + k log k) time. Numerical experiments demonstrate that our minimax k-NN algorithm reduce the search time by several orders of magnitude, compared to existing methods, while the quality of k -NN search is significantly improved over Euclidean distance.