Goto

Collaborating Authors

 Nearest Neighbor Methods


Finite-Sample Analysis of Fixed-k Nearest Neighbor Density Functional Estimators

Neural Information Processing Systems

We provide finite-sample analysis of a general framework for using k-nearest neighbor statistics to estimate functionals of a nonparametric continuous probability density, including entropies and divergences. Rather than plugging a consistent density estimate (which requires k as the sample size n) into the functional of interest, the estimators we consider fix k and perform a bias correction. This is more efficient computationally, and, as we show in certain cases, statistically, leading to faster convergence rates. Our framework unifies several previous estimators, for most of which ours are the first finite sample guarantees.


Kfir Y. Levy The Voleon Group

Neural Information Processing Systems

The weighted k-nearest neighbors algorithm is one of the most fundamental nonparametric methods in pattern recognition and machine learning. The question of setting the optimal number of neighbors as well as the optimal weights has received much attention throughout the years, nevertheless this problem seems to have remained unsettled. In this paper we offer a simple approach to locally weighted regression/classification, where we make the bias-variance tradeoff explicit. Our formulation enables us to phrase a notion of optimal weights, and to efficiently find these weights as well as the optimal number of neighbors efficiently and adaptively, for each data point whose value we wish to estimate. The applicability of our approach is demonstrated on several datasets, showing superior performance over standard locally weighted methods.


Active Nearest-Neighbor Learning in Metric Spaces

Neural Information Processing Systems

We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is significantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efficient active model-selection procedure.


'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification

arXiv.org Artificial Intelligence

Predictive models in natural language processing (NLP) have evolved from training models from scratch to fine-tuning pre-trained models with labelled data. An extreme form of this fine-tuning involves in-context learning (ICL), where the output of a pre-trained generative model (frozen decoder parameters) is controlled only with variations in the input strings (called instructions or prompts). An important component of ICL is the use of a small number of labelled data instances as examples in the prompt. While existing work uses a static number of examples during inference for each data instance, in this paper we propose a novel methodology of dynamically adapting the number of examples as per the data. This is analogous to the use of a variable-sized neighborhood in k-nearest neighbors (k-NN) classifier. In our proposed workflow of adaptive ICL (AICL), the number of demonstrations to employ during the inference on a particular data instance is predicted by the Softmax posteriors of a classifier. The parameters of this classifier are fitted on the optimal number of examples in ICL required to correctly infer the label of each instance in the training set with the hypothesis that a test instance that is similar to a training instance should use the same (or a closely matching) number of few-shot examples. Our experiments show that our AICL method results in improvement in text classification task on several standard datasets.


Securing Transactions: A Hybrid Dependable Ensemble Machine Learning Model using IHT-LR and Grid Search

arXiv.org Artificial Intelligence

Financial institutions and businesses face an ongoing challenge from fraudulent transactions, prompting the need for effective detection methods. Detecting credit card fraud is crucial for identifying and preventing unauthorized transactions.Timely detection of fraud enables investigators to take swift actions to mitigate further losses. However, the investigation process is often time-consuming, limiting the number of alerts that can be thoroughly examined each day. Therefore, the primary objective of a fraud detection model is to provide accurate alerts while minimizing false alarms and missed fraud cases. In this paper, we introduce a state-of-the-art hybrid ensemble (ENS) dependable Machine learning (ML) model that intelligently combines multiple algorithms with proper weighted optimization using Grid search, including Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Multilayer Perceptron (MLP), to enhance fraud identification. To address the data imbalance issue, we employ the Instant Hardness Threshold (IHT) technique in conjunction with Logistic Regression (LR), surpassing conventional approaches. Our experiments are conducted on a publicly available credit card dataset comprising 284,807 transactions. The proposed model achieves impressive accuracy rates of 99.66%, 99.73%, 98.56%, and 99.79%, and a perfect 100% for the DT, RF, KNN, MLP and ENS models, respectively. The hybrid ensemble model outperforms existing works, establishing a new benchmark for detecting fraudulent transactions in high-frequency scenarios. The results highlight the effectiveness and reliability of our approach, demonstrating superior performance metrics and showcasing its exceptional potential for real-world fraud detection applications.


Towards MLOps: A DevOps Tools Recommender System for Machine Learning System

arXiv.org Artificial Intelligence

Applying DevOps practices to machine learning system is termed as MLOps and machine learning systems evolve on new data unlike traditional systems on requirements. The objective of MLOps is to establish a connection between different open-source tools to construct a pipeline that can automatically perform steps to construct a dataset, train the machine learning model and deploy the model to the production as well as store different versions of model and dataset. Benefits of MLOps is to make sure the fast delivery of the new trained models to the production to have accurate results. Furthermore, MLOps practice impacts the overall quality of the software products and is completely dependent on open-source tools and selection of relevant open-source tools is considered as challenged while a generalized method to select an appropriate open-source tools is desirable. In this paper, we present a framework for recommendation system that processes the contextual information (e.g., nature of data, type of the data) of the machine learning project and recommends a relevant toolchain (tech-stack) for the operationalization of machine learning systems. To check the applicability of the proposed framework, four different approaches i.e., rule-based, random forest, decision trees and k-nearest neighbors were investigated where precision, recall and f-score is measured, the random forest out classed other approaches with highest f-score value of 0.66.


Generating Diverse Translation with Perturbed kNN-MT

arXiv.org Artificial Intelligence

Generating multiple translation candidates would enable users to choose the one that satisfies their needs. Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem -- the model underestimates a prediction that is largely different from the training data, even if that prediction is likely. This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT). Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem. Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation's magnitude.


Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

arXiv.org Artificial Intelligence

In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm which uses a large autoencoder trained to map the large label space to a reduced size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal in a large portion of the MEDLINE biomedical document collection which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments we propose and evaluate several document representation approaches and different label autoencoder configurations.


Modeling Freight Mode Choice Using Machine Learning Classifiers: A Comparative Study Using the Commodity Flow Survey (CFS) Data

arXiv.org Artificial Intelligence

This study explores the usefulness of machine learning classifiers for modeling freight mode choice. We investigate eight commonly used machine learning classifiers, namely Naive Bayes, Support Vector Machine, Artificial Neural Network, K-Nearest Neighbors, Classification and Regression Tree, Random Forest, Boosting and Bagging, along with the classical Multinomial Logit model. US 2012 Commodity Flow Survey data are used as the primary data source; we augment it with spatial attributes from secondary data sources. The performance of the classifiers is compared based on prediction accuracy results. The current research also examines the role of sample size and training-testing data split ratios on the predictive ability of the various approaches. In addition, the importance of variables is estimated to determine how the variables influence freight mode choice. The results show that the tree-based ensemble classifiers perform the best. Specifically, Random Forest produces the most accurate predictions, closely followed by Boosting and Bagging. With regard to variable importance, shipment characteristics, such as shipment distance, industry classification of the shipper and shipment size, are the most significant factors for freight mode choice decisions.


Incorporating Crowdsourced Annotator Distributions into Ensemble Modeling to Improve Classification Trustworthiness for Ancient Greek Papyri

arXiv.org Artificial Intelligence

Performing classification on noisy, crowdsourced image datasets can prove challenging even for the best neural networks. Two issues which complicate the problem on such datasets are class imbalance and ground-truth uncertainty in labeling. The AL-ALL and AL-PUB datasets - consisting of tightly cropped, individual characters from images of ancient Greek papyri - are strongly affected by both issues. The application of ensemble modeling to such datasets can help identify images where the ground-truth is questionable and quantify the trustworthiness of those samples. As such, we apply stacked generalization consisting of nearly identical ResNets with different loss functions: one utilizing sparse cross-entropy (CXE) and the other Kullback-Liebler Divergence (KLD). Both networks use labels drawn from a crowd-sourced consensus. This consensus is derived from a Normalized Distribution of Annotations (NDA) based on all annotations for a given character in the dataset. For the second network, the KLD is calculated with respect to the NDA. For our ensemble model, we apply a k-nearest neighbors model to the outputs of the CXE and KLD networks. Individually, the ResNet models have approximately 93% accuracy, while the ensemble model achieves an accuracy of > 95%, increasing the classification trustworthiness. We also perform an analysis of the Shannon entropy of the various models' output distributions to measure classification uncertainty. Our results suggest that entropy is useful for predicting model misclassifications.