AITopics | Nearest Neighbor Methods

Collaborating Authors

Nearest Neighbor Methods

News Overviews Instructional Materials AI-Alerts Classics

CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion

Zeng, Xianzhi, Wu, Zhuoyan, Hu, Xinjing, Shi, Xuanhua, Sun, Shixuan, Zhang, Shuhao

arXiv.org Artificial IntelligenceJun-28-2024

Approximate K Nearest Neighbor (AKNN) algorithms play a pivotal role in various AI applications, including information retrieval, computer vision, and natural language processing. Although numerous AKNN algorithms and benchmarks have been developed recently to evaluate their effectiveness, the dynamic nature of realworld data presents significant challenges that existing benchmarks fail to address. Traditional benchmarks primarily assess retrieval effectiveness in static contexts and often overlook update efficiency, which is crucial for handling continuous data ingestion. This limitation results in an incomplete assessment of an AKNN algorithm's ability to adapt to changing data patterns, thereby restricting insights into their performance in dynamic environments. To address these gaps, we introduce CANDY, a benchmark tailored for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion. CANDY comprehensively assesses a wide range of AKNN algorithms, integrating advanced optimizations such as machine learning-driven inference to supplant traditional heuristic scans, and improved distance computation methods to reduce computational overhead. Our extensive evaluations across diverse datasets demonstrate that simpler AKNN baselines often surpass more complex alternatives in terms of recall and latency. These findings challenge established beliefs about the necessity of algorithmic complexity for high performance. Furthermore, our results underscore existing challenges and illuminate future research opportunities.

information retrieval, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.19651

Country: Asia (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.69)

Add feedback

Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches

Tang, Dandan, Tong, Xin

arXiv.org Machine LearningJun-19-2024

Missing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.

artificial intelligence, machine learning, png, (8 more...)

arXiv.org Machine Learning

2406.13814

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.53)

Add feedback

CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning

Cai, Huaiguang

arXiv.org Artificial IntelligenceJun-18-2024

Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, advances this understanding by assessing the contribution of each datum to model accuracy. However, the resource-intensive and time-consuming nature of multiple model retraining poses challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (Conduct of Hardness and Gradient) score, which approximates the utility of each data subset on model accuracy during a single model training. By deriving the closed-form expression of the Shapley value for each data point under the CHG score utility function, we reduce the computational complexity to the equivalent of a single model retraining, an exponential improvement over existing methods. Additionally, we employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data.

artificial intelligence, machine learning, shapley value, (14 more...)

arXiv.org Artificial Intelligence

2406.1173

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Neural Dynamic Data Valuation

Liang, Zhangyong, Gao, Huanhuan, Zhang, Ji

arXiv.org Machine LearningJun-12-2024

Data constitute the foundational component of the data economy and its marketplaces. Efficient and fair data valuation has emerged as a topic of significant interest.\ Many approaches based on marginal contribution have shown promising results in various downstream tasks. However, they are well known to be computationally expensive as they require training a large number of utility functions, which are used to evaluate the usefulness or value of a given dataset for a specific purpose. As a result, it has been recognized as infeasible to apply these methods to a data marketplace involving large-scale datasets. Consequently, a critical issue arises: how can the re-training of the utility function be avoided? To address this issue, we propose a novel data valuation method from the perspective of optimal control, named the neural dynamic data valuation (NDDV). Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state. In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states. Notably, our method requires only training once to estimate the value of all data points, significantly improving the computational efficiency. We conduct comprehensive experiments using different datasets and tasks. The results demonstrate that the proposed NDDV method outperforms the existing state-of-the-art data valuation methods in accurately identifying data points with either high or low values and is more computationally efficient.

artificial intelligence, machine learning, optimization problem, (19 more...)

arXiv.org Machine Learning

2404.19557

Country: North America > United States > Kansas > Cowley County (0.24)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Data Science (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
(2 more...)

Add feedback

On Learning what to Learn: heterogeneous observations of dynamics and establishing (possibly causal) relations among them

Sroczynski, David W., Dietrich, Felix, Koronaki, Eleni D., Talmon, Ronen, Coifman, Ronald R., Bollt, Erik, Kevrekidis, Ioannis G.

arXiv.org Artificial IntelligenceJun-10-2024

Before we attempt to learn a function between two (sets of) observables of a physical process, we must first decide what the inputs and what the outputs of the desired function are going to be. Here we demonstrate two distinct, data-driven ways of initially deciding ``the right quantities'' to relate through such a function, and then proceed to learn it. This is accomplished by processing multiple simultaneous heterogeneous data streams (ensembles of time series) from observations of a physical system: multiple observation processes of the system. We thus determine (a) what subsets of observables are common between the observation processes (and therefore observable from each other, relatable through a function); and (b) what information is unrelated to these common observables, and therefore particular to each observation process, and not contributing to the desired function. Any data-driven function approximation technique can subsequently be used to learn the input-output relation, from k-nearest neighbors and Geometric Harmonics to Gaussian Processes and Neural Networks. Two particular ``twists'' of the approach are discussed. The first has to do with the identifiability of particular quantities of interest from the measurements. We now construct mappings from a single set of observations of one process to entire level sets of measurements of the process, consistent with this single set. The second attempts to relate our framework to a form of causality: if one of the observation processes measures ``now'', while the second observation process measures ``in the future'', the function to be learned among what is common across observation processes constitutes a dynamical model for the system evolution.

artificial intelligence, machine learning, sensor 1, (16 more...)

arXiv.org Artificial Intelligence

2406.06812

Country:

North America > United States (1.00)
Europe (0.68)
Asia > Middle East > Israel (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Efficient k-Nearest-Neighbor Machine Translation with Dynamic Retrieval

Gao, Yan, Cao, Zhiwei, Miao, Zhongjian, Yang, Baosong, Liu, Shiyu, Zhang, Min, Su, Jinsong

arXiv.org Artificial IntelligenceJun-10-2024

To achieve non-parametric NMT domain adaptation, $k$-Nearest-Neighbor Machine Translation ($k$NN-MT) constructs an external datastore to store domain-specific translation knowledge, which derives a $k$NN distribution to interpolate the prediction distribution of the NMT model via a linear interpolation coefficient $\lambda$. Despite its success, $k$NN retrieval at each timestep leads to substantial time overhead. To address this issue, dominant studies resort to $k$NN-MT with adaptive retrieval ($k$NN-MT-AR), which dynamically estimates $\lambda$ and skips $k$NN retrieval if $\lambda$ is less than a fixed threshold. Unfortunately, $k$NN-MT-AR does not yield satisfactory results. In this paper, we first conduct a preliminary study to reveal two key limitations of $k$NN-MT-AR: 1) the optimization gap leads to inaccurate estimation of $\lambda$ for determining $k$NN retrieval skipping, and 2) using a fixed threshold fails to accommodate the dynamic demands for $k$NN retrieval at different timesteps. To mitigate these limitations, we then propose $k$NN-MT with dynamic retrieval ($k$NN-MT-DR) that significantly extends vanilla $k$NN-MT in two aspects. Firstly, we equip $k$NN-MT with a MLP-based classifier for determining whether to skip $k$NN retrieval at each timestep. Particularly, we explore several carefully-designed scalar features to fully exert the potential of the classifier. Secondly, we propose a timestep-aware threshold adjustment method to dynamically generate the threshold, which further improves the efficiency of our model. Experimental results on the widely-used datasets demonstrate the effectiveness and generality of our model.\footnote{Our code is available at \url{https://github.com/DeepLearnXMU/knn-mt-dr}.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2406.06073

Country:

Asia > China (0.14)
Asia > Taiwan (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.87)

Add feedback

Hinge-FM2I: An Approach using Image Inpainting for Interpolating Missing Data in Univariate Time Series

Saad, Noufel, Nadir, Maaroufi, Mehdi, Najib, Mohamed, Bakhouya

arXiv.org Machine LearningJun-8-2024

Accurate time series forecasts are crucial for various applications, such as traffic management, electricity consumption, and healthcare. However, limitations in models and data quality can significantly impact forecasts accuracy. One common issue with data quality is the absence of data points, referred to as missing data. It is often caused by sensor malfunctions, equipment failures, or human errors. This paper proposes Hinge-FM2I, a novel method for handling missing data values in univariate time series data. Hinge-FM2I builds upon the strengths of the Forecasting Method by Image Inpainting (FM2I). FM2I has proven effective, but selecting the most accurate forecasts remain a challenge. To overcome this issue, we proposed a selection algorithm. Inspired by door hinges, Hinge-FM2I drops a data point either before or after the gap (left/right-hinge), then use FM2I for imputation, and then select the imputed gap based on the lowest error of the dropped data point. Hinge-FM2I was evaluated on a comprehensive sample composed of 1356 time series, extracted from the M3 competition benchmark dataset, with missing value rates ranging from 3.57\% to 28.57\%. Experimental results demonstrate that Hinge-FM2I significantly outperforms established methods such as, linear/spline interpolation, K-Nearest Neighbors (K-NN), and ARIMA. Notably, Hinge-FM2I achieves an average Symmetric Mean Absolute Percentage Error (sMAPE) score of 5.6\% for small gaps, and up to 10\% for larger ones. These findings highlight the effectiveness of Hinge-FM2I as a promising new method for addressing missing values in univariate time series data.

data mining, data quality, machine learning, (21 more...)

arXiv.org Machine Learning

2406.06631

Country: North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.25)

Genre: Research Report > New Finding (0.88)

Industry:

Energy (1.00)
Water & Waste Management > Water Management > Lifecycle > Treatment (0.93)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation

Harwood, Ben, Dezfouli, Amir, Chades, Iadine, Sanderson, Conrad

arXiv.org Artificial IntelligenceJun-5-2024

Approximate k-Nearest Neighbour (ANN) methods are often used for mining information and aiding machine learning on large scale high-dimensional datasets. ANN methods typically differ in the index structure used for accelerating searches, resulting in various recall/runtime trade-off points. For applications with static datasets, runtime constraints and dataset properties can be used to empirically select an ANN method with suitable operating characteristics. However, for applications with dynamic datasets, which are subject to frequent online changes (like addition of new samples), there is currently no consensus as to which ANN methods are most suitable. Traditional evaluation approaches do not consider the computational costs of updating the index structure, as well as the rate and size of index updates. To address this, we empirically evaluate 5 popular ANN methods on two main applications (online data collection and online feature learning) while taking into account these considerations. Two dynamic datasets are used, derived from the SIFT1M dataset with 1 million samples and the DEEP1B dataset with 1 billion samples. The results indicate that the often used k-d trees method is not suitable on dynamic datasets as it is slower than a straightforward baseline exhaustive search method. For online data collection, the Hierarchical Navigable Small World Graphs method achieves a consistent speedup over baseline across a wide range of recall rates. For online feature learning, the Scalable Nearest Neighbours method is faster than baseline for recall rates below 75%.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2404.19284

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.34)

Add feedback

ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation

Huang, Chen, Jin, Yiping, Ilievski, Ilija, Lei, Wenqiang, Lv, Jiancheng

arXiv.org Artificial IntelligenceJun-1-2024

Human annotation is a time-consuming task that requires a significant amount of effort. To address this issue, interactive data annotation utilizes an annotation model to provide suggestions for humans to approve or correct. However, annotation models trained with limited labeled data are prone to generating incorrect suggestions, leading to extra human correction effort. To tackle this challenge, we propose Araida, an analogical reasoning-based approach that enhances automatic annotation accuracy in the interactive data annotation setting and reduces the need for human corrections. Araida involves an error-aware integration strategy that dynamically coordinates an annotation model and a k-nearest neighbors (KNN) model, giving more importance to KNN's predictions when predictions from the annotation model are deemed inaccurate. Empirical studies demonstrate that Araida is adaptable to different annotation tasks and models. On average, it reduces human correction labor by 11.02% compared to vanilla interactive data annotation methods.

annotation model, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2405.11912

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.67)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.46)
Media > Music (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)

Add feedback

A Novel Pseudo Nearest Neighbor Classification Method Using Local Harmonic Mean Distance

Chen, Junzhuo, Lu, Zhixin, Kang, Shitong

arXiv.org Artificial IntelligenceMay-27-2024

In the realm of machine learning, the KNN classification algorithm is widely recognized for its simplicity and efficiency. However, its sensitivity to the K value poses challenges, especially with small sample sizes or outliers, impacting classification performance. This article introduces a novel KNN-based classifier called LMPHNN (Novel Pseudo Nearest Neighbor Classification Method Using Local Harmonic Mean Distance). LMPHNN leverages harmonic mean distance (HMD) to improve classification performance based on LMPNN rules and HMD. The classifier begins by identifying k nearest neighbors for each class and generates distinct local vectors as prototypes. Pseudo nearest neighbors (PNNs) are then created based on the local mean for each class, determined by comparing the HMD of the sample with the initial k group. Classification is determined by calculating the Euclidean distance between the query sample and PNNs, based on the local mean of these categories. Extensive experiments on various real UCI datasets and combined datasets compare LMPHNN with seven KNN-based classifiers, using precision, recall, accuracy, and F1 as evaluation metrics. LMPHNN achieves an average precision of 97%, surpassing other methods by 14%. The average recall improves by 12%, with an average accuracy enhancement of 5%. Additionally, LMPHNN demonstrates a 13% higher average F1 value compared to other methods. In summary, LMPHNN outperforms other classifiers, showcasing lower sensitivity with small sample sizes.

artificial intelligence, classifier, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2405.06238

Country:

Asia > China (0.14)
Europe > Spain (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Add feedback