Nearest Neighbor Methods
Effective Data Pruning through Score Extrapolation
Schmidt, Sebastian, Dhungel, Prasanga, Lรถffler, Christoffer, Nieth, Bjรถrn, Gรผnnemann, Stephan, Schwinn, Leo
Training advanced machine learning models demands massive datasets, resulting in prohibitive computational costs. To address this challenge, data pruning techniques identify and remove redundant training samples while preserving model performance. Yet, existing pruning techniques predominantly require a full initial training pass to identify removable samples, negating any efficiency benefits for single training runs. To overcome this limitation, we introduce a novel importance score extrapolation framework that requires training on only a small subset of data. We present two initial approaches in this framework - k-nearest neighbors and graph neural networks - to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset. We demonstrate the effectiveness of our approach for 2 state-of-the-art pruning methods (Dynamic Uncertainty and TDDS), 4 different datasets (CIFAR-10, CIFAR-100, Places-365, and ImageNet), and 3 training paradigms (supervised, unsupervised, and adversarial). Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks.
Filling in the Blanks: Applying Data Imputation in incomplete Water Metering Data
Amaxilatis, Dimitrios, Sarantakos, Themistoklis, Chatzigiannakis, Ioannis, Mylonas, Georgios
--In this work, we explore the application of recent data imputation techniques to enhance monitoring and management of water distribution networks using smart water meters, based on data derived from a real-world IoT water grid monitoring deployment. Despite the detailed data produced by such meters, data gaps due to technical issues can significantly impact operational decisions and efficiency. Our results, by comparing various imputation methods, such as k-Nearest Neighbors, MissForest, Transformers, and Recurrent Neural Networks, indicate that effective data imputation can substantially enhance the quality of the insights derived from water consumption data as we study their effect on accuracy and reliability of water metering data to provide solutions in applications like leak detection and predictive maintenance scheduling. In the era of smart cities and advanced utility management, the monitoring of water grids has become increasingly pivotal to ensuring efficient distribution, sustainability, and infrastructure reliability. However, despite their sophistication, the occurrence of missing data due to various factors--ranging from technical malfunctions to data transmission errors-- remains an open challenge that undermines the integrity and actionable insights that can be derived from the datasets produced by such infrastructure. Moreover, the significance of addressing missing data extends beyond mere data completeness. In the context of water grid monitoring, it impacts decision-making processes related to water management, leak detection, and predictive maintenance, all of which have profound implications for operational efficiency and environmental sustainability.
Fully Explainable Classification Models Using Hyperblocks
Snyder, Austin, Gallagher, Ryan, Kovalerchuk, Boris
Building on existing work with Hyperblocks, which classify data using minimum and maximum bounds for each attribute, we focus on enhancing interpretability, decreasing training time, and reducing model complexity without sacrificing accuracy. This system allows subject matter experts (SMEs) to directly inspect and understand the model's decision logic without requiring extensive machine learning expertise. To reduce Hyperblock complexity while retaining performance, we introduce a suite of algorithms for Hyperblock simplification. These include removing redundant attributes, removing redundant blocks through overlap analysis, and creating disjunctive units. These methods eliminate unnecessary parameters, dramatically reducing model size without harming classification power. We increase robustness by introducing an interpretable fallback mechanism using k-Nearest Neighbor (k-NN) classifiers for points not covered by any block, ensuring complete data coverage while preserving model transparency. Our results demonstrate that interpretable models can scale to high-dimensional, large-volume datasets while maintaining competitive accuracy. On benchmark datasets such as WBC (9-D), we achieve strong predictive performance with significantly reduced complexity. On MNIST (784-D), our method continues to improve through tuning and simplification, showing promise as a transparent alternative to black-box models in domains where trust, clarity, and control are crucial.
RADAR: Recall Augmentation through Deferred Asynchronous Retrieval
Jaspal, Amit, Dang, Qian, Ramineni, Ajantha
M odern large - scale recommender systems employ multi - stage ranking funnel (Retrieval, Pre - ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K - Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion - scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval ( RADAR), a novel framework that leverages asynchronous, offline computation to pre - rank a significantly larger candidate set for users using the full complexity ranking model. These top - ranked items are stored and utilized as a high - quality retrieval source during online inference, bypassing online retrieval and pre - ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall ( 2 X Recall @200 vs DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints.
N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion
Chin, Caleb, Khubchandani, Aashish, Maskara, Harshvardhan, Choi, Kyuseong, Feitelberg, Jacob, Gong, Albert, Paul, Manit, Sadhukhan, Tathagata, Agarwal, Anish, Dwivedi, Raaz
Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
Efficient Quantum Approximate $k$NN Algorithm via Granular-Ball Computing
Xia, Shuyin, Tian, Xiaojiang, Yuan, Suzhen, Deng, Jeremiah D.
High time complexity is one of the biggest challenges faced by $k$-Nearest Neighbors ($k$NN). Although current classical and quantum $k$NN algorithms have made some improvements, they still have a speed bottleneck when facing large amounts of data. To address this issue, we propose an innovative algorithm called Granular-Ball based Quantum $k$NN(GB-Q$k$NN). This approach achieves higher efficiency by first employing granular-balls, which reduces the data size needed to processed. The search process is then accelerated by adopting a Hierarchical Navigable Small World (HNSW) method. Moreover, we optimize the time-consuming steps, such as distance calculation, of the HNSW via quantization, further reducing the time complexity of the construct and search process. By combining the use of granular-balls and quantization of the HNSW method, our approach manages to take advantage of these treatments and significantly reduces the time complexity of the $k$NN-like algorithms, as revealed by a comprehensive complexity analysis.
HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection
Data serves as the fundamental basis for advancing deep learning. Therefore, exploring the methods for effectively using models like LLMs to generate synthetic tabular data, which is privacy-preserving but similar to original one, is urgent.In this paper, we introduce a new framework HARMONIC for tabular data generation and evaluation by LLMs. In the data generation part of our framework, we employ fine-tuning to generate tabular data and enhance privacy rather than continued pre-training which is often used by previous small-scale LLM-based methods. In particular, we construct an instruction fine-tuning dataset based on the idea of the k-nearest neighbors algorithm to inspire LLMs to discover inter-row relationships. By such fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage.
Multi-scale Consistency for Robust 3D Registration via Hierarchical Sinkhorn Tree
We study the problem of retrieving accurate correspondence through multi-scale consistency (MSC) for robust point cloud registration. Existing works in a coarse-to-fine manner either suffer from severe noisy correspondences caused by unreliable coarse matching or struggle to form outlier-free coarse-level correspondence sets. To tackle this, we present Hierarchical Sinkhorn Tree (HST), a pruned tree structure designed to hierarchically measure the local consistency of each coarse correspondence across multiple feature scales, thereby filtering out the local dissimilar ones. In this way, we convert the modeling of MSC for each correspondence into a BFS traversal with pruning of a K-ary tree rooted at the superpoint, with its K nearest neighbors in the feature pyramid serving as child nodes. To achieve efficient pruning and accurate vicinity characterization, we further propose a novel overlap-aware Sinkhorn Distance, which retains only the most likely overlapping points for local measurement and next level exploration.
Persistent Homology for High-dimensional Data Based on Spectral Methods
Persistent homology is a popular computational tool for analyzing the topology of point clouds, such as the presence of loops or voids. However, many real-world datasets with low intrinsic dimensionality reside in an ambient space of much higher dimensionality. We show that in this case traditional persistent homology becomes very sensitive to noise and fails to detect the correct topology. The same holds true for existing refinements of persistent homology. As a remedy, we find that spectral distances on the k-nearest-neighbor graph of the data, such as diffusion distance and effective resistance, allow to detect the correct topology even in the presence of high-dimensional noise.
Meta-Neighborhoods
Making an adaptive prediction based on input is an important ability for general artificial intelligence. In this work, we step forward in this direction and propose a semi-parametric method, Meta-Neighborhoods, where predictions are made adaptively to the neighborhood of the input. We show that Meta-Neighborhoods is a generalization of k-nearest-neighbors. Due to the simpler manifold structure around a local neighborhood, Meta-Neighborhoods represent the predictive distribution p(y x) more accurately. To reduce memory and computation overheads, we propose induced neighborhoods that summarize the training data into a much smaller dictionary.