 Nearest Neighbor Methods


Towards the Best Solution for Complex System Reliability: Can Statistics Outperform Machine Learning?

arXiv.org Artificial Intelligence

Studying the reliability of complex systems using machine learning techniques involves facing a series of technical and practical challenges, ranging from the intrinsic nature of the system and data to the difficulties in modeling and effectively deploying models in real-world scenarios. This study compares the effectiveness of classical statistical techniques and machine learning methods for improving complex system analysis in reliability assessments. We aim to demonstrate that classical statistical algorithms often yield more precise and interpretable results than black-box machine learning approaches in many practical applications. The evaluation is conducted using both real-world data and simulated scenarios. We report the results obtained from statistical modeling algorithms, as well as from machine learning methods including neural networks, K-nearest neighbors, and random forests.
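
As a rough illustration of such a comparison (not the study's actual pipeline or data), the sketch below pits a classical statistical model against the machine learning methods named above on simulated failure data; all features, sample sizes, and hyperparameters are placeholders.

```python
# Hypothetical benchmark sketch: a classical statistical model vs. black-box
# ML methods on simulated reliability data (not the authors' pipeline).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                           # simulated stress covariates
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]                  # simple linear failure mechanism
y = (logits + rng.logistic(size=500) > 0).astype(int)   # 1 = component failed

models = {
    "logistic (statistical)": LogisticRegression(),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural net": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()     # 5-fold accuracy
    print(f"{name}: {acc:.3f}")
```

On data with a simple underlying mechanism like this, the statistical model is often at least as accurate as the black-box methods while remaining directly interpretable, which is the paper's central claim.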


Enhancing Carbon Emission Reduction Strategies using OCO and ICOS data

arXiv.org Artificial Intelligence

We propose a methodology to enhance local CO2 monitoring by integrating satellite data from the Orbiting Carbon Observatories (OCO-2 and OCO-3) with ground-level observations from the Integrated Carbon Observation System (ICOS) and weather data from the ECMWF Reanalysis v5 (ERA5). Unlike traditional methods that downsample national data, our approach uses multimodal data fusion for high-resolution CO2 estimations. We employ weighted K-nearest neighbor (KNN) interpolation with machine learning models to predict ground-level CO2 from satellite measurements, achieving a Root Mean Squared Error of 3.92 ppm. Our results show the effectiveness of integrating diverse data sources in capturing local emission patterns, highlighting the value of high-resolution atmospheric transport models. The developed model improves the granularity of CO2 monitoring, providing precise insights for targeted carbon mitigation strategies, and represents a novel application of neural networks and KNN in environmental monitoring, adaptable to various regions and temporal scales.
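
A minimal sketch of distance-weighted KNN interpolation, using scikit-learn's KNeighborsRegressor as a stand-in for the paper's interpolation step; the coordinates, XCO2 values, and k below are invented for illustration.

```python
# Illustrative distance-weighted k-NN interpolation from scattered satellite
# soundings to a ground-station location; all values are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# (lat, lon) of hypothetical OCO-2/3 soundings and their XCO2 retrievals (ppm)
sat_coords = np.array([[48.1, 11.5], [48.3, 11.7], [47.9, 11.4], [48.2, 11.9]])
sat_xco2 = np.array([417.2, 418.0, 416.5, 417.8])

# weights="distance" applies inverse-distance weighting among the k neighbors
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn.fit(sat_coords, sat_xco2)

icos_station = np.array([[48.15, 11.6]])    # hypothetical ICOS site location
print(knn.predict(icos_station))            # interpolated CO2 estimate (ppm)
```

In practice a great-circle (haversine) metric would be more appropriate than Euclidean distance over latitude/longitude; the sketch keeps the default for brevity.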


GABIC: Graph-based Attention Block for Image Compression

arXiv.org Artificial Intelligence

While standardized codecs like JPEG and HEVC-intra remain the industry standard in image compression, neural Learned Image Compression (LIC) codecs are a promising alternative. In particular, integrating attention mechanisms from Vision Transformers into LIC models has improved compression efficiency. However, this extra efficiency often comes at the cost of aggregating redundant features. This work proposes a Graph-based Attention Block for Image Compression (GABIC), a method to reduce feature redundancy based on a k-nearest-neighbors-enhanced attention mechanism. Our experiments show that GABIC outperforms comparable methods, particularly at high bit rates, enhancing compression performance.
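
To make the idea concrete, here is a minimal k-NN-restricted attention sketch in PyTorch, not the paper's exact block: each query token attends only to its k nearest keys in feature space, rather than to all tokens.

```python
# Sketch of k-NN-restricted attention: attention weights are masked so each
# query sees only its num_neighbors nearest keys (by feature-space distance).
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, num_neighbors=8):
    # q, k, v: (batch, tokens, dim)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (B, T, T) logits
    dist = torch.cdist(q, k)                              # pairwise distances
    knn_idx = dist.topk(num_neighbors, largest=False).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, knn_idx, 0.0)                       # unmask k neighbors only
    attn = F.softmax(scores + mask, dim=-1)               # sparse attention weights
    return attn @ v

q = k = v = torch.randn(1, 64, 32)                        # 64 tokens, dim 32
print(knn_attention(q, k, v).shape)                       # torch.Size([1, 64, 32])
```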


Statistical Guarantees of Distributed Nearest Neighbor Classification

Neural Information Processing Systems

Nearest neighbor is a popular nonparametric method for classification and regression with many appealing properties. In the big data era, the sheer volume and spatial/temporal disparity of big data may prohibit centrally processing and storing the data. This imposes a considerable hurdle for nearest neighbor predictions, since the entire training data must be memorized. One effective way to overcome this issue is the distributed learning framework. Through majority voting, the distributed nearest neighbor classifier achieves the same rate of convergence as its oracle version in terms of the regret, up to a multiplicative constant that depends solely on the data dimension. The multiplicative difference can be eliminated by replacing majority voting with the weighted voting scheme. In addition, we provide sharp theoretical upper bounds on the number of subsamples needed for the distributed nearest neighbor classifier to reach the optimal convergence rate. It is interesting to note that the weighted voting scheme allows a larger number of subsamples than the majority voting one. Our findings are supported by numerical studies.
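
A small sketch of the distributed scheme with majority voting, using scikit-learn and toy data (the paper's contribution is the regret analysis, not this particular setup): the training set is split into s subsamples, a 1-NN classifier is fit on each, and the s local predictions are aggregated by vote.

```python
# Distributed 1-NN with majority voting over s disjoint subsamples; the data,
# s, and k are illustrative rather than the paper's experimental settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=3000, n_features=5, random_state=0)
s = 30                                                  # number of subsamples ("machines")
parts = np.array_split(np.random.default_rng(0).permutation(len(X)), s)

# one local 1-NN classifier per subsample
local = [KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx]) for idx in parts]

def predict_majority(x):
    votes = np.array([clf.predict(x) for clf in local])  # (s, n_test) local labels
    return (votes.mean(axis=0) > 0.5).astype(int)        # majority vote (binary case)

print(predict_majority(X[:5]))
```

The weighted voting scheme the paper analyzes would replace the uniform vote above with neighbor-rank-dependent weights.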


High-Resolution Flood Probability Mapping Using Generative Machine Learning with Large-Scale Synthetic Precipitation and Inundation Data

arXiv.org Artificial Intelligence

High-resolution flood probability maps are essential for addressing the limitations of existing flood risk assessment approaches, but they are often limited by the availability of historical event data. Moreover, producing the simulated data needed to create probabilistic flood maps with physics-based models demands significant computation and time, limiting its feasibility. To address this gap, this study introduces Flood-Precip GAN (Flood-Precipitation Generative Adversarial Network), a novel methodology that leverages generative machine learning to simulate large-scale synthetic inundation data for producing probabilistic flood maps. With a focus on Harris County, Texas, Flood-Precip GAN begins by training a cell-wise depth estimator on a limited number of physics-based model-generated precipitation-flood events. This model, which emphasizes precipitation-based features, outperforms universal models. Subsequently, a Generative Adversarial Network (GAN) with constraints is employed to conditionally generate synthetic precipitation records. Strategic thresholds are established to filter these records, ensuring close alignment with true precipitation patterns. For each cell, synthetic events are smoothed using a K-nearest neighbors algorithm and processed through the depth estimator to derive synthetic depth distributions. After iterating this procedure to generate 10,000 synthetic precipitation-flood events, we construct flood probability maps in various formats, considering different inundation depths. Validation through similarity and correlation metrics confirms the fidelity of the synthetic depth distributions relative to true data. Flood-Precip GAN provides a scalable solution for generating the synthetic flood depth data needed to create high-resolution flood probability maps, significantly enhancing flood preparedness and mitigation efforts.
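
The abstract leaves the per-cell smoothing step underspecified; one plausible reading, sketched below with invented data, replaces each synthetic event's value with the mean over its k nearest events in feature space.

```python
# One possible interpretation of the per-cell k-NN smoothing step (the
# abstract does not fully specify it); all data here is synthetic filler.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))        # synthetic precipitation features
values = rng.gamma(2.0, 1.0, size=200)      # raw synthetic values for one cell

nn = NearestNeighbors(n_neighbors=5).fit(features)
_, idx = nn.kneighbors(features)            # each row: an event's 5 nearest events
smoothed = values[idx].mean(axis=1)         # k-NN-smoothed values per event
print(smoothed[:5])
```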


Intuitive Human-Robot Interface: A 3-Dimensional Action Recognition and UAV Collaboration Framework

arXiv.org Artificial Intelligence

Harnessing human movements to command an Unmanned Aerial Vehicle (UAV) holds the potential to revolutionize their deployment, rendering it more intuitive and user-centric. In this research, we introduce a novel methodology for classifying three-dimensional human actions and leveraging them to coordinate with a UAV in the field. Using a stereo camera, we derive both RGB and depth data and extract three-dimensional human poses from the continuous video feed. This data is then processed through our proposed k-nearest neighbour classifier, whose output dictates the behaviour of the UAV. The framework also includes mechanisms that keep the human within the robot's field of view at all times, continuously tracking user movements. We subjected our approach to rigorous testing with real robots. The results, coupled with comprehensive analysis, underscore the efficacy and inherent advantages of our proposed methodology.
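
A toy sketch of the classify-then-command loop, with invented pose features, action labels, and UAV commands (the actual system consumes stereo-derived 3-D poses):

```python
# Toy version of the pipeline: k-NN over 3-D pose feature vectors, then a
# lookup from the predicted action to a hypothetical UAV command.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# each row: flattened 3-D joint coordinates for one recorded pose (placeholder)
X_train = np.random.default_rng(0).normal(size=(120, 45))
y_train = np.repeat(["takeoff", "land", "follow"], 40)   # invented action labels

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# hypothetical mapping from recognized action to UAV behaviour
COMMANDS = {"takeoff": "uav.takeoff()", "land": "uav.land()", "follow": "uav.track_user()"}

pose = np.random.default_rng(1).normal(size=(1, 45))     # pose from the stereo pipeline
action = clf.predict(pose)[0]
print(action, "->", COMMANDS[action])
```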


dzNLP at NADI 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and TF-IDF Features

arXiv.org Artificial Intelligence

This paper presents the contribution of our dzNLP team to the NADI 2024 shared task, specifically in Subtask 1 - Multi-label Country-level Dialect Identification (MLDID) (Closed Track). We explored various configurations to address the challenge: in Experiment 1, we utilized a union of n-gram analyzers (word, character, character with word boundaries) with different n-gram values; in Experiment 2, we combined a weighted union of Term Frequency-Inverse Document Frequency (TF-IDF) features with various weights; and in Experiment 3, we implemented a weighted majority voting scheme using three classifiers: Linear Support Vector Classifier (LSVC), Random Forest (RF), and K-Nearest Neighbors (KNN). Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of F1-score and precision. Notably, we achieved the highest precision score of 63.22% among the participating teams. However, our overall F1 score was approximately 21%, significantly impacted by a low recall rate of 12.87%. This indicates that while our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement in handling diverse dialectal variations.
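
A compact scikit-learn rendering of the Experiment 2 and 3 ingredients combined, with illustrative feature weights, n-gram ranges, voting weights, and labels rather than the team's tuned values:

```python
# Sketch: weighted union of word- and character-level TF-IDF features feeding
# a weighted hard-voting ensemble of LSVC, RF, and KNN (illustrative settings).
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

features = FeatureUnion(
    [("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
     ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)))],
    transformer_weights={"word": 1.0, "char": 2.0},      # placeholder weights
)
ensemble = VotingClassifier(
    [("lsvc", LinearSVC()),
     ("rf", RandomForestClassifier()),
     ("knn", KNeighborsClassifier(n_neighbors=1))],
    voting="hard", weights=[3, 2, 1],                    # placeholder vote weights
)
model = Pipeline([("tfidf", features), ("clf", ensemble)])

# tiny placeholder corpus with invented country-code labels
texts = ["example dialect sentence", "another training sentence",
         "a third sample text", "one more sample text"]
labels = ["DZ", "DZ", "MA", "MA"]
model.fit(texts, labels)
print(model.predict(["a new test sentence"]))
```

Hard voting is used here because LinearSVC does not expose class probabilities; the paper's exact voting rule may differ.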


Evaluating the performance-deviation of itemKNN in RecBole and LensKit

arXiv.org Artificial Intelligence

This study examines the performance of item-based k-Nearest Neighbors (ItemKNN) algorithms in the RecBole and LensKit recommender system libraries. Using four data sets (Anime, ModCloth, ML-100K, and ML-1M), we assess each library's efficiency, accuracy, and scalability, focusing primarily on normalized discounted cumulative gain (nDCG). Our results show that RecBole outperforms LensKit on two of three metrics on the ML-100K data set: it achieved an 18% higher nDCG, 14% higher precision, and 35% lower recall. To ensure a fair comparison, we adjusted LensKit's nDCG calculation to match RecBole's method. This alignment made the performance more comparable, with LensKit achieving an nDCG of 0.2540 and RecBole 0.2674. Differences in similarity matrix calculations were identified as the main cause of the performance deviations. After modifying LensKit to retain only the top K similar items, both libraries showed nearly identical nDCG values across all data sets. For instance, both achieved an nDCG of 0.2586 on the ML-1M data set with the same random seed. Initially, LensKit's original implementation surpassed RecBole only on the ModCloth data set.
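
For reference, here is a NumPy sketch of the step the study pinpoints: computing an item-item cosine similarity matrix and retaining only the top-K most similar items per item, as RecBole does and as LensKit was modified to do. The rating matrix is toy data.

```python
# Item-item cosine similarity with top-K truncation, then ItemKNN scoring:
# score(u, i) = sum over j in topK(i) of sim(i, j) * r(u, j). Toy data only.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(50, 8)).astype(float)   # users x items

norms = np.linalg.norm(ratings, axis=0, keepdims=True) + 1e-12
sim = (ratings / norms).T @ (ratings / norms)               # cosine similarity (items x items)
np.fill_diagonal(sim, 0.0)                                  # ignore self-similarity

K = 3
keep = np.argsort(sim, axis=1)[:, -K:]                      # top-K neighbors per item
truncated = np.zeros_like(sim)
rows = np.arange(sim.shape[0])[:, None]
truncated[rows, keep] = sim[rows, keep]                     # zero out all but top-K

scores = ratings @ truncated.T                              # user-item recommendation scores
print(scores.shape)                                         # (50, 8)
```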


On high-dimensional modifications of the nearest neighbor classifier

arXiv.org Machine Learning

In supervised classification, we use a training set of labeled observations from different competing classes to form a decision rule for classifying unlabeled test set observations as accurately as possible. Starting from Fisher (1936), Rao (1948), and Fix and Hodges (1951), several parametric as well as nonparametric classifiers have been developed for this purpose (see, e.g., Duda et al., 2007; Hastie et al., 2009). Among them, the nearest neighbor classifier (see, e.g., Cover and Hart, 1967) is perhaps the most popular. The k-nearest neighbor (k-NN) classifier assigns an observation x to the class with the maximum number of representatives among the k nearest neighbors of x. This classifier works well if the training sample size is large compared to the dimension of the data. For a suitable choice of k (which increases with the training sample size at an appropriate rate), under some mild regularity conditions, the misclassification rate of the k-NN classifier converges to the Bayes risk (i.e., the misclassification rate of the Bayes classifier) as the training sample size grows to infinity (see, e.g.
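
A from-scratch rendering of the rule just stated, assigning x to the class with the most representatives among its k nearest training points:

```python
# Plain k-NN rule: majority class among the k nearest training observations.
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k nearest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.95, 1.0]), X_train, y_train, k=3))  # -> 1
```

In high dimensions, the Euclidean distances above concentrate and neighbors become less informative, which is precisely the regime the paper's modifications target.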


Neurocache: Efficient Vector Retrieval for Long-range Language Modeling

arXiv.org Artificial Intelligence

This paper introduces Neurocache, an approach to extend the effective context size of large language models (LLMs) using an external vector cache to store its past states. Like recent vector retrieval approaches, Neurocache uses an efficient k-nearest-neighbor (kNN) algorithm to retrieve relevant past states and incorporate them into the attention process. Neurocache improves upon previous methods by (1) storing compressed states, which reduces cache size; (2) performing a single retrieval operation per token, which increases inference speed; and (3) extending the retrieval window to neighboring states, which improves both language modeling and downstream task accuracy. Our experiments show the effectiveness of Neurocache both for models trained from scratch and for pre-trained models such as Llama2-7B and Mistral-7B when enhanced with the cache mechanism. We also compare Neurocache with text retrieval methods and show improvements in single-document question-answering and few-shot learning tasks.

[Figure 1: Performance and scalability of Neurocache vs. Memorizing Transformers (Wu et al., 2022) on PG-19. The graph illustrates Neurocache's consistently lower token perplexity and faster inference times across various cache sizes on the Project Gutenberg-19 dataset.]
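
A simplified sketch of the retrieval step described above (not the released Neurocache code): the current hidden state queries a cache of compressed past states, the k nearest entries are fetched, and each is extended to a small window of neighboring states before being handed to attention.

```python
# Toy kNN retrieval from a vector cache of compressed past states, with the
# neighbor-window extension the paper describes; sizes are placeholders.
import torch

cache = torch.randn(1024, 64)                 # compressed past states (T, d)
query = torch.randn(64)                       # current token's hidden state

scores = cache @ query                        # dot-product similarity to all entries
topk = scores.topk(k=4).indices               # indices of the k nearest past states

# extend each hit to a window of neighboring states (here +/- 1 position)
window = torch.stack([topk - 1, topk, topk + 1], dim=1).clamp(0, cache.size(0) - 1)
retrieved = cache[window.flatten()]           # states passed on to the attention step
print(retrieved.shape)                        # torch.Size([12, 64])
```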