Nearest Neighbor Methods
Lexically-Accelerated Dense Retrieval
Kulkarni, Hrishikesh, MacAvaney, Sean, Goharian, Nazli, Frieder, Ophir
Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user's query (thereby improving recall) is one of their key advantages. However, to actually achieve these gains, dense retrieval approaches typically require an exhaustive search over the document collection, making them considerably more expensive at query-time than conventional lexical approaches. Several techniques aim to reduce this computational overhead by approximating the results of a full dense retriever. Although these approaches reasonably approximate the top results, they suffer in terms of recall -- one of the key advantages of dense retrieval. We introduce 'LADR' (Lexically-Accelerated Dense Retrieval), a simple-yet-effective approach that improves the efficiency of existing dense retrieval models without compromising on retrieval effectiveness. LADR uses lexical retrieval techniques to seed a dense retrieval exploration that uses a document proximity graph. We explore two variants of LADR: a proactive approach that expands the search space to the neighbors of all seed documents, and an adaptive approach that selectively searches the documents with the highest estimated relevance in an iterative fashion. Through extensive experiments across a variety of dense retrieval models, we find that LADR establishes a new dense retrieval effectiveness-efficiency Pareto frontier among approximate k-nearest-neighbor techniques. Further, we find that when tuned to a retrieval latency of around 8 ms per query on our hardware, LADR consistently achieves both precision and recall on par with an exhaustive search on standard benchmarks.
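The proactive variant admits a compact description: a lexical retriever (e.g., BM25) supplies seed documents, the candidate pool is expanded with each seed's neighbors in a precomputed document proximity graph, and only this pool is scored exactly against the dense query vector. The following is a minimal sketch under that reading; the `neighbor_graph` and `seed_ids` structures and the plain dot-product scoring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ladr_proactive(query_vec, doc_vecs, neighbor_graph, seed_ids, k=10):
    """Proactive LADR sketch: score lexical seeds plus their graph neighbors.

    query_vec      : (d,) dense query embedding
    doc_vecs       : (N, d) dense document embeddings
    neighbor_graph : dict mapping doc id -> list of proximate doc ids
                     (a precomputed document proximity graph)
    seed_ids       : doc ids returned by a lexical retriever such as BM25
    """
    # Expand the candidate pool to the seeds and all of their graph neighbors.
    candidates = set(seed_ids)
    for doc_id in seed_ids:
        candidates.update(neighbor_graph.get(doc_id, []))

    # Exact dense scoring is restricted to this (small) candidate pool,
    # avoiding an exhaustive scan over the full collection.
    cand = np.fromiter(candidates, dtype=int)
    scores = doc_vecs[cand] @ query_vec
    top = np.argsort(-scores)[:k]
    return [(int(cand[i]), float(scores[i])) for i in top]
```

The adaptive variant would instead expand neighbors iteratively, only around the current best-scoring candidates, trading a little latency variance for a smaller scored pool.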
Few-shot Image Classification based on Gradual Machine Learning
Chen, Na, Kuang, Xianming, Liu, Feiyu, Wang, Kehao, Chen, Qun
Few-shot image classification aims to accurately classify unlabeled images using only a few labeled samples. The state-of-the-art solutions are built by deep learning, which focuses on designing increasingly complex deep backbones. Unfortunately, the task remains very challenging due to the difficulty of transferring the knowledge learned in the training classes to new ones. In this paper, we propose a novel approach based on the non-i.i.d. paradigm of gradual machine learning (GML). It begins with only a few labeled observations, and then gradually labels target images in increasing order of hardness by iterative factor inference in a factor graph. Specifically, our proposed solution extracts indicative feature representations with deep backbones, and then constructs both unary and binary factors based on the extracted features to facilitate gradual learning. The unary factors are constructed based on class-center distance in an embedding space, while the binary factors are constructed based on the k-nearest neighborhood. We have empirically validated the performance of the proposed approach on benchmark datasets through a comparative study. Our extensive experiments demonstrate that the proposed approach can improve the SOTA performance by 1-5% in terms of accuracy. More notably, it is more robust than the existing deep models in that its performance consistently improves as the size of the query set increases, while the performance of deep models remains essentially flat or even degrades.
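A rough sketch of the two factor types described above, under our own reading: unary factors score each image against class centers computed from the few labeled support samples, and binary factors connect each image to its k nearest neighbors in the embedding space. All names and the exact scoring are assumptions; the paper's iterative factor-graph inference is not shown.

```python
import numpy as np

def build_factors(embeddings, support_idx, support_labels, k=5):
    """Construct unary and binary factors from backbone embeddings (sketch).

    embeddings     : (N, d) features from a deep backbone for all images
    support_idx    : indices of the few labeled (support) images
    support_labels : labels of the support images, aligned with support_idx
    """
    classes = sorted(set(support_labels))
    # Class centers from the labeled support samples only.
    centers = np.stack([
        embeddings[[i for i, l in zip(support_idx, support_labels) if l == c]].mean(axis=0)
        for c in classes
    ])

    # Unary factors: one score per (image, class) pair, here the negative
    # Euclidean distance to each class center.
    unary = -np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)

    # Binary factors: k-nearest-neighbor edges in the embedding space,
    # encouraging neighboring images to share a label during inference.
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn = np.argsort(dists, axis=1)[:, :k]
    binary = [(i, int(j)) for i in range(len(embeddings)) for j in knn[i]]
    return unary, binary
```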
Non-invasive Diabetes Detection using Gabor Filter: A Comparative Analysis of Different Cameras
Garcia, Christina A., Abu, Patricia Angela R., Reyes, Rosula SJ.
This paper compares and explores the performance of both mobile device cameras and a laptop camera as convenient tools for capturing images for the non-invasive detection of Diabetes Mellitus (DM) using facial block texture features. Participants within the age bracket of 20 to 79 years old were chosen for the dataset. 12 MP and 7 MP mobile cameras, and a laptop camera, were used to take the photos under normal lighting conditions. Extracted facial blocks were classified using k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM). 100 images were captured, preprocessed, filtered using Gabor filters, and iterated. Performance of the system was measured in terms of accuracy, specificity, and sensitivity. The best performance of 96.7% accuracy, 100% sensitivity, and 93% specificity was achieved with the 12 MP back camera using SVM on 100 images.
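The feature pipeline is standard enough to sketch: each facial block is filtered with a small bank of Gabor kernels, pooled response statistics form the feature vector, and an SVM or k-NN classifier is trained on the result. The kernel parameters below are illustrative, not those used in the paper.

```python
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def gabor_features(block, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter one grayscale facial block with a Gabor bank at several
    orientations and pool the responses into a feature vector."""
    feats = []
    for theta in thetas:
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5)
        response = cv2.filter2D(block, cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])
    return np.array(feats)

# With X (stacked block features) and y (DM / non-DM labels) in hand,
# either classifier from the paper is a one-liner to fit:
#   clf = SVC(kernel="rbf").fit(X_train, y_train)                    # SVM
#   clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # k-NN
```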
From Contextual Data to Newsvendor Decisions: On the Actual Performance of Data-Driven Algorithms
Besbes, Omar, Ma, Will, Mouchtaki, Omar
In this work, we explore a framework for contextual decision-making to study how the relevance and quantity of past data affect the performance of a data-driven policy. We analyze a contextual Newsvendor problem in which a decision-maker needs to trade off between an underage and an overage cost in the face of uncertain demand. We consider a setting in which past demands observed under 'close by' contexts come from close by distributions, and we analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret. We analyze the broad class of Weighted Empirical Risk Minimization (WERM) policies, which weigh past data according to their similarity in the contextual space. This class includes classical policies such as ERM, k-Nearest Neighbors, and kernel-based policies. Our main methodological contribution is to characterize exactly the worst-case regret of any WERM policy on any given configuration of contexts. To the best of our knowledge, this provides the first understanding of tight performance guarantees in any contextual decision-making problem, with past literature focusing on upper bounds via concentration inequalities. We instead take an optimization approach and isolate a structure in the Newsvendor loss function that allows us to reduce the infinite-dimensional optimization problem over worst-case distributions to a simple line search. This in turn allows us to unveil fundamental insights that were obfuscated by previous general-purpose bounds. We characterize the actual guaranteed performance as a function of the contexts and provide granular insights on the learning curve of algorithms.
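For the Newsvendor loss, any WERM policy has a closed-form solution: the order quantity is a weighted quantile of past demands at the critical ratio b/(b+h), where b is the underage and h the overage cost. The sketch below shows this together with k-NN weighting, one member of the WERM class the paper analyzes; the function names and the Euclidean context distance are illustrative choices.

```python
import numpy as np

def werm_newsvendor(demands, weights, underage_cost, overage_cost):
    """WERM order quantity for the newsvendor: the weighted quantile of
    past demands at the critical ratio b / (b + h)."""
    ratio = underage_cost / (underage_cost + overage_cost)
    order = np.argsort(demands)
    d, w = np.asarray(demands, float)[order], np.asarray(weights, float)[order]
    cum = np.cumsum(w) / w.sum()
    # Smallest past demand whose cumulative weight reaches the ratio.
    return d[np.searchsorted(cum, ratio)]

def knn_weights(contexts, query_context, k=10):
    """k-NN weighting: equal weight on the k past contexts closest to the
    query context, zero elsewhere (a member of the WERM class)."""
    dists = np.linalg.norm(np.asarray(contexts) - np.asarray(query_context), axis=1)
    w = np.zeros(len(contexts))
    w[np.argsort(dists)[:k]] = 1.0 / k
    return w
```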
A Competitive Learning Approach for Specialized Models: A Solution for Complex Physical Systems with Distinct Functional Regimes
Ukorigho, Okezzi F., Owoyele, Opeoluwa
In the era of data-driven science, machine learning has emerged as a transformative tool, offering unprecedented solutions to complex problems across a wide range of scientific and technological domains. Specifically, machine learning has found applications in diverse fields such as biology, medicine, material science, engineering, energy, manufacturing, and agriculture. Notable examples include rapid detection of SARS-CoV-2 Ikponmwoba et al. [2022], advances in drug discovery and development Talevi et al. [2020], quality control and defect detection Wang et al. [2022], climate modeling and prediction Krasnopolsky and Fox-Rabinovitz [2006], as well as crop yield forecasting and optimization Di et al. [2022]. Some of these applications involve classification, i.e., learning from categorical data. In this regard, machine learning techniques such as Support Vector Machines, Naive Bayes, K-nearest neighbor, and Neural Networks have been used to extract texture features from images for subsequent classification Chola et al. [2022a,b]. Additionally, machine learning has facilitated the identification of high-order closure terms from fully kinetic simulations, a critical aspect of multi-scale modeling Laperre et al. [2022]. On the other hand, function approximation or regression involves estimating a continuous target quantity as a function of a set of input variables. Methods such as Sparse Identification of Nonlinear Dynamics (SINDy) Brunton et al. [2016], the Least Absolute Shrinkage and Selection Operator (LASSO) Tibshirani [1996], Dynamic Mode Decomposition (DMD) Schmid [2010], Mezić [2005], the Koopman operator Mezić [2013], and the Eigensystem Realization Algorithm (ERA) Juang and Pappa [1985] have contributed significantly to understanding complex systems by offering effective strategies for model selection, variable regularization, decomposition of high-dimensional systems, and extraction of state-space models from input-output data.
FR-LIO: Fast and Robust Lidar-Inertial Odometry by Tightly-Coupled Iterated Kalman Smoother and Robocentric Voxels
Liu, Jun, Zhang, Yunzhou, Zhao, Xiaoyu, He, Zhengnan
This paper presents a fast lidar-inertial odometry (LIO) system that is robust to aggressive motion. To achieve robust tracking in aggressive motion scenes, we exploit the continuous scanning property of lidar to adaptively divide the full scan into multiple partial scans (named sub-frames) according to the motion intensity. To avoid the degradation of sub-frames resulting from insufficient constraints, we propose a robust state estimation method based on a tightly-coupled iterated error state Kalman smoother (ESKS) framework. Furthermore, we propose a robocentric voxel map (RC-Vox) to improve the system's efficiency. The RC-Vox allows efficient maintenance of map points and k-nearest neighbor (k-NN) queries by mapping local map points into a fixed-size, two-layer 3D array structure. Extensive experiments are conducted on 27 sequences from 4 public datasets and our own dataset. The results show that our system can achieve stable tracking in aggressive motion scenes (angular velocity up to 21.8 rad/s) that cannot be handled by other state-of-the-art methods, while achieving competitive performance with these methods in general scenes. Furthermore, thanks to the RC-Vox, our system is much faster than the most efficient LIO system currently published.
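The RC-Vox idea can be illustrated with a simplified single-layer grid: map points are bucketed into a fixed-size 3D array centered on the robot, so insertion and k-NN queries touch only a handful of buckets rather than a global tree. This is a minimal sketch; the paper's actual two-layer array layout and map maintenance policy are not reproduced here.

```python
import numpy as np

class RobocentricVoxelGrid:
    """Simplified robocentric voxel map: a fixed-size grid of point buckets
    centered on the robot, supporting local k-NN queries."""

    def __init__(self, grid_size=64, voxel_size=0.5):
        self.n = grid_size
        self.res = voxel_size
        self.buckets = [[] for _ in range(grid_size ** 3)]

    def _flat_index(self, p):
        # Shift into the robocentric frame so the grid is centered on the robot.
        idx = np.floor(np.asarray(p) / self.res).astype(int) + self.n // 2
        if np.any(idx < 0) or np.any(idx >= self.n):
            return None  # outside the local map extent
        return (idx[0] * self.n + idx[1]) * self.n + idx[2]

    def insert(self, p):
        i = self._flat_index(p)
        if i is not None:
            self.buckets[i].append(np.asarray(p, float))

    def knn(self, q, k=5, radius_voxels=1):
        """Gather candidates only from the voxels around q, then rank them."""
        q = np.asarray(q, float)
        base = np.floor(q / self.res).astype(int) + self.n // 2
        cands = []
        r = radius_voxels
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    idx = base + np.array([dx, dy, dz])
                    if np.all(idx >= 0) and np.all(idx < self.n):
                        flat = (idx[0] * self.n + idx[1]) * self.n + idx[2]
                        cands.extend(self.buckets[flat])
        cands.sort(key=lambda p: np.linalg.norm(p - q))
        return cands[:k]
```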
A Probabilistic Transformation of Distance-Based Outliers
Muhr, David, Affenzeller, Michael, Küng, Josef
The scores of distance-based outlier detection methods are difficult to interpret, making it challenging to determine a cut-off threshold between normal and outlier data points without additional context. We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates. The transformation is ranking-stable and increases the contrast between normal and outlier data points. Determining distance relationships between data points is necessary to identify nearest-neighbor relationships in the data, yet most of the computed distances are typically discarded. We show that the distances to other data points can be used to model distance probability distributions and, subsequently, use the distributions to turn distance-based outlier scores into outlier probabilities. Our experiments show that the probabilistic transformation does not impact detection performance across numerous tabular and image benchmark datasets but results in interpretable outlier scores with increased contrast between normal and outlier samples. Our work generalizes to a wide range of distance-based outlier detection methods, and because existing distance computations are reused, it adds no significant computational overhead.
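One way to realize the described transformation, sketched below under our own assumptions: reuse the neighbor distances the detector already computed to fit a parametric distance distribution (a Gamma fit here, purely for illustration), then push each distance-based score through its CDF. Because the CDF is monotone, the ranking of scores is preserved, which matches the ranking-stability property claimed above.

```python
import numpy as np
from scipy import stats

def distance_scores_to_probabilities(neighbor_distances, scores):
    """Turn distance-based outlier scores into probabilities (sketch).

    neighbor_distances : array of the k-NN distances already computed by the
                         detector (the values that are "typically discarded")
    scores             : distance-based outlier scores, on the same scale
    """
    # Fit a distance distribution to the otherwise-discarded distances.
    shape, loc, scale = stats.gamma.fit(np.ravel(neighbor_distances), floc=0.0)
    # Monotone CDF mapping: same ranking, but outputs are probabilities,
    # so a cut-off such as 0.95 has a direct interpretation.
    return stats.gamma.cdf(scores, shape, loc=loc, scale=scale)
```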
Systematic Testing of the Data-Poisoning Robustness of KNN
Li, Yannan, Wang, Jingbo, Wang, Chao
Data poisoning aims to compromise a machine-learning-based software component by contaminating its training set to change its prediction results for test inputs. Existing methods for deciding data-poisoning robustness have either poor accuracy or long running times and, more importantly, they can only certify some of the truly robust cases but remain inconclusive when certification fails. In other words, they cannot falsify the truly non-robust cases. To overcome this limitation, we propose a systematic-testing-based method, which can falsify as well as certify data-poisoning robustness for a widely used supervised learning technique named k-nearest neighbors (KNN). Our method is faster and more accurate than the baseline enumeration method, owing to a novel over-approximate analysis in the abstract domain, which quickly narrows down the search space, combined with systematic testing in the concrete domain, which finds the actual violations. We have evaluated our method on a set of supervised learning datasets. Our results show that the method significantly outperforms state-of-the-art techniques and can decide the data-poisoning robustness of KNN prediction results for most of the test inputs.
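The baseline enumeration method the paper improves upon is easy to state under the common removal-based formulation: a KNN prediction is robust to p poisoned training points iff removing any subset of at most p points leaves the prediction unchanged. A brute-force sketch follows (exponential in p, which is exactly the cost the paper's abstract-domain analysis is designed to avoid); the formulation and names here are our reading, not the paper's code.

```python
from itertools import combinations
from collections import Counter

def knn_predict(train, x, k):
    """Plain KNN: majority label among the k nearest training points.
    `train` is a list of (feature_tuple, label) pairs."""
    nearest = sorted(train, key=lambda t: sum((a - b) ** 2
                                              for a, b in zip(t[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def poisoning_robust_by_enumeration(train, x, k, p):
    """Decide p-poisoning robustness by exhaustive removal enumeration."""
    base = knn_predict(train, x, k)
    for r in range(1, p + 1):
        for removed in combinations(range(len(train)), r):
            gone = set(removed)
            rest = [t for i, t in enumerate(train) if i not in gone]
            if knn_predict(rest, x, k) != base:
                return False, removed  # falsified: a concrete violation
    return True, None                  # certified by exhaustive enumeration
```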
Certifying the Fairness of KNN in the Presence of Dataset Bias
Li, Yannan, Wang, Jingbo, Wang, Chao
We propose a method for certifying the fairness of the classification result of a widely used supervised learning algorithm, the k-nearest neighbors (KNN), under the assumption that the training data may have historical bias caused by systematic mislabeling of samples from a protected minority group. To the best of our knowledge, this is the first certification method for KNN based on three variants of the fairness definition: individual fairness, $\epsilon$-fairness, and label-flipping fairness. We first define the fairness certification problem for KNN and then propose sound approximations of the complex arithmetic computations used in the state-of-the-art KNN algorithm. This is meant to lift the computation results from the concrete domain to an abstract domain, reducing the computational cost. We show the effectiveness of this abstract-interpretation-based technique through experimental evaluation on six datasets widely used in the fairness research literature. We also show that the method is accurate enough to obtain fairness certifications for a large number of test inputs, despite the presence of historical bias in the datasets.
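For intuition, individual fairness for a single test input can be checked concretely: the KNN prediction should not change when only the protected attribute of the input is altered. The sketch below performs this concrete check only; the paper's contribution is certifying such properties over abstract sets of biased training data, which this does not attempt. All names here are hypothetical.

```python
def knn_individually_fair(train, x, k, protected_idx, protected_values, predict):
    """Concrete individual-fairness check for one test input (sketch).

    train            : training set, e.g. a list of (features, label) pairs
    x                : test input as a mutable feature list
    protected_idx    : position of the protected attribute in x
    protected_values : all possible values of the protected attribute
    predict          : any KNN classifier with signature predict(train, x, k)
    """
    base = predict(train, x, k)
    for v in protected_values:
        flipped = list(x)
        flipped[protected_idx] = v  # alter only the protected attribute
        if predict(train, flipped, k) != base:
            return False  # prediction depends on the protected attribute
    return True
```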
SALC: Skeleton-Assisted Learning-Based Clustering for Time-Varying Indoor Localization
Hsiao, An-Hung, Shen, Li-Hsiang, Chang, Chen-Yi, Chiu, Chun-Jie, Feng, Kai-Ten
Wireless indoor localization has attracted a significant amount of attention in recent years. Using received signal strength (RSS) obtained from WiFi access points (APs) to establish a fingerprinting database is a widely utilized method in indoor localization. However, the time-variant problem for indoor positioning systems is not well investigated in the existing literature. Compared to conventional static fingerprinting, a dynamically reconstructed database can adapt to a highly changing environment, which sustains localization accuracy. To deal with the time-varying issue, we propose a skeleton-assisted learning-based clustering localization (SALC) system, including RSS-oriented map-assisted clustering (ROMAC), cluster-based online database establishment (CODE), and cluster-scaled location estimation (CsLE). The SALC scheme jointly considers similarities from the skeleton-based shortest path (SSP) and the time-varying RSS measurements across the reference points (RPs). ROMAC clusters RPs into different feature sets and thereby selects suitable monitor points (MPs) for enhancing location estimation. Moreover, the CODE algorithm aims to establish an adaptive fingerprint database to alleviate the time-varying problem. Finally, CsLE is adopted to acquire the target position by leveraging the benefits of clustering information and estimated signal variations in order to rescale the weights from the weighted k-nearest neighbors (WkNN) method. Both simulation and experimental results demonstrate that the proposed SALC system can effectively reconstruct the fingerprint database with enhanced location estimation accuracy, outperforming the other existing schemes in the open literature.
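The WkNN estimator that CsLE rescales is a standard building block and easy to sketch: find the k reference points whose stored fingerprints are closest to the online RSS measurement and average their coordinates with inverse-distance weights. SALC's clustering- and signal-variation-based rescaling of these weights is omitted here; the names are illustrative.

```python
import numpy as np

def wknn_locate(fingerprints, positions, rss, k=4, eps=1e-6):
    """Weighted k-nearest neighbors position estimate (sketch).

    fingerprints : (N, n_aps) stored RSS vectors for the reference points
    positions    : (N, 2) coordinates of the reference points
    rss          : (n_aps,) online RSS measurement
    """
    # Distance in signal space between the measurement and each fingerprint.
    dists = np.linalg.norm(np.asarray(fingerprints) - np.asarray(rss), axis=1)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights, normalized to sum to one.
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    # Weighted average of the k nearest reference-point coordinates.
    return weights @ np.asarray(positions)[nearest]
```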