Nearest Neighbor Methods
Nearest Neighbor Classifier with Margin Penalty for Active Learning
Cao, Yuan, Gao, Zhiqiao, Hu, Jie, Yang, Mingchuan, Chen, Jinpeng
As deep learning becomes mainstream in natural language processing, the need for suitable active learning methods is becoming unprecedentedly urgent. Active Learning (AL) methods based on nearest neighbor classifiers have been proposed and have demonstrated superior results. However, existing nearest neighbor classifiers are not suitable for classifying mutually exclusive classes because they cannot ensure inter-class discrepancy. As a result, informative samples in the margin area cannot be discovered and AL performance suffers. To this end, we propose a novel Nearest neighbor Classifier with Margin penalty for Active Learning (NCMAL). First, a mandatory margin penalty is added between classes, so that both inter-class discrepancy and intra-class compactness are assured. Second, a novel sample selection strategy is proposed to discover informative samples within the margin area. To demonstrate the effectiveness of the method, we conduct extensive experiments on four datasets against other state-of-the-art methods. The experimental results demonstrate that our method achieves better results with fewer annotated samples than all baseline methods.
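The idea of looking for informative samples in the margin area can be illustrated with a generic margin-based selection rule. The sketch below is not the NCMAL selection strategy or its margin penalty, which the abstract does not specify. It assumes a plain nearest neighbor classifier over labeled points and scores each unlabeled sample by the gap between its two closest classes. All function names are hypothetical.

```python
import math

def class_distances(x, labeled):
    """Distance from x to the nearest labeled sample of each class."""
    nearest = {}
    for features, label in labeled:
        d = math.dist(x, features)
        if label not in nearest or d < nearest[label]:
            nearest[label] = d
    return nearest

def margin_score(x, labeled):
    """Gap between the two closest classes (assumes at least two classes).

    A small gap means x lies close to the boundary between classes.
    """
    ds = sorted(class_distances(x, labeled).values())
    return ds[1] - ds[0]

def select_for_annotation(unlabeled, labeled, budget):
    """Pick the `budget` unlabeled samples with the smallest margin."""
    return sorted(unlabeled, key=lambda x: margin_score(x, labeled))[:budget]

# Toy data: two labeled points and three unlabeled candidates on a line.
labeled = [((0.0, 0.0), "a"), ((4.0, 0.0), "b")]
unlabeled = [(0.5, 0.0), (2.1, 0.0), (3.8, 0.0)]
picked = select_for_annotation(unlabeled, labeled, budget=1)
print(picked)
```

In this toy setup the candidate near the midpoint between the two classes is chosen, because its distances to both classes are almost equal.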
Generating Synthetic Data with The Nearest Neighbors Algorithm
The $k$ nearest neighbor algorithm ($k$NN) is one of the most popular nonparametric methods used for various purposes, such as treatment effect estimation, missing value imputation, classification, and clustering. The main advantage of $k$NN is its simplicity of hyperparameter optimization. It often produces favorable results with minimal effort. This paper proposes a generic semiparametric (or nonparametric if required) approach named Local Resampler (LR). LR utilizes $k$NN to create subsamples from the original sample and then generates synthetic values that are drawn from locally estimated distributions. LR can accurately create synthetic samples, even if the original sample has a non-convex distribution. Moreover, LR performs as well as or better than other popular synthetic data methods with minimal model optimization under parametric distributional assumptions.
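A minimal sketch of the subsample-then-draw idea described above follows. It is an illustration under stated assumptions, not the paper's Local Resampler: the local distribution is assumed to be an independent Gaussian per dimension, fitted to the $k$ nearest neighbors of a randomly chosen anchor point. The function names and the ring-shaped toy data are hypothetical.

```python
import math
import random
import statistics

def k_nearest(data, anchor, k):
    """Return the k points of `data` closest to `anchor` (anchor included)."""
    return sorted(data, key=lambda p: math.dist(p, anchor))[:k]

def local_resample(data, n_samples, k=5, seed=0):
    """Draw synthetic points from Gaussians fitted to local kNN subsamples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        anchor = rng.choice(data)
        neighbors = k_nearest(data, anchor, k)
        point = []
        for dim in zip(*neighbors):  # iterate over coordinates
            mu = statistics.fmean(dim)
            sigma = statistics.pstdev(dim)
            point.append(rng.gauss(mu, sigma))
        synthetic.append(tuple(point))
    return synthetic

# Non-convex toy sample: 200 points on the unit circle.
ring = [(math.cos(2 * math.pi * i / 200), math.sin(2 * math.pi * i / 200))
        for i in range(200)]
synthetic = local_resample(ring, n_samples=50, k=5)
```

Because each draw uses only a small neighborhood, the synthetic points stay near the ring. A single global Gaussian would place most of its mass in the empty center.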
Distance Based Image Classification: A solution to generative classification's conundrum?
Lin, Wen-Yan, Liu, Siying, Dai, Bing Tian, Li, Hongdong
Most classifiers rely on discriminative boundaries that separate instances of each class from everything else. We argue that discriminative boundaries are counter-intuitive as they define semantics by what-they-are-not, and should be replaced by generative classifiers which define semantics by what-they-are. Unfortunately, generative classifiers are significantly less accurate. This may be caused by the tendency of generative models to focus on easy-to-model semantic generative factors and ignore non-semantic factors that are important but difficult to model. We propose a new generative model in which semantic factors are accommodated by shell theory's hierarchical generative process and non-semantic factors by an instance-specific noise term. We use the model to develop a classification scheme which suppresses the impact of noise while preserving semantic cues. The result is a surprisingly accurate generative classifier that takes the form of a modified nearest-neighbor algorithm; we term it distance classification. Unlike discriminative classifiers, a distance classifier: defines semantics by what-they-are; is amenable to incremental updates; and scales well with the number of classes.
WGICP: Differentiable Weighted GICP-Based Lidar Odometry
Son, Sanghyun, Liang, Jing, Lin, Ming, Manocha, Dinesh
We present a novel differentiable weighted generalized iterative closest point (WGICP) method applicable to general 3D point cloud data, including that from Lidar. Our method builds on differentiable generalized ICP (GICP), and we propose using the differentiable K-Nearest Neighbor (KNN) algorithm to enhance differentiability. The differentiable GICP algorithm provides the gradient of output pose estimation with respect to each input point, which allows us to train a neural network to predict its importance, or weight, in estimating the correct pose. In contrast to the other ICP-based methods that use voxel-based downsampling or matching methods to reduce the computational cost, our method directly reduces the number of points used for GICP by only selecting those with the highest weights and ignoring redundant ones with lower weights. We show that our method improves both accuracy and speed of the GICP algorithm for the KITTI dataset and can be used to develop a more robust and efficient SLAM system.
KNN-Diffusion: Image Generation via Large-Scale Retrieval
Sheynin, Shelly, Ashual, Oron, Polyak, Adam, Singer, Uriel, Gafni, Oran, Nachmani, Eliya, Taigman, Yaniv
Figure 1: (a) Samples of stickers generated from text inputs, (b) Semantic text-guided manipulations applied to the "Original" image without using edit masks. In both cases, our model was trained without any text data.
Recent text-to-image models have achieved impressive results. However, since they require large-scale datasets of text-image pairs, it is impractical to train them on new domains where data is scarce or not labeled. In this work, we propose using large-scale retrieval methods, in particular, efficient k-Nearest-Neighbors (kNN), which offers novel capabilities: (1) training a substantially small and efficient text-to-image diffusion model without any text, (2) generating out-of-distribution images by simply swapping the retrieval database at inference time, and (3) performing text-driven local semantic manipulations while preserving object identity. To demonstrate the robustness of our method, we apply our kNN approach on two state-of-the-art diffusion backbones, and show results on several different datasets. As evaluated by human studies and automatic metrics, our method achieves state-of-the-art results compared to existing approaches that train text-to-image generation models using images only (without paired text data).
Large-scale generative models have been applied successfully to image generation tasks (Gafni et al., 2022; Ramesh et al., 2021; Nichol et al., 2021; Saharia et al., 2022; Yu et al., 2022), and have shown outstanding capabilities in extending human creativity using editing and user control. However, these models face several significant challenges: (i) Large-scale paired data requirement. To achieve high-quality results, text-to-image models rely heavily on large-scale datasets of (text, image) pairs collected from the internet. Due to the requirement of paired data, these models cannot be applied to new or customized domains with only unannotated images.
Training these models on highly complex distributions of natural images usually requires scaling the size of the model, data, batch-size, and training time, which makes them challenging to train and less accessible to the community.
Local Distance Preserving Auto-encoders using Continuous k-Nearest Neighbours Graphs
Chen, Nutan, van der Smagt, Patrick, Cseke, Botond
Auto-encoder models that preserve similarities in the data are a popular tool in representation learning. In this paper we introduce several auto-encoder models that preserve local distances when mapping from the data space to the latent space. We use a local distance-preserving loss that is based on the continuous k-nearest neighbours graph which is known to capture topological features at all scales simultaneously. To improve training performance, we formulate learning as a constraint optimisation problem with local distance preservation as the main objective and reconstruction accuracy as a constraint. Our method provides state-of-the-art or comparable performance across several standard datasets and evaluation metrics.
Auto-encoders and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014) are often used in machine learning to find meaningful latent representations of the data. What constitutes meaningful usually depends on the application and on the downstream tasks, for example, finding representations that have important factors of variations in the data (disentanglement) (Higgins et al., 2017; Chen et al., 2018), have high mutual information with the data (Chen et al., 2016), or show clustering behaviour w.r.t. These representations are usually incentivised by regularisers or architectural/structural choices. One criterion for finding a meaningful latent representation is geometric faithfulness to the data. This is important for data visualisation or further downstream tasks that involve geometric algorithms such as clustering or kNN classification. The data often lies in a small, sparse, low-dimensional manifold in the space it inhabits and finding a lower dimensional projection that is geometrically faithful to it can help not only in visualisation and interpretability but also in predictive performance and robustness (e.g.
PL-kNN: A Parameterless Nearest Neighbors Classifier
Jodas, Danilo Samuel, Passos, Leandro Aparecido, Adeel, Ahsan, Papa, João Paulo
Minimal parameter setup in machine learning models is desirable to avoid time-consuming optimization processes. The $k$-Nearest Neighbors classifier is one of the most effective and straightforward models employed in numerous problems. Despite its well-known performance, it requires the value of $k$ to be chosen for each specific data distribution, which demands expensive computational effort. This paper proposes a $k$-Nearest Neighbors classifier that bypasses the need to define the value of $k$. The model computes the $k$ value adaptively, considering the data distribution of the training set. We compared the proposed model against the standard $k$-Nearest Neighbors classifier and two parameterless versions from the literature. Experiments over 11 public datasets confirm the robustness of the proposed approach, as the obtained results were similar to or even better than those of its counterpart versions.
Machine learning-accelerated chemistry modeling of protoplanetary disks
Smirnov-Pinchukov, Grigorii V., Molyarova, Tamara, Semenov, Dmitry A., Akimkin, Vitaly V., van Terwisga, Sierk, Francheschi, Riccardo, Henning, Thomas
Aims. With the large amount of molecular emission data from (sub)millimeter observatories and incoming James Webb Space Telescope infrared spectroscopy, access to fast forward models of the chemical composition of protoplanetary disks is of paramount importance. Methods. We used a thermo-chemical modeling code to generate a diverse population of protoplanetary disk models. We trained a K-nearest neighbors (KNN) regressor to instantly predict the chemistry of other disk models. Results. We show that it is possible to accurately reproduce chemistry using just a small subset of physical conditions, thanks to correlations between the local physical conditions in adopted protoplanetary disk models. We discuss the uncertainties and limitations of this method. Conclusions. The proposed method can be used for Bayesian fitting of the line emission data to retrieve disk properties from observations. We present a pipeline for reproducing the same approach on other disk chemical model sets.
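The role of a KNN regressor as a fast surrogate for an expensive model can be sketched as follows. This is a generic illustration, not the authors' pipeline: the one-dimensional "physical condition" input, the two-component output, and the function name `knn_regress` are all hypothetical, and the prediction is an unweighted average over the $k$ nearest training points.

```python
import math

def knn_regress(train_x, train_y, query, k=3):
    """Predict a multi-output target as the mean over the k nearest neighbors."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    n_outputs = len(train_y[0])
    return [sum(train_y[i][j] for i in nearest) / len(nearest)
            for j in range(n_outputs)]

# Toy "precomputed models": one input feature, two output quantities.
train_x = [(float(i),) for i in range(10)]
train_y = [(2.0 * i, -1.0 * i) for i in range(10)]

pred = knn_regress(train_x, train_y, query=(4.2,), k=3)
print(pred)
```

Once the training set has been generated by the slow model, each prediction costs only a neighbor search, which is what makes this kind of surrogate attractive for repeated evaluation during fitting.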
A Novel Nearest Neighbors Algorithm Based on Power Muirhead Mean
Shahnazari, Kourosh, Ayyoubzadeh, Seyed Moein
The K-Nearest Neighbors (KNN) algorithm is one of the most widely used classifiers owing to its simplicity and performance. However, KNN does not work well when a dataset has many outliers or when it is small or unbalanced. This paper proposes a novel classifier based on K-Nearest Neighbors which calculates the local means of every class using the Power Muirhead Mean operator to overcome these issues. We call our new algorithm Power Muirhead Mean K-Nearest Neighbors (PMM-KNN). Finally, we used five well-known datasets to assess PMM-KNN performance. The results demonstrate that PMM-KNN outperformed three state-of-the-art classification methods in all experiments.
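The local-mean idea can be sketched with a plain arithmetic mean. Note the substitution: the code below does not implement the Power Muirhead Mean operator, whose definition the abstract does not give. It only shows the generic local-mean classification scheme, assuming the query is assigned to the class whose local mean (over its $k$ nearest members) is closest. The function name and toy data are hypothetical.

```python
import math

def local_mean_knn(train, query, k=3):
    """Classify `query` by the nearest per-class local mean.

    The arithmetic mean is used here as a stand-in; it is NOT the
    Power Muirhead Mean operator from the paper.
    """
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)

    best_label, best_dist = None, math.inf
    for label, points in by_class.items():
        nearest = sorted(points, key=lambda p: math.dist(p, query))[:k]
        local_mean = tuple(sum(c) / len(nearest) for c in zip(*nearest))
        d = math.dist(local_mean, query)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Class "b" has one outlier at (1.2, 1.2), sitting inside class "a" territory.
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
         ((1.2, 1.2), "b")]
label = local_mean_knn(train, query=(1.0, 1.0), k=3)
print(label)
```

With these toy points, a 1-nearest-neighbor vote would follow the outlier and answer "b", while averaging each class's nearest members dilutes the outlier and yields "a".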
Text Independent Speaker Identification System for Access Control
Even the human intelligence system fails to offer 100% accuracy in identifying speech from a specific individual. Machine intelligence tries to mimic humans in speaker identification problems through various approaches to speech feature extraction and speech modeling. This paper presents a text-independent speaker identification system that employs Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and k-Nearest Neighbor (kNN) for classification. The maximum cross-validation accuracy obtained was 60%. This will be improved upon in subsequent research.