knn search
mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search
Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that aims to identify and classify entities in text into predefined categories. However, when applied to Arabic data, NER encounters unique challenges stemming from the language's rich morphological inflections, absence of capitalization cues, and spelling variants, where a single word can comprise multiple morphemes. In this paper, we introduce Arabic KNN-NER, our submission to the Wojood NER Shared Task 2024 (ArabicNLP 2024). We participated in sub-task 1, Flat NER, which targets fine-grained flat-entity recognition for Arabic text: for each word, we identify a single main entity and zero or more sub-entities. Arabic KNN-NER augments the probability distribution of a fine-tuned model with a second label probability distribution derived from a KNN search over the cached training data. Our submission achieved 91% on the test set of the WojoodFine dataset, placing Arabic KNN-NER at the top of the leaderboard for the shared task.
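The augmentation step described in this abstract can be illustrated with a minimal sketch. It assumes a datastore of cached (vector, label) pairs, squared Euclidean distance, a softmax over negative distances, and a fixed mixing weight `lam`. None of these details come from the abstract, and the vectors and label names are toy values.

```python
import math

def knn_label_distribution(query, datastore, labels, k=3, temperature=1.0):
    """Turn the k nearest cached (vector, label) pairs into a label distribution."""
    # Squared Euclidean distance from the query to every cached training vector.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, vec)), lab)
        for vec, lab in datastore
    )[:k]
    # Closer neighbours receive larger weights (softmax over negative distances).
    weights = {lab: 0.0 for lab in labels}
    for dist, lab in scored:
        weights[lab] += math.exp(-dist / temperature)
    total = sum(weights.values())
    return {lab: w / total for lab, w in weights.items()}

def interpolate(model_probs, knn_probs, lam=0.5):
    """Mix the fine-tuned model's distribution with the kNN distribution."""
    return {lab: lam * knn_probs[lab] + (1 - lam) * model_probs[lab]
            for lab in model_probs}

# Toy data: hypothetical token representations and labels.
labels = ["O", "B-PERS", "B-LOC"]
datastore = [((0.0, 0.0), "O"), ((1.0, 1.0), "B-PERS"),
             ((0.9, 1.1), "B-PERS"), ((5.0, 5.0), "B-LOC")]
model_probs = {"O": 0.5, "B-PERS": 0.4, "B-LOC": 0.1}

knn_probs = knn_label_distribution((1.0, 1.0), datastore, labels, k=3)
final = interpolate(model_probs, knn_probs, lam=0.5)
predicted = max(final, key=final.get)
```

In this toy case the model alone slightly prefers `O`, while the retrieved neighbours shift the mixed distribution toward `B-PERS`.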
- North America > Mexico > Mexico City > Mexico City (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
Generating Diverse Translation with Perturbed kNN-MT
Nishida, Yuto, Morishita, Makoto, Kamigaito, Hidetaka, Watanabe, Taro
Generating multiple translation candidates would enable users to choose the one that satisfies their needs. Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem -- the model underestimates a prediction that is largely different from the training data, even if that prediction is likely. This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT). Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem. Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation's magnitude.
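As a rough illustration of how a perturbation can widen the kNN search space, the sketch below adds Gaussian noise to the query vector before retrieval. This is only one plausible form of perturbation and is an assumption here, since the abstract does not say how the perturbation is applied. The datastore, tokens, and the `sigma` parameter are hypothetical.

```python
import random

def knn_tokens(query, datastore, k):
    """Return the target tokens of the k datastore entries closest to the query."""
    ranked = sorted(
        datastore,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return [token for _, token in ranked[:k]]

def perturbed_knn_tokens(query, datastore, k, sigma, rng):
    """Assumed perturbation: add zero-mean Gaussian noise to the query, then search."""
    noisy = [q + rng.gauss(0.0, sigma) for q in query]
    return knn_tokens(noisy, datastore, k)

# Toy datastore of (decoder-state vector, target token) pairs.
datastore = [((0.0, 0.0), "house"), ((0.1, 0.0), "home"),
             ((1.0, 1.0), "building"), ((1.2, 0.9), "residence"),
             ((3.0, 3.0), "castle")]
query = (0.0, 0.1)
rng = random.Random(0)

plain = set(knn_tokens(query, datastore, k=2))
diverse = set()
for _ in range(50):
    diverse.update(perturbed_knn_tokens(query, datastore, 2, sigma=1.0, rng=rng))
```

A larger `sigma` lets more distant tokens enter the candidate set, and `sigma=0` reduces to ordinary kNN retrieval.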
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Tuscany > Florence (0.04)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.55)
i-Octree: A Fast, Lightweight, and Dynamic Octree for Proximity Search
Zhu, Jun, Li, Hongyi, Wang, Shengjie, Wang, Zhepeng, Zhang, Tao
Establishing the correspondences between newly acquired points and historically accumulated data (i.e., map) through nearest neighbors search is crucial in numerous robotic applications. However, static tree data structures are inadequate to handle large and dynamically growing maps in real-time. To address this issue, we present the i-Octree, a dynamic octree data structure that supports both fast nearest neighbor search and real-time dynamic updates, such as point insertion, deletion, and on-tree down-sampling. The i-Octree is built upon a leaf-based octree and has two key features: a local spatially continuous storing strategy that allows for fast access to points while minimizing memory usage, and local on-tree updates that significantly reduce computation time compared to existing static or dynamic tree structures. The experiments show that i-Octree surpasses state-of-the-art methods by reducing run-time by over 50% on real-world open datasets.
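For readers unfamiliar with the underlying data structure, here is a minimal sketch of a plain leaf-based octree with incremental insertion and nearest neighbor search. It does not reproduce i-Octree's contiguous storage, deletion, or on-tree down-sampling. The class name, leaf capacity, and the assumption that all points lie inside the root cube are illustrative choices.

```python
class Octree:
    """Cubic cell centred at `center` with half-width `half`; leaves hold points."""

    def __init__(self, center, half, capacity=4):
        self.center, self.half, self.capacity = center, half, capacity
        self.points = []      # points stored while this node is a leaf
        self.children = None  # list of 8 child cells once subdivided

    def _child_index(self, p):
        # One bit per axis, set when the point lies on the upper side of the centre.
        return sum(1 << i for i in range(3) if p[i] >= self.center[i])

    def insert(self, p):
        """Insert a point (assumed to lie inside the root cube) without rebuilding."""
        if self.children is not None:
            self.children[self._child_index(p)].insert(p)
            return
        self.points.append(p)
        # Split an over-full leaf; the size guard stops endless splits on duplicates.
        if len(self.points) > self.capacity and self.half > 1e-6:
            h = self.half / 2
            self.children = [
                Octree(tuple(c + (h if (i >> a) & 1 else -h)
                             for a, c in enumerate(self.center)), h, self.capacity)
                for i in range(8)
            ]
            pts, self.points = self.points, []
            for q in pts:
                self.children[self._child_index(q)].insert(q)

    def _box_dist2(self, q):
        # Squared distance from q to this cell's cube (0 if q is inside it).
        return sum(max(abs(q[i] - self.center[i]) - self.half, 0.0) ** 2
                   for i in range(3))

    def nearest(self, q, best=(float("inf"), None)):
        """Return (squared distance, point) of the stored point nearest to q."""
        if self._box_dist2(q) >= best[0]:
            return best  # nothing in this cell can beat the current best
        if self.children is None:
            for p in self.points:
                d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                if d2 < best[0]:
                    best = (d2, p)
            return best
        # Visit closer cells first so that farther ones are pruned early.
        for child in sorted(self.children, key=lambda c: c._box_dist2(q)):
            best = child.nearest(q, best)
        return best

pts = [(((i * 37) % 100) / 10.0, ((i * 53) % 100) / 10.0, ((i * 71) % 100) / 10.0)
       for i in range(60)]
tree = Octree((5.0, 5.0, 5.0), 5.0)
for p in pts:
    tree.insert(p)  # incremental update
query = (3.3, 7.1, 1.2)
d2, nearest_point = tree.nearest(query)
```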
- North America > United States > New York (0.04)
- North America > United States > Michigan (0.04)
LIO-PPF: Fast LiDAR-Inertial Odometry via Incremental Plane Pre-Fitting and Skeleton Tracking
Chen, Xingyu, Wu, Peixi, Li, Ge, Li, Thomas H.
As a crucial infrastructure of intelligent mobile robots, LiDAR-Inertial odometry (LIO) provides the basic capability of state estimation by tracking LiDAR scans. High-accuracy tracking generally involves kNN search, which is used when minimizing the point-to-plane distance. The cost for this, however, is maintaining a large local map and performing a kNN plane fit for each point. In this work, we reduce both the time and space complexity of LIO by saving these unnecessary costs. Technically, we design a plane pre-fitting (PPF) pipeline to track the basic skeleton of the 3D scene. In PPF, planes are not fitted individually for each scan, let alone for each point, but are updated incrementally as the scene 'flows'. Unlike kNN, the PPF is more robust to noisy and non-strict planes thanks to our iterative Principal Component Analysis (iPCA) refinement. Moreover, a simple yet effective sandwich layer is introduced to eliminate false point-to-plane matches. Our method was extensively tested on a total of 22 sequences across 5 open datasets, and evaluated in 3 existing state-of-the-art LIO systems. By contrast, LIO-PPF can consume only 36% of the original local map size and achieve up to 4x faster residual computing and 1.92x overall FPS, while maintaining the same level of accuracy. We fully open source our implementation at https://github.com/xingyuuchen/LIO-PPF.
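The per-point kNN plane fit that this abstract identifies as the costly baseline can be sketched in its smallest form. The sketch takes the three nearest map points, builds the plane through them, and evaluates the signed point-to-plane residual. Using exactly three neighbors and a cross-product normal is a simplifying assumption, since practical systems usually fit a plane to more neighbors by least squares. This is not the PPF pipeline itself.

```python
import math

def nearest_k(points, query, k):
    """Brute-force kNN: the k map points closest to the query."""
    return sorted(points,
                  key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))[:k]

def plane_from_three(p0, p1, p2):
    """Unit normal n and offset d with n . x + d = 0 on the plane through 3 points."""
    u = [b - a for a, b in zip(p0, p1)]
    v = [b - a for a, b in zip(p0, p2)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]  # cross product u x v
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0.0:
        raise ValueError("neighbours are collinear; no unique plane")
    n = [c / norm for c in n]
    d = -sum(c * x for c, x in zip(n, p0))
    return n, d

def point_to_plane_residual(point, n, d):
    """Signed distance from the point to the plane."""
    return sum(c * x for c, x in zip(n, point)) + d

# Toy local map lying on the plane z = 0, and a scan point 0.5 above it.
local_map = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 0.0)]
scan_point = (0.2, 0.2, 0.5)
n, d = plane_from_three(*nearest_k(local_map, scan_point, 3))
residual = point_to_plane_residual(scan_point, n, d)
```

Repeating this search and fit for every scan point is the cost that the abstract says PPF avoids by updating planes incrementally.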
- North America > United States > Michigan (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
Non-parametric, Nearest-neighbor-assisted Fine-tuning for Neural Machine Translation
Wang, Jiayi, Wang, Ke, Zhang, Yuqi, Zhao, Yu, Stenetorp, Pontus
Non-parametric, k-nearest-neighbor algorithms have recently made inroads to assist generative models such as language models and machine translation decoders. We explore whether such non-parametric models can improve machine translation models at the fine-tuning stage by incorporating statistics from the kNN predictions to inform the gradient updates for a baseline translation model. There are multiple methods which could be used to incorporate kNN statistics and we investigate gradient scaling by a gating mechanism, the kNN's ground truth probability, and reinforcement learning. For four standard in-domain machine translation datasets, compared with classic fine-tuning, we report consistent improvements of all of the three methods by as much as 1.45 BLEU and 1.28 BLEU for German-English and English-German translations respectively. Through qualitative analysis, we found particular improvements when it comes to translating grammatical relations or function words, which results in increased fluency of our model.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
Why do Nearest Neighbor Language Models Work?
Xu, Frank F., Alon, Uri, Neubig, Graham
Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. Recently, however, retrieval-augmented LMs have been shown to improve over standard neural LMs by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically k-nearest neighbor language models (kNN-LMs), perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform a careful analysis of the various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate these insights into the model architecture or the training procedure of the standard parametric LM, improving its results without the need for an explicit retrieval component.
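The role of the softmax temperature in the kNN distribution can be seen in a small sketch. It assumes the retrieved neighbors arrive as (distance, next token) pairs and that the distribution is a softmax over negative distances divided by a temperature. The distances and tokens are toy values.

```python
import math

def knn_distribution(neighbors, temperature=1.0):
    """neighbors: (distance, next_token) pairs returned by a kNN search."""
    # Shift by the smallest distance before exponentiating, for numerical stability.
    d_min = min(d for d, _ in neighbors)
    weights = {}
    for d, token in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-(d - d_min) / temperature)
    z = sum(weights.values())
    return {token: w / z for token, w in weights.items()}

neighbors = [(1.0, "cat"), (2.0, "dog"), (4.0, "cat")]
sharp = knn_distribution(neighbors, temperature=0.5)   # concentrates on the closest hit
flat = knn_distribution(neighbors, temperature=10.0)   # approaches neighbour counting
```

With a low temperature almost all mass goes to the token of the closest neighbor. With a high temperature each neighbor counts roughly equally, so the distribution approaches simple vote counting.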
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Middle East > Jordan (0.04)
Deterministic Iteratively Built KD-Tree with KNN Search for Exact Applications
Naim, Aryan, Bowkett, Joseph, Karumanchi, Sisir, Tavallali, Peyman, Kennedy, Brett
K-Nearest Neighbors (KNN) search is a fundamental algorithm in artificial intelligence software, with applications in robotics and autonomous vehicles. These wide-ranging applications utilize KNN either directly for simple classification or combine KNN results as input to other algorithms such as Locally Weighted Learning (LWL). Similar to binary trees, kd-trees become unbalanced as new data is added in online applications, which can lead to rapid degradation in search performance unless the tree is rebuilt. Although approximate methods are suitable for graphics applications, which prioritize query speed over query accuracy, they are unsuitable for certain applications in autonomous systems, aeronautics, and robotic manipulation where exact solutions are desired. In this paper, we assess the performance of non-recursive deterministic kd-tree functions and KNN functions. We also present a "forest of interval kd-trees", which reduces the number of tree rebuilds without compromising the exactness of query results.
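A non-recursive kd-tree with exact KNN search can be sketched as follows, using explicit stacks for both construction and querying. The node layout, median split, and function names are assumptions for illustration. The sketch requires a non-empty point set and does not implement the paper's forest of interval kd-trees.

```python
import heapq

def build_kdtree(points):
    """Build a balanced kd-tree iteratively; returns (nodes, root index)."""
    dim = len(points[0])
    nodes = []  # each node: [point, split axis, left index, right index]
    root = -1
    # Work items: (points of the subtree, depth, parent index, child slot in parent).
    stack = [(list(points), 0, -1, 0)]
    while stack:
        pts, depth, parent, slot = stack.pop()
        if not pts:
            continue
        axis = depth % dim
        pts.sort(key=lambda p: p[axis])
        mid = len(pts) // 2
        idx = len(nodes)
        nodes.append([pts[mid], axis, -1, -1])
        if parent < 0:
            root = idx
        else:
            nodes[parent][slot] = idx
        stack.append((pts[:mid], depth + 1, idx, 2))      # left child slot
        stack.append((pts[mid + 1:], depth + 1, idx, 3))  # right child slot
    return nodes, root

def knn_search(nodes, root, query, k):
    """Exact KNN with an explicit stack; returns sorted (squared distance, point)."""
    best = []  # max-heap emulated with negated squared distances
    # Each entry carries a lower bound on the squared distance to that subtree.
    stack = [(root, 0.0)]
    while stack:
        idx, bound = stack.pop()
        if idx < 0:
            continue
        if len(best) == k and bound >= -best[0][0]:
            continue  # the subtree cannot contain a closer point
        point, axis, left, right = nodes[idx]
        d2 = sum((a - b) ** 2 for a, b in zip(point, query))
        if len(best) < k:
            heapq.heappush(best, (-d2, point))
        elif d2 < -best[0][0]:
            heapq.heapreplace(best, (-d2, point))
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        stack.append((far, diff * diff))  # checked against the heap when popped
        stack.append((near, 0.0))         # near side is explored first
    return sorted((-neg, p) for neg, p in best)

pts = [((i * 7) % 11, (i * 5) % 13) for i in range(40)]
nodes, root = build_kdtree(pts)
query = (4.5, 6.2)
result = knn_search(nodes, root, query, 5)
```

Because the far subtree's bound is tested only when it is popped, after the near subtree has been searched, pruning uses the most recent k-th best distance while the results stay exact.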
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States > Massachusetts (0.04)
K-nearest Neighbor Search by Random Projection Forests
Yan, Donghui, Wang, Yingjie, Wang, Jin, Wang, Honggang, Li, Zhenpeng
K-nearest neighbor (kNN) search refers to the problem of finding the K points closest to a given data point under a distance metric of interest. It is an important task in a wide range of applications, including similarity search in data mining [15, 19], fast kernel methods in machine learning [17, 30, 38], nonparametric density estimation [5, 29, 31] and intrinsic dimension estimation [6, 26] in statistics, as well as anomaly detection algorithms [2, 10, 37]. Numerous algorithms have been proposed for kNN search; readers are referred to [35, 46] and references therein. Our interest is kNN search in emerging applications. Two salient features of such applications are the expected scalability of the algorithms and their ability to handle data of high dimensionality. Additionally, such applications often desire more accurate kNN search. For example, robotic route planning [23] and face-based surveillance systems [34] require high accuracy for the robust execution of tasks. However, most existing work on kNN search [1, 4, 12, 15] has focused mainly on fast computation, and accuracy is of less concern.
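The general idea behind a random projection forest for kNN search can be sketched briefly. Several trees are built by recursively splitting the data along random directions, a query is routed to one leaf per tree, and the union of those leaves is ranked exactly. The median split, Gaussian directions, leaf size, and function names are assumptions, not details taken from this excerpt.

```python
import random

def build_rp_tree(points, indices, leaf_size, rng):
    """Recursively split point indices along random directions."""
    if len(indices) <= leaf_size:
        return indices  # leaf: list of point indices
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = sorted((sum(d * x for d, x in zip(direction, points[i])), i)
                  for i in indices)
    mid = len(proj) // 2
    threshold = proj[mid][0]  # assumed split rule: median of the projections
    left = [i for _, i in proj[:mid]]
    right = [i for _, i in proj[mid:]]
    return (direction, threshold,
            build_rp_tree(points, left, leaf_size, rng),
            build_rp_tree(points, right, leaf_size, rng))

def leaf_for(tree, query):
    """Route a query down one tree to the leaf it falls into."""
    while isinstance(tree, tuple):
        direction, threshold, left, right = tree
        value = sum(d * x for d, x in zip(direction, query))
        tree = left if value < threshold else right
    return tree

def rp_forest_knn(points, forest, query, k):
    """Approximate kNN: exact ranking of the union of the query's leaves."""
    candidates = set()
    for tree in forest:
        candidates.update(leaf_for(tree, query))
    ranked = sorted(
        candidates,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query)),
    )
    return ranked[:k]

rng = random.Random(42)
points = [tuple(rng.uniform(-1.0, 1.0) for _ in range(5)) for _ in range(200)]
forest = [build_rp_tree(points, list(range(len(points))), 8, rng)
          for _ in range(10)]
query = points[17]
result = rp_forest_knn(points, forest, query, 3)
```

Adding trees enlarges the candidate set, which trades extra distance computations for a better chance of including the true neighbors.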
- North America > United States > Massachusetts > Bristol County > Dartmouth (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Wisconsin (0.04)
Compact Multi-Label Learning
Shen, Xiaobo (Nanyang Technological University) | Liu, Weiwei (The University of New South Wales) | Tsang, Ivor W. (University of Technology Sydney) | Sun, Quan-Sen (Nanjing University of Science and Technology) | Ong, Yew-Soon (Nanyang Technological University)
Embedding methods have shown promising performance in multi-label prediction, as they can discover the dependency of labels. However, most embedding methods cannot align the input and output well, which degrades prediction performance. In addition, they incur high computational costs at prediction time when applied to large-scale datasets. To address these issues, this paper proposes a Co-Hashing (CoH) method by formulating multi-label learning from the perspective of cross-view learning. CoH first regards the input and output as two views, and then aims to learn a common latent Hamming space, where input and output pairs are compressed into compact binary embeddings. CoH enjoys two key benefits: 1) the input and output can be well aligned, and their correlations are explored; 2) the prediction is very efficient using fast cross-view kNN search in the Hamming space. Moreover, we provide the generalization error bound for our method. Extensive experiments on eight real-world datasets demonstrate the superiority of the proposed CoH over the state-of-the-art methods in terms of both prediction accuracy and efficiency.
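kNN search in Hamming space is fast because the distance between two binary codes reduces to an XOR followed by a bit count. The sketch below shows only that search step. It does not cover how CoH learns the codes, and the 8-bit codes, label sets, and function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

def hamming_knn(query_code, codes, k):
    """Indices of the k stored codes closest to the query (ties broken by index)."""
    ranked = sorted(range(len(codes)),
                    key=lambda i: (hamming(query_code, codes[i]), i))
    return ranked[:k]

# Hypothetical 8-bit codes of training outputs and their label sets.
output_codes = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
label_sets = [{"sports"}, {"sports", "news"}, {"music"}, {"news"}]

query_code = 0b10110110  # hypothetical code of a test input
top = hamming_knn(query_code, output_codes, k=2)
predicted_labels = set().union(*(label_sets[i] for i in top))
```

Here the two nearest codes differ from the query in one and two bits, so the prediction is taken from their label sets.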
- Oceania > Australia > New South Wales (0.04)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Data Science (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Operationalizing Data Science Models on the Pivotal Stack
At Pivotal Data Science, our primary charter is to help our customers derive value from their data assets, whether by reducing cost or by increasing revenue through better products and services. When we are not working on customer engagements, we engage in R&D using our wide array of products. For instance, we may contribute a new module to PDLTools or MADlib, our distributed in-database machine learning libraries; we might build end-to-end demos; or we might experiment with new technology and blog about it. Last quarter, we set out to explore data science microservices for operationalizing our models for real-time scoring. Microservices have been among the most talked-about topics at many cloud conferences of late. They've gained a large following among application developers, solution architects, data scientists, and engineers alike.