 Information Retrieval


Semantic Search and Recommendation Algorithm

arXiv.org Artificial Intelligence

This paper details the development of a novel semantic search algorithm that uses Word2Vec and an Annoy index to efficiently process and retrieve information from large datasets. Addressing the limitations of traditional search algorithms, our proposed method demonstrates significant improvements in speed, accuracy, and scalability, validated by rigorous testing on datasets of up to 100GB. In the era of big data, efficiently retrieving relevant information from vast, unstructured datasets is crucial across numerous domains such as e-commerce, healthcare, research, and public administration. Traditional search engines, which rely primarily on keyword matching, often struggle with the inherent complexity and ambiguity of natural language. These systems lack the ability to understand the semantic meaning and context of queries, leading to inaccurate results and suboptimal user experiences. The evolution of semantic search technologies aims to address these limitations by focusing on understanding the in high-dimensional space.
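The retrieval idea in this abstract can be illustrated with a minimal sketch. It assumes toy hand-written word vectors in place of trained Word2Vec embeddings and a brute-force cosine scan in place of an Annoy index; all names and values (`WORD_VECTORS`, `embed`, `search`) are hypothetical and are not taken from the paper.

```python
import math

# Toy 3-dimensional "word vectors" (hypothetical values; a real system would
# load trained Word2Vec embeddings instead).
WORD_VECTORS = {
    "cheap": [0.9, 0.1, 0.0],
    "affordable": [0.8, 0.2, 0.0],
    "laptop": [0.1, 0.9, 0.1],
    "notebook": [0.2, 0.8, 0.1],
    "doctor": [0.0, 0.1, 0.9],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents, top_k=1):
    """Brute-force scan; an approximate index would replace this loop at scale."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc in documents:
        d = embed(doc)
        if d is not None:
            scored.append((cosine(q, d), doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]


docs = ["cheap laptop", "doctor"]
print(search("affordable notebook", docs))
```

The query shares no keyword with the document it retrieves, which is the behavior keyword matching cannot provide.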


Fuzzy Norm-Explicit Product Quantization for Recommender Systems

arXiv.org Artificial Intelligence

As data resources grow, providing recommendations that best meet user demands has become a vital requirement in business and life to overcome the information overload problem. However, building a system that suggests relevant recommendations has always been a point of debate. One of the most cost-efficient techniques for producing relevant recommendations at low complexity is Product Quantization (PQ). PQ approaches have continued developing in recent years. The crucial challenge is improving product quantization performance in terms of recall without increasing its complexity. This makes the algorithm suitable for problems that require a greater number of potentially relevant items without disregarding others, at high speed and low cost to keep up with traffic. This is the case for online shops, where recommendations matching the customer's purpose are important, although customers may also be open to browsing other products. This research proposes a fuzzy approach to norm-based product quantization. Type-2 Fuzzy sets (T2FSs) define the codebook, allowing sub-vectors (T2FSs) to be associated with more than one element of the codebook, and the norm is then computed by means of integration. Our method raises recall, making the algorithm suitable for problems that require retrieving as many potentially relevant items as possible without disregarding others. The proposed method outperforms all PQ approaches, such as NEQ, PQ, and RQ, by up to +6%, +5%, and +8%, achieving a recall of 94%, 69%, and 59% on the Netflix, Audio, and Cifar60k datasets, respectively. Moreover, computing time and complexity nearly equal those of the most computationally efficient existing PQ method in the state of the art.
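For readers unfamiliar with the baseline, the following is a minimal sketch of plain Product Quantization, not of the fuzzy norm-explicit variant the abstract proposes. It assumes fixed, hand-written codebooks (a real system would learn them, for example with k-means) and squared Euclidean distance; `CODEBOOKS` and all function names are hypothetical.

```python
def split(vector, num_subspaces):
    """Cut a vector into equally sized contiguous sub-vectors."""
    size = len(vector) // num_subspaces
    return [vector[i * size:(i + 1) * size] for i in range(num_subspaces)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    subs = split(vector, len(codebooks))
    return [
        min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
        for sub, cb in zip(subs, codebooks)
    ]


def asymmetric_distance(query, code, codebooks):
    """Approximate distance between a raw query and an encoded vector."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, cb[idx]) for sub, cb, idx in zip(subs, codebooks, code))


# Hypothetical codebooks: two subspaces with two 2-dimensional codewords each.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],  # subspace 0
    [[0.0, 1.0], [1.0, 0.0]],  # subspace 1
]

code = encode([0.9, 1.1, 0.1, 0.8], CODEBOOKS)
print(code, asymmetric_distance([1.0, 1.0, 0.0, 1.0], code, CODEBOOKS))
```

Each stored vector shrinks to one small integer per subspace, which is where the low storage and query cost of PQ comes from; in this hard-assignment version every sub-vector maps to exactly one codeword, whereas the abstract describes sub-vectors associated with more than one.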


CardOOD: Robust Query-driven Cardinality Estimation under Out-of-Distribution

arXiv.org Artificial Intelligence

Query-driven learned estimators are accurate, flexible, and lightweight alternatives to traditional estimators in query optimization. However, existing query-driven approaches struggle with the Out-of-distribution (OOD) problem, where the test workload distribution differs from the training workload, leading to performance degradation. In this paper, we present CardOOD, a general learning framework designed to construct robust query-driven cardinality estimators that are resilient against the OOD problem. Our framework focuses on offline training algorithms that develop one-off models from a static workload, suitable for model initialization and periodic retraining. In CardOOD, we extend classical transfer/robust learning techniques to train query-driven cardinality estimators, and the algorithms fall into three categories: representation learning, data manipulation, and new learning strategies. As these learning techniques are originally evaluated in computer vision tasks, we also propose a new learning algorithm that exploits the property of cardinality estimation. This algorithm, lying in the category of new learning strategy, models the partial order constraint of cardinalities by a self-supervised learning task. Comprehensive experimental studies demonstrate the efficacy of the algorithms of CardOOD in mitigating the OOD problem to varying extents. We further integrate CardOOD into PostgreSQL, showcasing its practical utility in query optimization.
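The partial order constraint mentioned in this abstract can be illustrated with a small sketch: if every predicate of one query is implied by a more restrictive query, the more restrictive query cannot return more rows. The code below is an assumption-laden illustration, not CardOOD's training algorithm. Queries are modeled as dictionaries of column ranges, and `contains`, `order_violation`, and both toy estimators are hypothetical.

```python
def contains(outer, inner):
    """True if every range predicate of `outer` is implied by `inner`."""
    return all(
        col in inner
        and outer[col][0] <= inner[col][0]
        and inner[col][1] <= outer[col][1]
        for col in outer
    )


def order_violation(estimate, outer, inner):
    """Hinge-style penalty: positive only when the estimator ranks the
    more restrictive query above the less restrictive one."""
    if not contains(outer, inner):
        return 0.0
    return max(0.0, estimate(inner) - estimate(outer))


def consistent_estimator(query):
    """Toy estimator: 1000 rows scaled by each range's share of [0, 100]."""
    rows = 1000.0
    for low, high in query.values():
        rows *= (high - low) / 100.0
    return rows


def inconsistent_estimator(query):
    """Toy estimator that wrongly grows with the number of predicates."""
    return 10.0 * len(query)


outer = {"age": (0, 100)}
inner = {"age": (20, 30), "salary": (0, 50)}
print(order_violation(consistent_estimator, outer, inner))
print(order_violation(inconsistent_estimator, outer, inner))
```

No true cardinalities are needed to compute the penalty, which is one way such a constraint could serve as a self-supervised signal.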


Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings

arXiv.org Artificial Intelligence

Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings. We then performed phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present

While some metrics cover clinical entities and their relations [9, 11], generally scoring metrics do not explicitly capture the textual mention differences in the anatomy, laterality and severity. Further, phrasal grounding of the findings in terms of anatomical localization in images is not exploited in the quality scoring. In this paper, we propose a metric that captures both fine-grained textual descriptions of findings as well as their phrasal grounding information in terms of anatomical locations in images. We present results that compare this evaluation metric to other textual metrics on a gold standard dataset derived from MIMIC collection of chest X-rays and validated reports, to show its robustness and sensitivity to factual errors.


A Survey of Large Language Model-Based Generative AI for Text-to-SQL: Benchmarks, Applications, Use Cases, and Challenges

arXiv.org Artificial Intelligence

Text-to-SQL systems facilitate smooth interaction with databases by translating natural language queries into Structured Query Language (SQL), bridging the gap between non-technical users and complex database management systems. This survey provides a comprehensive overview of the evolution of AI-driven text-to-SQL systems, highlighting their foundational components, advancements in large language model (LLM) architectures, and the critical role of datasets such as Spider, WikiSQL, and CoSQL in driving progress. We examine the applications of text-to-SQL in domains like healthcare, education, and finance, emphasizing their transformative potential for improving data accessibility. Additionally, we analyze persistent challenges, including domain generalization, query optimization, support for multi-turn conversational interactions, and the limited availability of datasets tailored for NoSQL databases and dynamic real-world scenarios. To address these challenges, we outline future research directions, such as extending text-to-SQL capabilities to support NoSQL databases, designing datasets for dynamic multi-turn interactions, and optimizing systems for real-world scalability and robustness. By surveying current advancements and identifying key gaps, this paper aims to guide the next generation of research and applications in LLM-based text-to-SQL systems.


Semantic Retrieval at Walmart

arXiv.org Artificial Intelligence

In product search, the retrieval of candidate products before re-ranking is more critical and challenging than in other kinds of search, such as web search, especially for tail queries, which have a complex and specific search intent. In this paper, we present a hybrid system for e-commerce search deployed at Walmart that combines a traditional inverted index and embedding-based neural retrieval to better answer user tail queries. Our system significantly improved the relevance of the search engine, measured by both offline and online evaluations. The improvements were achieved through a combination of different approaches. We present a new technique to train the neural model at scale and describe how the system was deployed in production with little impact on response time. We highlight multiple learnings and practical tricks that were used in the deployment of this system.
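A minimal sketch of hybrid candidate retrieval follows. The abstract does not say how the two candidate sets are merged, so the plain set union used here is an assumption, as are the toy product catalog, the hand-written embeddings, and every function name.

```python
from collections import defaultdict

# Hypothetical catalog and 2-dimensional product embeddings.
PRODUCTS = {
    "p1": "red running shoes",
    "p2": "blue trail sneakers",
    "p3": "stainless steel kettle",
}
PRODUCT_VECTORS = {"p1": [0.9, 0.1], "p2": [0.8, 0.2], "p3": [0.1, 0.9]}


def build_inverted_index(products):
    """Map each title token to the set of product ids containing it."""
    index = defaultdict(set)
    for pid, title in products.items():
        for token in title.split():
            index[token].add(pid)
    return index


def lexical_candidates(query, index):
    """Products sharing at least one token with the query."""
    hits = set()
    for token in query.split():
        hits |= index.get(token, set())
    return hits


def embedding_candidates(query_vector, vectors, top_k):
    """Top-k products by dot product with the query embedding."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(vectors, key=lambda pid: dot(query_vector, vectors[pid]), reverse=True)
    return set(ranked[:top_k])


def hybrid_candidates(query, query_vector, index, vectors, top_k=2):
    """Union of both candidate sets, to be passed on to a re-ranker."""
    return lexical_candidates(query, index) | embedding_candidates(query_vector, vectors, top_k)


index = build_inverted_index(PRODUCTS)
# [1.0, 0.0] stands in for the output of a query encoder.
print(sorted(hybrid_candidates("running shoes", [1.0, 0.0], index, PRODUCT_VECTORS)))
```

In this toy run the embedding side contributes a product that shares no token with the query, which the inverted index alone would miss.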


Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure

arXiv.org Artificial Intelligence

Record linkage, also known as entity resolution, aims at identifying different representations of the same real-world entity, such as a person. It is a crucial step in many data integration tasks in order to combine multiple data sources, allowing enhanced data analysis. Typically, unique record identifiers that would enable a join-like operation are not available. Therefore, records are compared pairwise based on their identifying attributes, such as first name, last name and date of birth, and classified as match or non-match. However, record linkage may potentially harm the privacy of individuals by combining information that can be used against their interests. As a consequence, conducting such a linkage is subject to many legal and organizational constraints [CRS20]. Privacy-preserving record linkage (PPRL) methods aim to enable such linkages without sharing sensitive plaintext information between the data owners or with a third party. To protect the identifying data, the data owners encode it before sending it to an independent linkage unit, which performs the matching on the encoded data only. A variety of such perturbation-based encoding techniques have been proposed, but the most popular, a quasi-standard, is based on Bloom filters [Gk21].
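The Bloom-filter encoding mentioned at the end of this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheme of the cited works: names are split into padded character bigrams, each bigram is hashed with a keyed HMAC-SHA256 under a secret shared by the data owners, and encodings are compared with the Dice coefficient. The parameter values and function names are hypothetical.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of the lowercased value, padded with underscores."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, num_bits=1024, num_hashes=2):
    """Return the set of bit positions that would be set in the Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            message = f"{seed}:{gram}".encode()
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-secret"  # hypothetical key known only to the data owners
smith = bloom_encode("Smith", secret)
smyth = bloom_encode("Smyth", secret)
jones = bloom_encode("Jones", secret)
print(dice(smith, smyth), dice(smith, jones))
```

The linkage unit only sees bit positions, yet similar spellings still produce overlapping encodings, so approximate matching remains possible on the encoded data.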


AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer

arXiv.org Artificial Intelligence

This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.
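The character error rate quoted in this abstract can be made concrete with a short sketch. It assumes the common definition, total Levenshtein edit distance divided by total reference length; the paper may compute the metric differently, and the name pairs below are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions, references):
    """Total edits divided by total reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Hypothetical romanizations: one exact match, one with a missing letter.
predicted = ["somchai", "sirporn"]
reference = ["somchai", "siriporn"]
print(character_error_rate(predicted, reference))
```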


HERO: Hint-Based Efficient and Reliable Query Optimizer

arXiv.org Artificial Intelligence

We propose a novel model for learned query optimization which provides query hints leading to better execution plans. The model addresses the three key challenges in learned hint-based query optimization: reliable hint recommendation (ensuring non-degradation of query latency), efficient hint exploration, and fast inference. We provide an in-depth analysis of existing NN-based approaches to hint-based optimization and experimentally confirm the named challenges for them. Our alternative solution consists of a new inference schema based on an ensemble of context-aware models and a graph storage for reliable hint suggestion and fast inference, and a budget-controlled training procedure with a local search algorithm that solves the issue of exponential search space exploration. In experiments on standard benchmarks, our model demonstrates optimization capability close to the best achievable with coarse-grained hints. Controlling the degree of parallelism (query dop) in addition to operator-related hints enables our model to achieve a 3x latency improvement on the JOB benchmark, which sets a new standard for optimization. Our model is interpretable and easy to debug, which is particularly important for deployment in production.


Automated Test-Case Generation for REST APIs Using Model Inference Search Heuristic

arXiv.org Artificial Intelligence

The rising popularity of the microservice architectural style has led to a growing demand for automated testing approaches tailored to these systems. EvoMaster is a state-of-the-art tool that uses Evolutionary Algorithms (EAs) to automatically generate test cases for microservices' REST APIs. One limitation of these EAs is the use of unit-level search heuristics, such as branch distances, which focus on fine-grained code coverage and may not effectively capture the complex, interconnected behaviors characteristic of system-level testing. To address this limitation, we propose a new search heuristic (MISH) that uses real-time automaton learning to guide the test case generation process. We capture the sequential call patterns exhibited by a test case by learning an automaton from the stream of log events emitted by different microservices within the same system. MISH thereby learns a representation of the system-wide behavior, allowing us to define the fitness of a test case based on the path it traverses within the inferred automaton. We empirically evaluate MISH's effectiveness on six real-world benchmark microservice applications and compare it against a state-of-the-art technique, MOSA, for testing REST APIs. Our evaluation shows promising results for using MISH to guide the automated test case generation within EvoMaster.