Goto

Collaborating Authors

 Information Retrieval


LLM-based IR-system for Bank Supervisors

arXiv.org Artificial Intelligence

Bank supervisors face the complex task of ensuring that new measures are consistently aligned with historical precedents. To address this challenge, we introduce a novel Information Retrieval (IR) System tailored to assist supervisors in drafting both consistent and effective measures. This system ingests findings from on-site investigations. It then retrieves the most relevant historical findings and their associated measures from a comprehensive database, providing a solid basis for supervisors to write well-informed measures for new findings. Utilizing a blend of lexical, semantic, and Capital Requirements Regulation (CRR) fuzzy set matching techniques, the IR system ensures the retrieval of findings that closely align with current cases. The performance of this system, particularly in scenarios with partially labeled data, is validated through a Monte Carlo methodology, showcasing its robustness and accuracy. Enhanced by a Transformer-based Denoising AutoEncoder for fine-tuning, the final model achieves a Mean Average Precision (MAP@100) of 0.83 and a Mean Reciprocal Rank (MRR@100) of 0.92. These scores surpass those of both standalone lexical models such as BM25 and semantic BERT-like models.


A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination

arXiv.org Artificial Intelligence

Objective: This review aims to explore the potential and challenges of using Natural Language Processing (NLP) to detect, correct, and mitigate medically inaccurate information, including errors, misinformation, and hallucination. By unifying these concepts, the review emphasizes their shared methodological foundations and their distinct implications for healthcare. Our goal is to advance patient safety, improve public health communication, and support the development of more reliable and transparent NLP applications in healthcare. Methods: A scoping review was conducted following PRISMA guidelines, analyzing studies from 2020 to 2024 across five databases. Studies were selected based on their use of NLP to address medically inaccurate information and were categorized by topic, tasks, document types, datasets, models, and evaluation metrics. Results: NLP has shown potential in addressing medically inaccurate information on the following tasks: (1) error detection (2) error correction (3) misinformation detection (4) misinformation correction (5) hallucination detection (6) hallucination mitigation. However, challenges remain with data privacy, context dependency, and evaluation standards. Conclusion: This review highlights the advancements in applying NLP to tackle medically inaccurate information while underscoring the need to address persistent challenges. Future efforts should focus on developing real-world datasets, refining contextual methods, and improving hallucination management to ensure reliable and transparent healthcare applications.


HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark

arXiv.org Artificial Intelligence

Current benchmarks for Hebrew Natural Language Processing (NLP) focus mainly on morpho-syntactic tasks, neglecting the semantic dimension of language understanding. To bridge this gap, we set out to deliver a Hebrew Machine Reading Comprehension (MRC) dataset, where MRC is to be realized as extractive Question Answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor: the indeterminacy and non-transparency of span boundaries in morphologically complex forms lead to annotation inconsistencies, disagreements, and flaws in standard evaluation metrics. To remedy this, we devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics that are suitable for the morphologically rich nature of the language. Our resulting benchmark, HeQ (Hebrew QA), features 30,147 diverse question-answer pairs derived from both Hebrew Wikipedia articles and Israeli tech news. Our empirical investigation reveals that standard evaluation metrics such as F1 scores and Exact Match (EM) are not appropriate for Hebrew (and other MRLs), and we propose a relevant enhancement. In addition, our experiments show low correlation between models' performance on morpho-syntactic tasks and on MRC, which suggests that models designed for the former might underperform on semantics-heavy tasks. The development and exploration of HeQ illustrate some of the challenges MRLs pose in natural language understanding (NLU), fostering progression towards more and better NLU models for Hebrew and other MRLs.


Segment First, Retrieve Better: Realistic Legal Search via Rhetorical Role-Based Queries

arXiv.org Artificial Intelligence

Legal precedent retrieval is a cornerstone of the common law system, governed by the principle of stare decisis, which demands consistency in judicial decisions. However, the growing complexity and volume of legal documents challenge traditional retrieval methods. TraceRetriever mirrors real-world legal search by operating with limited case information, extracting only rhetorically significant segments instead of requiring complete documents. Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion before final re-ranking. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments. Evaluated on IL-PCR and COLIEE 2025 datasets, TraceRetriever addresses growing document volume challenges while aligning with practical search constraints, reliable and scalable foundation for precedent retrieval enhancing legal research when only partial case knowledge is available.


AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor Towards Scaling Workloads

arXiv.org Artificial Intelligence

Efficiently selecting indexes is fundamental to database performance optimization, particularly for systems handling large-scale analytical workloads. While deep reinforcement learning (DRL) has shown promise in automating index selection through its ability to learn from experience, few works address how these RL-based index advisors can adapt to scaling workloads due to exponentially growing action spaces and heavy trial and error. To address these challenges, we introduce AutoIndexer, a framework that combines workload compression, query optimization, and specialized RL models to scale index selection effectively. By operating on compressed workloads, AutoIndexer substantially lowers search complexity without sacrificing much index quality. Extensive evaluations show that it reduces end-to-end query execution time by up to 95% versus non-indexed baselines. On average, it outperforms state-of-the-art RL-based index advisors by approximately 20% in workload cost savings while cutting tuning time by over 50%. These results affirm AutoIndexer's practicality for large and diverse workloads.


RAVine: Reality-Aligned Evaluation for Agentic Search

arXiv.org Artificial Intelligence

Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines model's interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.


Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

arXiv.org Artificial Intelligence

ABSTRACT Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still un-derrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions.


A Fisher's exact test justification of the TF-IDF term-weighting scheme

arXiv.org Artificial Intelligence

Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher's exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher's exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme's long-established effectiveness.


InsurTech innovation using natural language processing

arXiv.org Machine Learning

InsurTech refers to the use of state-of-the-art technology, including both emerging hardware and software, to address inefficiencies across the insurance value chain and further explore new opportunities to reshape traditional business operations. InsurTech encompasses a broad spectrum of technology-driven innovations, including, but not limited to, telematics, usage-based insurance, and the integration of Internet of Things (IoT) sensors. In this study, we focus on a specific class of InsurTech, an Insurtech data vendor, that provides insurance companies with next-generation data solutions. We leverage new and diverse external data sources, such as social media data and online content, to enrich the internal database, thereby empowering actuarial analytics and gaining more accurate insights into risk profiles and policyholder behavior. Specifically, by integrating alternative data sources beyond traditional information, insurance companies can uncover previously unrecognized risk factors, reduce bias in existing features, and identify more accurate risk exposures based on the operational characteristics of the insured entities.


Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents

arXiv.org Artificial Intelligence

Enterprise environments contain a heterogeneous, rapidly growing collection of internal artifacts related to code, data, and many different tools. Critical information for assessing privacy risk and ensuring regulatory compliance is often embedded across these varied resources, each with their own arcane discovery and extraction techniques. Therefore, large-scale privacy compliance in adherence to governmental regulations requires systems to discern the interconnected nature of diverse artifacts in a common, shared universe. We present Privacy Artifact ConnecT or (PACT), an embeddings-driven graph that links millions of artifacts spanning multiple artifact types generated by a variety of teams and projects. Powered by the state-of-the-art DRAGON embedding model, PACT uses a contrastive learning objective with light fine-tuning to link artifacts via their textual components such as raw metadata, ownership specifics, and compliance context. Experimental results show that PACT's fine-tuned model improves recall@1 from 18% to 53%, the query match rate from 9.6% to 69.7% when paired with a baseline AI agent, and the hitrate@1 from 25.7% to 44.9% for candidate selection in a standard recommender system.