information retrieval system
A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
Rayo, Jhon, de la Rosa, Raul, Garrido, Mario
Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model and methodology, we aim to advance the development of robust natural language processing tools for compliance-driven applications in regulatory domains.
Leveraging Translation For Optimal Recall: Tailoring LLM Personalization With User Profiles
Ravichandran, Karthik, Gomasta, Sarmistha Sarna
This paper explores a novel technique for improving recall in cross-language information retrieval (CLIR) systems using iterative query refinement grounded in the user's lexical-semantic space. The proposed methodology combines multi-level translation, semantic embedding-based expansion, and user profile-centered augmentation to address the challenge of matching variance between user queries and relevant documents. Through an initial BM25 retrieval, translation into intermediate languages, embedding lookup of similar terms, and iterative re-ranking, the technique aims to expand the scope of potentially relevant results personalized to the individual user. Comparative experiments on news and Twitter datasets demonstrate superior performance over baseline BM25 ranking for the proposed approach across ROUGE metrics. The translation methodology also showed maintained semantic accuracy through the multi-step process. This personalized CLIR framework paves the path for improved context-aware retrieval attentive to the nuances of user language.
Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps
The paper presents a methodology for uncovering knowledge gaps on the internet using the Retrieval Augmented Generation (RAG) model. By simulating user search behaviour, the RAG system identifies and addresses gaps in information retrieval systems. The study demonstrates the effectiveness of the RAG system in generating relevant suggestions with a consistent accuracy of 93%. The methodology can be applied in various fields such as scientific discovery, educational enhancement, research development, market analysis, search engine optimisation, and content development. The results highlight the value of identifying and understanding knowledge gaps to guide future endeavours.
A Brief History of Prompt: Leveraging Language Models. (Through Advanced Prompting)
This paper presents a comprehensive exploration of the evolution of prompt engineering and generation in the field of natural language processing (NLP). Starting from the early language models and information retrieval systems, we trace the key developments that have shaped prompt engineering over the years. The introduction of attention mechanisms in 2015 revolutionized language understanding, leading to advancements in controllability and context-awareness. Subsequent breakthroughs in reinforcement learning techniques further enhanced prompt engineering, addressing issues like exposure bias and biases in generated text. We examine the significant contributions in 2018 and 2019, focusing on fine-tuning strategies, control codes, and template-based generation. The paper also discusses the growing importance of fairness, human-AI collaboration, and low-resource adaptation. In 2020 and 2021, contextual prompting and transfer learning gained prominence, while 2022 and 2023 witnessed the emergence of advanced techniques like unsupervised pre-training and novel reward shaping. Throughout the paper, we reference specific research studies that exemplify the impact of various developments on prompt engineering. The journey of prompt engineering continues, with ethical considerations being paramount for the responsible and inclusive future of AI systems.
Testing different Log Bases For Vector Model Weighting Technique
Information retrieval systems retrieves relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we are going to test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in data information retrieval systems.
User Simulation for Evaluating Information Access Systems
Balog, Krisztian, Zhai, ChengXiang
Information access systems, such as search engines, recommender systems, and conversational assistants, have become integral to our daily lives as they help us satisfy our information needs. However, evaluating the effectiveness of these systems presents a long-standing and complex scientific challenge. This challenge is rooted in the difficulty of assessing a system's overall effectiveness in assisting users to complete tasks through interactive support, and further exacerbated by the substantial variation in user behaviour and preferences. To address this challenge, user simulation emerges as a promising solution. This book focuses on providing a thorough understanding of user simulation techniques designed specifically for evaluation purposes. We begin with a background of information access system evaluation and explore the diverse applications of user simulation. Subsequently, we systematically review the major research progress in user simulation, covering both general frameworks for designing user simulators, utilizing user simulation for evaluation, and specific models and algorithms for simulating user interactions with search engines, recommender systems, and conversational assistants. Realizing that user simulation is an interdisciplinary research topic, whenever possible, we attempt to establish connections with related fields, including machine learning, dialogue systems, user modeling, and economics. We end the book with a detailed discussion of important future research directions, many of which extend beyond the evaluation of information access systems and are expected to have broader impact on how to evaluate interactive intelligent systems in general.
Alloprof: a new French question-answer education dataset and its use in an information retrieval case study
Lefebvre-Brossard, Antoine, Gazaille, Stephane, Desmarais, Michel C.
Teachers and students are increasingly relying on online learning resources to supplement the ones provided in school. This increase in the breadth and depth of available resources is a great thing for students, but only provided they are able to find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets have been in English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. This dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and the explanations verified both for their appropriateness and their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on a multimodal comprehension. The case study we present as a baseline shows an approach that relies on recent techniques provides an acceptable performance level, but more work is necessary before it can reliably be used and trusted in a production setting.
Search engines and AI will make each other better
"Is the U.S. flag still on the moon?" I asked ChatGPT, the viral AI chatbot. "The United States flag is no longer on the moon," responded ChatGPT. It went on to explain why the flag couldn't survive the harsh conditions of the lunar surface, and confidently claimed that the last time an American flag was planted on the moon was during the Apollo 11 mission in 1969, after which no humans returned to the moon. In reality, though the United States' lunar flags have been gradually disintegrating, five still stand today -- except the one from Apollo 11, which was knocked over by the engine exhaust as the spacecraft lifted off.
UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language
Sharipov, Maksud, Yuldashov, Ollabergan
In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.