AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Building Russian Benchmark for Evaluation of Information Retrieval Models

Kovalev, Grigory, Tikhomirov, Mikhail, Kozhevnikov, Evgeny, Kornilov, Max, Loukachevitch, Natalia

arXiv.org Artificial IntelligenceApr-18-2025

We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval.

artificial intelligence, information retrieval, natural language, (13 more...)

arXiv.org Artificial Intelligence

2504.12879

Country:

Europe > Russia (0.15)
Asia > Russia (0.15)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Zong, Yuxuan, Piwowarski, Benjamin

arXiv.org Artificial IntelligenceApr-18-2025

Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This however doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses, that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in and out-domain), showing that we can preserve ColBERT's performance while using only 30\% of the tokens.

information retrieval, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3726302.3730100

2504.12778

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.48)

Add feedback

Document Quality Scoring for Web Crawling

Pezzuti, Francesca, Mueller, Ariane, MacAvaney, Sean, Tonellotto, Nicola

arXiv.org Artificial IntelligenceApr-16-2025

The internet contains large amounts of low-quality content, yet users expect web search engines to deliver high-quality, relevant results. The abundant presence of low-quality pages can negatively impact retrieval and crawling processes by wasting resources on these documents. Therefore, search engines can greatly benefit from techniques that leverage efficient quality estimation methods to mitigate these negative impacts. Quality scoring methods for web pages are useful for many processes typical for web search systems, including static index pruning, index tiering, and crawling. Building on work by Chang et al.~\cite{chang2024neural}, who proposed using neural estimators of semantic quality for static index pruning, we extend their approach and apply their neural quality scorers to assess the semantic quality of web pages in crawling prioritisation tasks. In our experimental analysis, we found that prioritising semantically high-quality pages over low-quality ones can improve downstream search effectiveness. Our software contribution consists of a Docker container that computes an effective quality score for a given web page, allowing the quality scorer to be easily included and used in other components of web search systems.

artificial intelligence, information retrieval, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.11011

Country: Europe > Italy (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)

Add feedback

Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

Rizvi, F. A., Navojith, T., Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, Dharshana, Abeywardhana, Lakmini

arXiv.org Artificial IntelligenceApr-16-2025

Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, when mix with low resource languages such as Sinhala-English and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, and it results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

information retrieval, large language model, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2504.10679

Country: Asia > Sri Lanka (0.17)

Genre: Research Report (0.70)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MURR: Model Updating with Regularized Replay for Searching a Document Stream

Yang, Eugene, Tonellotto, Nicola, Lawrie, Dawn, MacAvaney, Sean, Mayfield, James, Oard, Douglas W., Miller, Scott

arXiv.org Artificial IntelligenceApr-15-2025

The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously-encoded documents. Re-encoding and re-indexing all previously-processed documents can be costly. In this work, we explore updating a neural dual encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.

information retrieval, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-88708-6_6

2504.1025

Country:

North America > United States > Maryland (0.46)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.96)

Add feedback

Composable NLP Workflows for BERT-based Ranking and QA System

Kumar, Gaurav, Dandu, Murali Mohana Krishna

arXiv.org Artificial IntelligenceApr-15-2025

There has been a lot of progress towards building NLP models that scale to multiple tasks. However, real-world systems contain multiple components and it is tedious to handle cross-task interaction with varying levels of text granularity. In this work, we built an end-to-end Ranking and Question-Answering (QA) system using Forte, a toolkit that makes composable NLP pipelines. We utilized state-of-the-art deep learning models such as BERT, RoBERTa in our pipeline, evaluated the performance on MS-MARCO and Covid-19 datasets using metrics such as BLUE, MRR, F1 and compared the results of ranking and QA systems with their corresponding benchmark results. The modular nature of our pipeline and low latency of reranker makes it easy to build complex NLP applications easily.

information retrieval, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.09398

Genre: Research Report (0.40)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.69)
Health & Medicine > Therapeutic Area > Immunology (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)

Add feedback

Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

Ye, Haotian, Jain, Himanshu, You, Chong, Suresh, Ananda Theertha, Lin, Haowei, Zou, James, Yu, Felix

arXiv.org Artificial IntelligenceApr-15-2025

In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.

information retrieval, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2504.09135

Genre: Research Report (1.00)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)

Add feedback

II-NVM: Enhancing Map Accuracy and Consistency with Normal Vector-Assisted Mapping

Zhao, Chengwei, Li, Yixuan, Jian, Yina, Xu, Jie, Wang, Linji, Ma, Yongxin, Jin, Xinglai

arXiv.org Artificial IntelligenceApr-14-2025

SLAM technology plays a crucial role in indoor mapping and localization. A common challenge in indoor environments is the "double-sided mapping issue", where closely positioned walls, doors, and other surfaces are mistakenly identified as a single plane, significantly hindering map accuracy and consistency. To address this issue this paper introduces a SLAM approach that ensures accurate mapping using normal vector consistency. We enhance the voxel map structure to store both point cloud data and normal vector information, enabling the system to evaluate consistency during nearest neighbor searches and map updates. This process distinguishes between the front and back sides of surfaces, preventing incorrect point-to-plane constraints. Moreover, we implement an adaptive radius KD-tree search method that dynamically adjusts the search radius based on the local density of the point cloud, thereby enhancing the accuracy of normal vector calculations. To further improve realtime performance and storage efficiency, we incorporate a Least Recently Used (LRU) cache strategy, which facilitates efficient incremental updates of the voxel map. The code is released as open-source and validated in both simulated environments and real indoor scenarios. Experimental results demonstrate that this approach effectively resolves the "double-sided mapping issue" and significantly improves mapping precision. Additionally, we have developed and open-sourced the first simulation and real world dataset specifically tailored for the "double-sided mapping issue".

artificial intelligence, information retrieval, natural language, (14 more...)

arXiv.org Artificial Intelligence

2504.08204

Country:

Asia > China (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.55)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.54)

Add feedback

EqualizeIR: Mitigating Linguistic Biases in Retrieval Models

Cheng, Jiali, Amiri, Hadi

arXiv.org Artificial IntelligenceApr-11-2025

This study finds that existing information retrieval (IR) models show significant biases based on the linguistic complexity of input queries, performing well on linguistically simpler (or more complex) queries while underperforming on linguistically more complex (or simpler) queries. To address this issue, we propose EqualizeIR, a framework to mitigate linguistic biases in IR models. EqualizeIR uses a linguistically biased weak learner to capture linguistic biases in IR datasets and then trains a robust model by regularizing and refining its predictions using the biased weak learner. This approach effectively prevents the robust model from overfitting to specific linguistic patterns in data. We propose four approaches for developing linguistically-biased models. Extensive experiments on several datasets show that our method reduces performance disparities across linguistically simple and complex queries, while improving overall retrieval performance.

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2504.07115

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.38)

Add feedback

Low Rank Learning for Offline Query Optimization

Yi, Zixuan, Tian, Yao, Ives, Zachary G., Marcus, Ryan

arXiv.org Artificial IntelligenceApr-10-2025

Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce \textsc{LimeQO}, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduces a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing ``hot'' path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS. The code is available in \href{https://github.com/zixy17/LimeQO}{https://github.com/zixy17/LimeQO}.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3725412

2504.06399

Country: North America > United States > California (0.28)

Genre: Research Report (0.82)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback