AITopics | Information Retrieval


Information Retrieval

Our accustomed systems for retrieving particular pieces of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age when the information being sought is held in widely different formats, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.


A Cascaded Architecture for Extractive Summarization of Multimedia Content via Audio-to-Text Alignment

Hossain, Tanzir, Islam, Ar-Rafi, Hossain, Md. Sabbir, Rasel, Annajiat Alim

arXiv.org Artificial Intelligence, Apr-10-2025

This study presents a cascaded architecture for extractive summarization of multimedia content via audio-to-text alignment. The proposed framework addresses the challenge of extracting key insights from multimedia sources like YouTube videos. It integrates audio-to-text conversion using Microsoft Azure Speech with advanced extractive summarization models, including Whisper, Pegasus, and Facebook BART XSum. The system employs tools such as Pytube, Pydub, and SpeechRecognition for content retrieval, audio extraction, and transcription. Linguistic analysis is enhanced through named entity recognition and semantic role labeling. Evaluation using ROUGE and F1 scores demonstrates that the cascaded architecture outperforms conventional summarization methods, despite challenges like transcription errors. Future improvements may include model fine-tuning and real-time processing. This study contributes to multimedia summarization by improving information retrieval, accessibility, and user experience.
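The abstract reports ROUGE and F1 scores without giving the scoring details. As a rough illustration only, the sketch below computes a ROUGE-1 style F1 (clipped unigram overlap) between a candidate summary and a reference. The function name `rouge1_f1` and the whitespace tokenization are assumptions made for this example, not the authors' evaluation code.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.

    Assumption: lowercased whitespace tokens, no stemming or stopword removal.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of six reference tokens are matched by a two-token candidate.
    print(rouge1_f1("the cat", "the cat sat on the mat"))
```

A transcription error in the audio-to-text stage lowers this score directly, because a misrecognized word no longer matches its reference token.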

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.06275

Country: Asia > Bangladesh (0.15)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)


RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections

Rosa, Kevin Dela

arXiv.org Artificial Intelligence, Apr-10-2025

We present RAVEN (Recognition and Adaptation of Video ENtities), an adaptive AI agent framework designed for multimodal entity discovery and retrieval in large-scale video collections. Synthesizing information across visual, audio, and textual modalities, RAVEN autonomously processes video data to produce structured, actionable representations for downstream tasks. Key contributions include (1) a category understanding step to infer video themes and general-purpose entities, (2) a schema generation mechanism that dynamically defines domain-specific entities and attributes, and (3) a rich entity extraction process that leverages semantic retrieval and schema-guided prompting. RAVEN is designed to be model-agnostic, allowing the integration of different vision-language models (VLMs) and large language models (LLMs) based on application-specific requirements. This flexibility supports diverse applications in personalized search, content discovery, and scalable information retrieval, enabling practical applications across vast datasets.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2504.06272

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.90)


MicroNN: An On-device Disk-resident Updatable Vector Database

Pound, Jeffrey, Chabert, Floris, Bhushan, Arjun, Goswami, Ankur, Pacaci, Anil, Chowdhury, Shihabur Rahman

arXiv.org Artificial Intelligence, Apr-9-2025

Nearest neighbour search over dense vector collections has important applications in information retrieval, retrieval augmented generation (RAG), and content ranking. Performing efficient search over large vector collections is a well-studied problem with many existing approaches and open source implementations. However, most state-of-the-art systems target scenarios with large servers and an abundance of memory, static vector collections that are not updatable, and nearest neighbour search in isolation from other search criteria. We present Micro Nearest Neighbour (MicroNN), an embedded nearest-neighbour vector search engine designed for scalable similarity search in low-resource environments. MicroNN addresses the problem of on-device vector search for real-world workloads containing updates and hybrid search queries that combine nearest neighbour search with structured attribute filters. In this scenario, memory is highly constrained and disk-efficient index structures and algorithms are required, as well as support for continuous inserts and deletes. MicroNN is an embeddable library that can scale to large vector collections with minimal resources. MicroNN is used in production and powers a wide range of vector search use-cases on-device. MicroNN takes less than 7 ms to retrieve the top-100 nearest neighbours with 90% recall on a publicly available million-scale vector benchmark while using ~10 MB of memory.
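To make the idea of a hybrid query concrete, here is a minimal sketch that combines a structured attribute filter with exact nearest-neighbour search. It is a brute-force, in-memory illustration and is not MicroNN's disk-resident index or its API. The function `hybrid_search`, the tuple layout, and the `lang` attribute are all hypothetical names chosen for this example.

```python
import heapq
import math


def hybrid_search(items, query, k, predicate):
    """Return ids of the k items nearest to `query` among those passing `predicate`.

    items: iterable of (item_id, vector, attributes) tuples (assumed layout).
    predicate: function from the attributes dict to bool (the structured filter).
    """
    # Filter first, then rank the survivors by Euclidean distance.
    candidates = (
        (math.dist(vector, query), item_id)
        for item_id, vector, attrs in items
        if predicate(attrs)
    )
    return [item_id for _, item_id in heapq.nsmallest(k, candidates)]


if __name__ == "__main__":
    collection = [
        ("a", (0.0, 0.0), {"lang": "en"}),
        ("b", (1.0, 0.0), {"lang": "fr"}),
        ("c", (2.0, 0.0), {"lang": "en"}),
        ("d", (0.5, 0.0), {"lang": "en"}),
    ]
    # "b" is closest to the query but is excluded by the attribute filter.
    print(hybrid_search(collection, (0.9, 0.0), 2, lambda a: a["lang"] == "en"))
```

This scan touches every vector, so its cost grows linearly with the collection. Avoiding that cost under tight memory limits is the problem the abstract says MicroNN's index structures are designed for.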

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2504.05573

Country: North America > United States > New York (0.14)

Genre: Research Report (0.50)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)


Dr Web: a modern, query-based web data retrieval engine

Prifti, Ylli, Provetti, Alessandro, de Meo, Pasquale

arXiv.org Artificial Intelligence, Apr-9-2025

Counters are generally in the form of users, number of pages, number of websites, number of tweets, etc. In reality, determining the memory size of the internet is a non-trivial task. The situation becomes more challenging if we consider the deep web, which is usually estimated to be much larger than the visible web. Although the memory size of the internet cannot be determined exactly, the number is bound to be large and ever-growing. This amount of data presents unprecedented opportunities for data mining and information extraction from the web, as shown by the number of scientific papers and research projects based on web data. However, the web is unstructured. Previous attempts to apply a machine-readable structure [1] to the web have failed to become large-scale standards.

data mining, engine, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.05311

Country: North America > United States (0.68)

Genre: Research Report (0.50)

Industry: Information Technology > Services (0.69)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Web (1.00)
Information Technology > Communications > Social Media (1.00)
(2 more...)


Opioid Named Entity Recognition (ONER-2025) from Reddit

Ahmad, Muhammad, Farid, Humaira, Ameer, Iqra, Amjad, Maaz, Muzamil, Muhammad, Hamza, Ameer, Jalal, Muhammad, Batyrshin, Ildar, Sidorov, Grigori

arXiv.org Artificial Intelligence, Apr-5-2025

The opioid overdose epidemic remains a critical public health crisis, particularly in the United States, leading to significant mortality and societal costs. Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. Our research makes four key contributions. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. This dataset contains 331,285 tokens and includes eight major opioid entity categories. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented sentences, and emotionally charged language, in opioid discussions. Fourth, we propose a real-time monitoring system to process streaming data from social media, healthcare records, and emergency services to identify overdose events. Using 5-fold cross-validation in 11 experiments, our system integrates machine learning, deep learning, and transformer-based language models with advanced contextual embeddings to enhance understanding. Our transformer-based models (bert-base-NER and roberta-base) achieved 97% accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2504.00027

Country:

North America > Mexico > Mexico City > Mexico City (0.04)
South America (0.04)
North America > United States > Texas > Lubbock County > Lubbock (0.04)
(6 more...)

Genre: Research Report > New Finding (0.69)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Addiction Disorder (1.00)
Health & Medicine > Public Health (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)


Microsoft releases its own AI search engine, called Copilot Search

PCWorld, Apr-3-2025, 18:12:39 GMT

Artificial intelligence has basically taken over and replaced traditional web search engines. You've already seen it with AI overviews in Google Search, followed up with OpenAI going the way of SearchGPT. Even alternative search engines like DuckDuckGo are starting to incorporate AI into their platforms, and things aren't slowing down. Well, now we can add another to the pile: Microsoft just released Copilot Search, which is sort of like an AI-infused Bing Search. It takes in data from sources all over the web, then uses Copilot's AI powers to synthesize a summary for you.

copilot search, own ai search engine, search engine, (4 more...)

PCWorld

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.96)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.42)


Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models

Xu, Yilong, Gao, Jinhua, Yu, Xiaoming, Xue, Yuanhai, Bi, Baolong, Shen, Huawei, Cheng, Xueqi

arXiv.org Artificial Intelligence, Apr-1-2025

Retrieval-Augmented Language Models (RALMs) boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors: multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility for better task generalization. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.
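The abstract does not specify which perturbation-based attribution method SCARLet uses. One simple member of that family is leave-one-out ablation: remove each passage in turn and measure how much a downstream score drops. The sketch below illustrates only that general idea. Both `leave_one_out_utility` and the stand-in scorer `toy_score` are hypothetical, and a real system would score with the language model's task performance instead.

```python
def leave_one_out_utility(passages, score_fn):
    """Estimate each passage's utility as the score drop when it is removed.

    score_fn: maps a list of passages to a task score (assumed higher is better).
    """
    full = score_fn(passages)
    utilities = []
    for i in range(len(passages)):
        ablated = passages[:i] + passages[i + 1:]  # context without passage i
        utilities.append(full - score_fn(ablated))
    return utilities


def toy_score(passages):
    """Hypothetical stand-in scorer: fraction of needed facts found in the context."""
    needed = ("paris", "1889")
    text = " ".join(passages).lower()
    return sum(fact in text for fact in needed) / len(needed)


if __name__ == "__main__":
    context = [
        "The tower is in Paris.",
        "It opened in 1889.",
        "Bananas are yellow.",
    ]
    # The off-topic third passage contributes nothing to the score.
    print(leave_one_out_utility(context, toy_score))  # [0.5, 0.5, 0.0]
```

Because every passage is scored against the same shared context, the estimate depends on what the other passages already supply. Two redundant passages would each receive a utility of zero under this scheme.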

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2504.00573

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > Dominican Republic (0.04)
(7 more...)

Genre:

Overview (0.68)
Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)


Uncovering the Limitations of Query Performance Prediction: Failures, Insights, and Implications for Selective Query Processing

Chifu, Adrian-Gabriel, Déjean, Sébastien, Mothe, Josiane, Garouani, Moncef, Ortiz, Diego, Ullah, Md Zia

arXiv.org Artificial Intelligence, Apr-1-2025

Query Performance Prediction (QPP) estimates a retrieval system's effectiveness for a given query, offering valuable insights for search effectiveness and query processing. Despite extensive research, QPPs face critical challenges in generalizing across diverse retrieval paradigms and collections. This paper provides a comprehensive evaluation of state-of-the-art QPPs (e.g. NQC, UQC), LETOR-based features, and newly explored dense-based predictors. Using diverse sparse rankers (BM25, DFree without and with query expansion), hybrid or dense rankers (SPLADE and ColBERT), and diverse test collections (ROBUST, GOV2, WT10G, and MS MARCO), we investigate the relationships between predicted and actual performance, with a focus on generalization and robustness. Results show significant variability in predictor accuracy, with collections as the main factor and rankers next. Some sparse predictors work to some extent on some collections (TREC ROBUST and GOV2) but do not generalise to other collections (WT10G and MS MARCO). While some predictors show promise in specific scenarios, their overall limitations constrain their utility for applications. We show that QPP-driven selective query processing offers only marginal gains, emphasizing the need for improved predictors that generalize across collections, align with dense retrieval architectures, and are useful for downstream applications.
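The abstract names NQC as one of the evaluated predictors but does not define it. In its commonly cited form, Normalized Query Commitment is the standard deviation of the retrieval scores of the top-k documents, divided by the absolute score the query receives against the whole collection. The sketch below assumes that formulation; the function name `nqc` and the example scores are made up for illustration and are not taken from the paper.

```python
import math


def nqc(top_scores, corpus_score):
    """Normalized Query Commitment for one query (post-retrieval predictor).

    top_scores: retrieval scores of the top-k documents.
    corpus_score: the query's score against the whole collection (assumed nonzero).
    """
    k = len(top_scores)
    mean = sum(top_scores) / k
    # Population variance of the top-k scores.
    variance = sum((s - mean) ** 2 for s in top_scores) / k
    return math.sqrt(variance) / abs(corpus_score)


if __name__ == "__main__":
    # Widely spread scores suggest a ranking that separates documents clearly.
    print(nqc([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], corpus_score=2.0))
    # Identical scores give the lowest possible value.
    print(nqc([3.0, 3.0, 3.0], corpus_score=1.5))
```

The predictor relies on the score distribution of a particular ranker. Since score scales differ across rankers and collections, a value that is informative in one setting need not carry over to another, which is consistent with the generalization problem the abstract reports.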

artificial intelligence, natural language, ranker, (12 more...)

arXiv.org Artificial Intelligence

2504.01101

Country:

Europe > France > Occitanie > Haute-Garonne > Toulouse (0.05)
North America > United States (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.35)


Collaborative LLM Numerical Reasoning with Local Data Protection

Zhang, Min, Lu, Yuzhe, Zhou, Yun, Xu, Panpan, Cheong, Lin Lee, Lu, Chang-Tien, Wang, Haozhu

arXiv.org Artificial Intelligence, Mar-31-2025

Numerical reasoning over documents, which demands both contextual understanding and logical inference, is challenging for low-capacity local models deployed on computation-constrained devices. Although such complex reasoning queries could be routed to powerful remote models like GPT-4, exposing local data raises significant data leakage concerns. Existing mitigation methods generate problem descriptions or examples for remote assistance. However, the inherent complexity of numerical reasoning hinders the local model from generating logically equivalent queries and accurately inferring answers with remote guidance. In this paper, we present a model collaboration framework with two key innovations: (1) a context-aware synthesis strategy that shifts the query domains while preserving logical consistency; and (2) a tool-based answer reconstruction approach that reuses the remote-generated problem-solving pattern with code snippets. Experimental results demonstrate that our method achieves better reasoning accuracy than solely using local models while providing stronger data protection than fully relying on remote models. Furthermore, our method improves accuracy by 16.2% - 43.6% while reducing data leakage by 2.3% - 44.6% compared to existing data protection approaches.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2504.00299

Country:

Europe > Switzerland > Basel-City > Basel (0.04)
North America > United States > Virginia (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.34)


Bias-Aware Agent: Enhancing Fairness in AI-Driven Knowledge Retrieval

Singh, Karanbir, Ngu, William

arXiv.org Artificial Intelligence, Mar-27-2025

Methods for retrieving accessible information have advanced faster in the last few years than in the decades since the internet's creation. Search engines, like Google, have been the number one way to find relevant data. They have always relied on the user's ability to find the best information among the billions of links and sources at everybody's fingertips. The advent of large language models (LLMs) has completely transformed the field of information retrieval. LLMs excel not only at retrieving relevant knowledge but also at summarizing it effectively, making information more accessible and consumable for users. On top of this, the rise of AI agents has introduced another aspect to information retrieval: dynamic information retrieval, which enables the integration of real-time data, such as weather forecasts and financial data, with the knowledge base to curate context-aware knowledge. However, despite these advancements, the agents remain susceptible to issues of bias and fairness, challenges deeply rooted within the knowledge base and training of LLMs. This study introduces a novel approach to bias-aware knowledge retrieval by leveraging an agentic framework and the innovative use of bias detectors as tools to identify and highlight inherent biases in the retrieved content. By empowering users with transparency and awareness, this approach aims to foster more equitable information systems and promote the development of responsible AI.

information retrieval, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3701716.3716885

2503.21237

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Oceania > Australia > New South Wales > Sydney (0.06)
Asia > China > Hubei Province > Wuhan (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
