AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval

Lin, Hao, Xie, Peitong, Chen, Jingxue, Lin, Jie, Tang, Qingkun, Lu, Qianchun

arXiv.org Artificial IntelligenceSep-9-2025

Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhencement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.

large language model, machine learning, retrieval performance, (23 more...)

arXiv.org Artificial Intelligence

2509.0665

Country:

Asia (0.69)
North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval

Bui, Chi Minh, Thieu, Ngoc Mai, Nguyen, Van Vinh, Jung, Jason J., Bui, Khac-Hoai Nam

arXiv.org Artificial IntelligenceSep-9-2025

The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that, by incorporating KG-CQR, the performance consistently outperforms the existing baseline in terms of retrieval effectiveness

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2508.20417

Country:

North America > United States (1.00)
Asia (1.00)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.34)

Industry:

Leisure & Entertainment > Sports > Football (0.68)
Law > Business Law > Antitrust Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Language Bias in Information Retrieval: The Nature of the Beast and Mitigation Methods

Yang, Jinrui, Jiang, Fan, Baldwin, Timothy

arXiv.org Artificial IntelligenceSep-9-2025

Language fairness in multilingual information retrieval (MLIR) systems is crucial for ensuring equitable access to information across diverse languages. This paper sheds light on the issue, based on the assumption that queries in different languages, but with identical semantics, should yield equivalent ranking lists when retrieving on the same multilingual documents. We evaluate the degree of fairness using both traditional retrieval methods, and a DPR neural ranker based on mBERT and XLM-R. Additionally, we introduce `LaKDA', a novel loss designed to mitigate language biases in neural MLIR approaches. Our analysis exposes intrinsic language biases in current MLIR technologies, with notable disparities across the retrieval methods, and the effectiveness of LaKDA in enhancing language fairness.

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2509.06195

Country:

Europe > Spain (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Venkit, Pranav Narayanan, Laban, Philippe, Zhou, Yilun, Huang, Kung-Hsiang, Mao, Yixin, Wu, Chien-Sheng

arXiv.org Artificial IntelligenceSep-8-2025

Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40--80% across systems.

large language model, machine learning, research agent, (17 more...)

arXiv.org Artificial Intelligence

2509.04499

Country: North America > Mexico (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Ontology-Aligned Embeddings for Data-Driven Labour Market Analytics

Hihn, Heinke, Dittrich, Dennis A. V., Jeske, Carl, Sobral, Cayo Costa, Pais, Helio, Lochmann, Timm

arXiv.org Artificial IntelligenceSep-8-2025

The limited ability to reason across occupational data from different sources is a long-standing bottleneck for data-driven labour market analytics. Previous research has relied on hand-crafted ontologies that allow such reasoning but are computationally expensive and require careful maintenance by human experts. The rise of language processing machine learning models offers a scalable alternative by learning shared semantic spaces that bridge diverse occupational vocabularies without extensive human curation. We present an embedding-based alignment process that links any free-form German job title to two established ontologies - the German Klassifikation der Berufe and the International Standard Classification of Education. Using publicly available data from the German Federal Employment Agency, we construct a dataset to fine-tune a Sentence-BERT model to learn the structure imposed by the ontologies. The enriched pairs (job title, embedding) define a similarity graph structure that we can use for efficient approximate nearest-neighbour search, allowing us to frame the classification process as a semantic search problem. This allows for greater flexibility, e.g., adding more classes. We discuss design decisions, open challenges, and outline ongoing work on extending the graph with other ontologies and multilingual titles.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.04942

Country: Europe > Germany (1.00)

Genre: Research Report (1.00)

Industry:

Banking & Finance (1.00)
Education > Educational Setting > Higher Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Evaluating NL2SQL via SQL2NL

Safarzadeh, Mohammadtaher, Oroojlooyjadid, Afshin, Roth, Dan

arXiv.org Artificial IntelligenceSep-8-2025

Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation-distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain -- highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.04657

Country: Europe (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.76)
(2 more...)

Add feedback

DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval

Ayaou, Iliass, Cavallucci, Denis, Chibane, Hicham

arXiv.org Artificial IntelligenceSep-5-2025

Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families aggregated at the family level to reduce international redundancy, with citation based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document and passage level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain across all configurations. Passage-level retrieval consistently outperforms document-level, and dense methods provide modest gains over BM25, but none close the OUT-domain gap. Document-level RRF yields strong effectiveness efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on huggingface at https://huggingface.co/datasets/datalyes/DAPFAM_patent.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.22141

Country:

Europe (0.67)
North America > United States (0.46)

Genre: Research Report (1.00)

Industry: Law > Intellectual Property & Technology Law (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Enhancing Technical Documents Retrieval for RAG

Lai, Songjiang, Cheung, Tsun-Hin, Fung, Ka-Chun, Xue, Kaiwen, Lin, Kwan-Ho, Choi, Yan-Ming, Ng, Vincent, Lam, Kin-Man

arXiv.org Artificial IntelligenceSep-5-2025

In this paper, we introduce Technical-Embeddings, a novel framework designed to optimize semantic retrieval in technical documentation, with applications in both hardware and software development. Our approach addresses the challenges of understanding and retrieving complex technical content by leveraging the capabilities of Large Language Models (LLMs). First, we enhance user queries by generating expanded representations that better capture user intent and improve dataset diversity, thereby enriching the fine-tuning process for embedding models. Second, we apply summary extraction techniques to encode essential contextual information, refining the representation of technical documents. To further enhance retrieval performance, we fine-tune a bi-encoder BERT model using soft prompting, incorporating separate learning parameters for queries and document context to capture fine-grained semantic nuances. We evaluate our approach on two public datasets, RAG-EDA and Rust-Docs-QA, demonstrating that Technical-Embeddings significantly outperforms baseline models in both precision and recall. Our findings highlight the effectiveness of integrating query expansion and contextual summarization to enhance information access and comprehension in technical domains. This work advances the state of Retrieval-Augmented Generation (RAG) systems, offering new avenues for efficient and accurate technical document retrieval in engineering and product development workflows.

information retrieval, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2509.04139

Country: Asia > China (0.15)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback

Deep Research Agents: A Systematic Examination And Roadmap

Huang, Yuxuan, Chen, Yihang, Zhang, Haozheng, Li, Kang, Zhou, Huichi, Fang, Meng, Yang, Linyi, Li, Xiaoguang, Shang, Lifeng, Xu, Songcen, Hao, Jianye, Shao, Kun, Wang, Jun

arXiv.org Artificial IntelligenceSep-4-2025

The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: {https://github.com/ai-agents-2030/awesome-deep-research-agent}.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2506.18096

Genre:

Overview (1.00)
Research Report (0.82)
Workflow (0.69)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics

Russo, Matthew, Kraska, Tim

arXiv.org Artificial IntelligenceSep-4-2025

With advances in large language models (LLMs), researchers are creating new systems that can perform AI-driven analytics over large unstructured datasets. Recent work has explored executing such analytics queries using semantic operators -- a declarative set of AI-powered data transformations with natural language specifications. However, even when optimized, these operators can be expensive to execute on millions of records and their iterator execution semantics make them ill-suited for interactive data analytics tasks. In another line of work, Deep Research systems have demonstrated an ability to answer natural language question(s) over large datasets. These systems use one or more LLM agent(s) to plan their execution, process the dataset(s), and iteratively refine their answer. However, these systems do not explicitly optimize their query plans which can lead to poor plan execution. In order for AI-driven analytics to excel, we need a runtime which combines the optimized execution of semantic operators with the flexibility and more dynamic execution of Deep Research systems. As a first step towards this vision, we build a prototype which enables Deep Research agents to write and execute optimized semantic operator programs. We evaluate our prototype and demonstrate that it can outperform a handcrafted semantic operator program and open Deep Research systems on two basic queries. Compared to a standard open Deep Research agent, our prototype achieves up to 1.95x better F1-score. Furthermore, even if we give the agent access to semantic operators as tools, our prototype still achieves cost and runtime savings of up to 76.8% and 72.7% thanks to its optimized execution.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.02751

Country:

North America > United States > California (0.28)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report (0.41)

Industry: Information Technology (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.90)

Add feedback