AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

Neural Information Processing SystemsMay-27-2025, 05:30:52 GMT

Self-Retrieval: End-to-End Information Retrieval with One Large Language Model

The rise of large language models (LLMs) has significantly transformed both the construction and application of information retrieval (IR) systems. However, current interactions between IR systems and LLMs remain limited, with LLMs merely serving as part of components within IR systems, and IR systems being constructed independently of LLMs. This separated architecture restricts knowledge sharing and deep collaboration between them.In this paper, we introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval architecture.Self-Retrieval unifies all essential IR functions within a single LLM, leveraging the inherent capabilities of LLMs throughout the IR process.Specifically, Self-Retrieval internalizes the retrieval corpus through self-supervised learning, transforms the retrieval process into sequential passage generation, and performs relevance assessment for reranking.Experimental results demonstrate that Self-Retrieval not only outperforms existing retrieval approaches by a significant margin, but also substantially enhances the performance of LLM-driven downstream applications like retrieval-augmented generation.

end-to-end information retrieval, language model, self-retrieval, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Neural Information Processing SystemsMay-27-2025, 04:36:36 GMT

Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits

There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of (approximately) "navigable" graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to the given distance function. The complete graph is obviously navigable for any point set, but the important question for applications is if sparser graphs can be constructed. While this question is fairly well understood in low-dimensions, we establish some of the first upper and lower bounds for high-dimensional point sets. First, we give a simple and efficient way to construct a navigable graph with average degree O(\sqrt{n \log n }) for any set of n points, in any dimension, for any distance function.

artificial intelligence, information retrieval, natural language, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.97)
Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.64)

LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval

Shen, Yanzhen, Chen, Sihao, Xu, Xueqiang, Zhang, Yunyi, Malaviya, Chaitanya, Roth, Dan

While significant progress has been made with dual- and bi-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is often overlooked yet important in downstream applications. Current dense retrievers struggle with such queries, such that the retrieved results do not respect the logical constraints implied in the queries. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning, and learns dense retrievers to respect the subset and mutually-exclusive set relation between query results via two sets of soft constraints expressed via t-norm in the learning objective. We evaluate the effectiveness of LogiCoL on the task of entity retrieval, where the model is expected to retrieve a set of entities in Wikipedia that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL yield improvement both in terms of retrieval performance and logical consistency in the results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is most effective.

artificial intelligence, machine learning, natural language, (14 more...)

2505.19588

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.34)

Jagadeeshan, Manoj Balaji, Raj, Prince, Goyal, Pawan

Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents

The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana

information retrieval, large language model, machine learning, (19 more...)

2505.19494

Country: Asia > India (0.28)

Genre:

Research Report > New Finding (0.93)
Research Report > Promising Solution (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
(2 more...)

Improving Ad matching via Cluster-Adaptive Keyword Expansion and Relevance tuning

Saha, Dipanwita, Zaman, Anis, Zou, Hua, Chen, Ning, Shu, Xinxin, Vase, Nadia, Bagherjeiran, Abraham

In search advertising, keyword matching connects user queries with relevant ads. While token-based matching increases ad coverage, it can reduce relevance due to overly permissive semantic expansion. This work extends keyword reach through document-side semantic keyword expansion, using a language model to broaden token-level matching without altering queries. We propose a solution using a pre-trained siamese model to generate dense vector representations of ad keywords and identify semantically related variants through nearest neighbor search. To maintain precision, we introduce a cluster-based thresholding mechanism that adjusts similarity cutoffs based on local semantic density. Each expanded keyword maps to a group of seller-listed items, which may only partially align with the original intent. To ensure relevance, we enhance the downstream relevance model by adapting it to the expanded keyword space using an incremental learning strategy with a lightweight decision tree ensemble. This system improves both relevance and click-through rate (CTR), offering a scalable, low-latency solution adaptable to evolving query behavior and advertising inventory.

information retrieval, machine learning, natural language, (17 more...)

2505.18897

Country: North America > United States (0.47)

Genre: Research Report (0.40)

Industry:

Marketing (1.00)
Information Technology > Services (0.72)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
(3 more...)

Intent Classification on Low-Resource Languages with Query Similarity Search

Bhalla, Arjun, Huang, Qi

Intent classification is an important component of a functional Information Retrieval ecosystem. Many current approaches to intent classification, typically framed as a classification problem, can be problematic as intents are often hard to define and thus data can be difficult and expensive to annotate. The problem is exacerbated when we need to extend the intent classification system to support multiple and in particular low-resource languages. To address this, we propose casting intent classification as a query similarity search problem - we use previous example queries to define an intent, and a query similarity method to classify an incoming query based on the labels of its most similar queries in latent space. With the proposed approach, we are able to achieve reasonable intent classification performance for queries in low-resource languages in a zero-shot setting.

information retrieval, machine learning, natural language, (17 more...)

2505.18241

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsMay-26-2025, 22:03:53 GMT

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded fine-grained textual description (referred to as "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g.

information retrieval, large language model, natural language, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.90)

Neural Information Processing SystemsMay-26-2025, 21:01:32 GMT

UQE: A Query Engine for Unstructured Databases

Analytics on structured data is a mature field with many successful methods.However, most real world data exists in unstructured form, such as images and conversations.We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics.In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators.The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution.In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls.We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.

large language model, natural language, query engine, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.99)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

arXiv.org Artificial IntelligenceMay-23-2025

Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?

Jedidi, Nour, Chuang, Yung-Sung, Glass, James, Lin, Jimmy

With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers built on Large Language Models (LLMs). These methods typically employ an LLM to produce an explicit, step-by-step reasoning process before arriving at a final relevance prediction. But, does reasoning actually improve reranking accuracy? In this paper, we dive deeper into this question, studying the impact of the reasoning process by comparing reasoning-based pointwise rerankers (ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under identical training conditions, and observe that StandardRR generally outperforms ReasonRR. Building on this observation, we then study the importance of reasoning to ReasonRR by disabling its reasoning process (ReasonRR-NoReason), and find that ReasonRR-NoReason is surprisingly more effective than ReasonRR. Examining the cause of this result, our findings reveal that reasoning-based rerankers are limited by the LLM's reasoning process, which pushes it toward polarized relevance scores and thus fails to consider the partial relevance of passages, a key factor for the accuracy of pointwise rerankers.

information retrieval, large language model, machine learning, (20 more...)

2505.16886

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)