AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Computerized databases have made it easier to search traditional indexes of print publications, but the process still usually requires time-consuming serial searching of one database after another, followed by separate methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where the information being sought is spread across widely different formats, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation

Fleshman, William, Van Durme, Benjamin

arXiv.org Artificial Intelligence, Jun-20-2024

Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.

dataset, re-adaptir, retrieval model, (15 more...)

arXiv.org Artificial Intelligence

2406.14764

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Learned Graph Rewriting with Equality Saturation: A New Paradigm in Relational Query Rewrite and Beyond

Bărbulescu, George-Octavian, Wang, Taiyi, Singh, Zak, Yoneki, Eiko

arXiv.org Artificial Intelligence, Jun-19-2024

Query rewrite systems perform graph substitutions using rewrite rules to generate optimal SQL query plans. Rewriting logical and physical relational query plans is proven to be an NP-hard sequential decision-making problem with a search space exponential in the number of rewrite rules. In this paper, we address the query rewrite problem by interleaving Equality Saturation and Graph Reinforcement Learning (RL). The proposed system, Aurora, rewrites relational queries by guiding Equality Saturation, a method from compiler literature to perform non-destructive graph rewriting, with a novel RL agent that embeds both the spatial structure of the query graph as well as the temporal dimension associated with the sequential construction of query plans. Our results show Graph Reinforcement Learning for non-destructive graph rewriting yields SQL plans orders of magnitude faster than existing equality saturation solvers, while also achieving competitive results against mainstream query optimisers.

equality saturation, query plan, saturation, (12 more...)

arXiv.org Artificial Intelligence

2407.12794

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Japan (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Data Collection and Labeling Techniques for Machine Learning

Huang, Qianyu, Zhao, Tongfang

arXiv.org Artificial Intelligence, Jun-19-2024

This remarkable advancement can be attributed to two key factors: the exponential rise in computational power and the ever-increasing availability of vast datasets [1-3]. However, the very foundation upon which this progress rests, data collection and labeling, presents significant challenges that can hinder the efficacy and ethical implementation of ML models [4-8]. This review paper delves into the intricate world of data collection and labeling for machine learning, drawing upon insights from both the data management and machine learning communities. The transformative potential of machine learning is evident across a multitude of domains. From revolutionizing healthcare with disease diagnosis and personalized medicine [9] to powering self-driving cars [10] and optimizing logistics in supply chains [11], ML algorithms are rapidly reshaping our world. At the heart of these advancements lies the ability of ML models to learn from data, identify patterns, and make predictions based on the information they have been exposed to. The quality and quantity of data used to train these models are paramount to their success. High-quality, diverse, and well-labeled data are essential for building robust and generalizable ML models that can perform effectively in real-world scenarios [12, 13].

dataset, learning, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2407.12793

Country:

North America > United States > New York > New York County > New York City (0.05)
Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre:

Overview (1.00)
Research Report (0.82)

Industry:

Health & Medicine (1.00)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Prediction of the Realisation of an Information Need: An EEG Study

McGuire, Niall, Moshfeghi, Dr Yashar

arXiv.org Artificial Intelligence, Jun-18-2024

One of the foundational goals of Information Retrieval (IR) is to satisfy searchers' Information Needs (IN). Understanding how INs physically manifest has long been a complex and elusive process. However, recent studies utilising Electroencephalography (EEG) data have provided real-time insights into the neural processes associated with INs. Unfortunately, they have yet to demonstrate how this insight can practically benefit the search experience. As such, within this study, we explore the ability to predict the realisation of IN within EEG data across 14 subjects whilst partaking in a Question-Answering (Q/A) task. Furthermore, we investigate the combinations of EEG features that yield optimal predictive performance, as well as identify regions within the Q/A queries where a subject's realisation of IN is more pronounced. The findings from this work demonstrate that EEG data is sufficient for the real-time prediction of the realisation of an IN across all subjects with an accuracy of 73.5% (SD 2.6%) and on a per-subject basis with an accuracy of 90.1% (SD 22.1%). This work helps to close the gap by bridging theoretical neuroscientific advancements with tangible improvements in information retrieval practices, paving the way for real-time prediction of the realisation of IN.

information, prediction, realisation, (11 more...)

arXiv.org Artificial Intelligence

2406.08105

Country:

North America > United States > District of Columbia > Washington (0.05)
Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.91)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.71)

PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval

Huynh, Tuan-Luc, Vu, Thuy-Trang, Wang, Weiqing, Wei, Yinwei, Le, Trung, Gasevic, Dragan, Li, Yuan-Fang, Do, Thanh-Toan

arXiv.org Artificial Intelligence, Jun-18-2024

Differentiable Search Index (DSI) utilizes Pre-trained Language Models (PLMs) for efficient document retrieval without relying on external indexes. However, DSIs need full re-training to handle updates in dynamic corpora, causing significant computational inefficiencies. We introduce PromptDSI, a rehearsal-free, prompt-based approach for instance-wise incremental learning in document retrieval. PromptDSI attaches prompts to the frozen PLM's encoder of DSI, leveraging its powerful representation to efficiently index new corpora while maintaining a balance between stability and plasticity. We eliminate the initial forward pass of prompt-based continual learning methods that doubles training and inference time. Moreover, we propose a topic-aware prompt pool that employs neural topic embeddings as fixed keys. This strategy ensures diverse and effective prompt usage, addressing the challenge of parameter underutilization caused by the collapse of the query-key matching mechanism. Our empirical evaluations demonstrate that PromptDSI matches IncDSI in managing forgetting while significantly enhancing recall by over 4% on new corpora.

learning, promptdsi, retrieval, (16 more...)

arXiv.org Artificial Intelligence

2406.12593

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(7 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (0.93)
Leisure & Entertainment > Sports > Football (0.46)
Education > Educational Setting > Online (0.46)
Leisure & Entertainment > Sports > Soccer (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels

Xian, Jasper, Samuel, Saron, Khoubsirat, Faraz, Pradeep, Ronak, Sultan, Md Arafat, Florian, Radu, Roukos, Salim, Sil, Avirup, Potts, Christopher, Khattab, Omar

arXiv.org Artificial Intelligence, Jun-17-2024

We develop a method for training small-scale (under 100M parameter) neural information retrieval models with as few as 10 gold relevance labels. The method depends on generating synthetic queries for documents using a language model (LM), and the key step is that we automatically optimize the LM prompt that is used to generate these queries based on training quality. In experiments with the BIRCO benchmark, we find that models trained with our method outperform RankZephyr and are competitive with RankLLama, both of which are 7B parameter models trained on over 100K labels. These findings point to the power of automatic prompt ...

Figure 1: An overview of the PATH pipeline for training a reranker with synthetic queries. A user only needs to input a prompt with the task description and as few as 10 relevance judgements to achieve strong results.
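The core idea in this abstract, choosing the query-generation prompt by the quality of the model it ends up training, can be sketched roughly as follows. This is a minimal illustration and not the authors' PATH implementation: `select_prompt`, `toy_generate`, `toy_train_and_score`, the candidate prompts, and the toy data are all hypothetical, and a plain exhaustive search over candidates stands in for whatever prompt optimizer the paper actually uses.

```python
from typing import Callable, List, Sequence, Tuple


def select_prompt(
    candidate_prompts: Sequence[str],
    generate_queries: Callable[[str], List[str]],
    train_and_score: Callable[[List[str]], float],
) -> Tuple[str, float]:
    """Pick the prompt whose synthetic queries give the best downstream score.

    generate_queries: stand-in for an LM call that writes synthetic queries.
    train_and_score: stand-in for training a reranker on those queries and
    evaluating it against the handful of gold relevance labels.
    """
    best_prompt, best_score = "", float("-inf")
    for prompt in candidate_prompts:
        synthetic = generate_queries(prompt)
        score = train_and_score(synthetic)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


# Toy stand-ins (hypothetical): no LM and no training happen here.
DOCS = ["flu", "asthma"]
GOLD_QUERIES = {"what are the symptoms of flu", "what are the symptoms of asthma"}


def toy_generate(prompt: str) -> List[str]:
    # A real system would send the prompt and each document to an LM.
    return [prompt.format(doc=doc) for doc in DOCS]


def toy_train_and_score(queries: List[str]) -> float:
    # A real system would train a reranker and measure it on gold labels;
    # here the score is just the fraction of queries matching the gold set.
    return sum(q in GOLD_QUERIES for q in queries) / len(queries)


CANDIDATES = ["tell me about {doc}", "what are the symptoms of {doc}"]
best_prompt, best_score = select_prompt(CANDIDATES, toy_generate, toy_train_and_score)
```

In this toy setup the second candidate wins because its generated queries match the gold set. In a real pipeline the two callables would be by far the most expensive part of the loop, so the number of candidate prompts evaluated matters.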

query, reranker, retrieval, (14 more...)

arXiv.org Artificial Intelligence

2406.11706

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Fusion Makes Perfection: An Efficient Multi-Grained Matching Approach for Zero-Shot Relation Extraction

Li, Shilong, Bai, Ge, Zhang, Zhang, Liu, Ying, Lu, Chenji, Guo, Daichi, Liu, Ruifang, Sun, Yong

arXiv.org Artificial Intelligence, Jun-17-2024

Predicting unseen relations that cannot be observed during the training phase is a challenging task in relation extraction. Previous works have made progress by matching the semantics between input instances and label descriptions. However, fine-grained matching often requires laborious manual annotation, and rich interactions between instances and label descriptions come with significant computational overhead. In this work, we propose an efficient multi-grained matching approach that uses virtual entity matching to reduce manual annotation cost, and fuses coarse-grained recall and fine-grained classification for rich interactions with guaranteed inference speed. Experimental results show that our approach outperforms the previous state-of-the-art (SOTA) methods, and achieves a balance between inference efficiency and prediction accuracy in zero-shot relation extraction tasks. Our code is available at https://github.com/longls777/EMMA.
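The fusion of coarse-grained recall with fine-grained classification mentioned in this abstract follows a general two-stage pattern that can be sketched as below. This is only an illustration of that pattern, not the EMMA code from the linked repository: the function names, the dot-product recall, the toy vectors, and the `fine_score` callable are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def coarse_recall(
    instance_vec: Sequence[float],
    label_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Cheap stage: keep the k labels whose vectors best match the instance."""
    ranked = sorted(
        label_vecs,
        key=lambda name: dot(instance_vec, label_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def predict(
    instance_vec: Sequence[float],
    label_vecs: Dict[str, Sequence[float]],
    fine_score: Callable[[str], float],
    k: int = 2,
) -> str:
    """Expensive stage: score only the recalled candidates and pick the best."""
    candidates = coarse_recall(instance_vec, label_vecs, k)
    return max(candidates, key=fine_score)


# Toy data (hypothetical relation labels and vectors).
LABEL_VECS = {
    "born_in": [1.0, 0.0],
    "works_for": [0.9, 0.1],
    "capital_of": [0.0, 1.0],
}
# Stand-in for a costly classifier that models instance-label interaction.
FINE_SCORES = {"born_in": 0.3, "works_for": 0.8, "capital_of": 0.99}

prediction = predict([1.0, 0.2], LABEL_VECS, FINE_SCORES.get, k=2)
```

The point of the split is that the expensive scorer runs on only `k` candidates rather than on every label, which is how inference cost stays bounded. The trade-off is that a label dropped at the recall stage can never be predicted, however well the fine scorer would have rated it.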

classification, classification model, relation, (15 more...)

arXiv.org Artificial Intelligence

2406.11429

Country:

North America > Dominican Republic (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
North America > United States > California (0.04)
(3 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.62)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis

Ciancone, Mathieu, Kerboua, Imene, Schaeffer, Marion, Siblini, Wissam

arXiv.org Artificial Intelligence, Jun-17-2024

Recently, numerous embedding models have been made available and widely used for various NLP tasks. The Massive Text Embedding Benchmark (MTEB) has primarily simplified the process of choosing a model that performs well for several tasks in English, but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. We gather 15 existing datasets in an easy-to-use interface and create three new French datasets for a global evaluation of 8 task categories. We compare 51 carefully selected embedding models on a large scale, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find that even though no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well. Our work comes with open-source code, new datasets and a public leaderboard.

benchmark, dataset, similarity, (15 more...)

arXiv.org Artificial Intelligence

2405.20468

Country:

Europe > France (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Mexico (0.04)
(15 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Khmer Semantic Search Engine (KSE): Digital Information Access and Document Retrieval

Thuon, Nimol

arXiv.org Artificial Intelligence, Jun-16-2024

The search engine process is crucial for document content retrieval. For Khmer documents, an effective tool is needed to extract essential keywords and facilitate accurate searches. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content such as reports, articles, and social media feedback, enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods. Utilizing semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries, performs precise matching, and provides the best matching offline documents and online URLs. We propose three semantic search frameworks: semantic search based on a keyword dictionary, semantic search based on ontology, and semantic search based on ranking. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and addressed issues related to searching and semantic search. Our findings demonstrate that understanding search term semantics can lead to significantly more accurate results.
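The first of the three frameworks named in this abstract, search based on a keyword dictionary, might look something like the sketch below. It is a hypothetical illustration and not the KSE system: the dictionary entries, document annotations, and function names are invented, English placeholders are used instead of Khmer, and whitespace tokenization is assumed for brevity even though Khmer text is not written with spaces between words and would need a proper word segmenter.

```python
from typing import Dict, List, Set

# Hypothetical dictionary mapping surface terms to canonical keywords.
KEYWORD_DICT: Dict[str, str] = {
    "school": "education",
    "teacher": "education",
    "university": "education",
    "hospital": "health",
    "doctor": "health",
}

# Hypothetical documents, each manually annotated with canonical keywords.
DOC_KEYWORDS: Dict[str, Set[str]] = {
    "doc1": {"education"},
    "doc2": {"health"},
    "doc3": {"education", "health"},
}


def extract_keywords(query: str, dictionary: Dict[str, str]) -> Set[str]:
    """Map query terms found in the dictionary to canonical keywords."""
    # Assumption: whitespace tokenization; Khmer needs a word segmenter.
    return {dictionary[t] for t in query.lower().split() if t in dictionary}


def search(
    query: str,
    dictionary: Dict[str, str],
    docs: Dict[str, Set[str]],
) -> List[str]:
    """Return ids of matching documents, most shared keywords first."""
    keywords = extract_keywords(query, dictionary)
    scored = [(len(keywords & annotated), doc_id) for doc_id, annotated in docs.items()]
    # Sort by descending overlap, then by id for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in scored if overlap > 0]


results = search("teacher at the hospital", KEYWORD_DICT, DOC_KEYWORDS)
```

Because matching happens on canonical keywords rather than raw strings, a query containing "teacher" can retrieve a document annotated only with "education", which is the sense in which a dictionary lookup adds a semantic layer over exact string matching.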

keyword, search engine, search result, (8 more...)

arXiv.org Artificial Intelligence

2406.0932

Country:

Asia > Cambodia > Phnom Penh Province > Phnom Penh (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Belgium (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry: Consumer Products & Services > Travel (0.30)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

SparseCL: Sparse Contrastive Learning for Contradiction Retrieval

Xu, Haike, Lin, Zongyu, Sun, Yizhou, Chang, Kai-Wei, Indyk, Piotr

arXiv.org Artificial Intelligence, Jun-15-2024

Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. For retrieving arguments that contradict the query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit significant limitations. The former struggles to capture the essence of contradiction due to its inherent nature of favoring similarity, while the latter suffers from computational inefficiency, especially when the size of corpora is large. To address these challenges, we introduce a novel approach: SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.
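The combined metric of cosine similarity and a sparsity function described in this abstract could be computed along the following lines. The abstract does not say which sparsity function is used or how the two terms are combined, so this sketch assumes the Hoyer sparsity measure applied to the difference of the two embeddings and a simple weighted sum with a hypothetical weight `alpha`. The toy vectors stand in for trained sentence embeddings.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hoyer_sparsity(v: Sequence[float]) -> float:
    """Hoyer measure: 1.0 for a one-hot vector, 0.0 for a uniform one."""
    n = len(v)
    l1 = sum(abs(x) for x in v)
    l2 = math.sqrt(sum(x * x for x in v))
    if n < 2 or l2 == 0:
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def contradiction_score(
    query: Sequence[float], doc: Sequence[float], alpha: float = 1.0
) -> float:
    """Assumed combination: similarity plus weighted sparsity of the difference."""
    diff = [q - d for q, d in zip(query, doc)]
    return cosine(query, doc) + alpha * hoyer_sparsity(diff)


def rank_by_contradiction(
    query: Sequence[float], docs: Sequence[Sequence[float]], alpha: float = 1.0
) -> List[int]:
    """Return document indices ordered from highest to lowest score."""
    return sorted(
        range(len(docs)),
        key=lambda i: contradiction_score(query, docs[i], alpha),
        reverse=True,
    )


QUERY = [1.0, 1.0, 1.0, 1.0]
DOCS = [
    [1.0, 1.0, 1.0, -1.0],     # same topic, differs in a single dimension
    [0.5, 0.5, 0.5, 0.5],      # same direction as the query (paraphrase-like)
    [-1.0, -1.0, -1.0, -1.0],  # different everywhere
]
order = rank_by_contradiction(QUERY, DOCS)
```

Under these assumptions, a document that stays close to the query but differs sharply in a few dimensions scores above both a near-duplicate and a wholly dissimilar document. Each score needs only vector arithmetic on precomputed embeddings, which matches the abstract's point about avoiding exhaustive pairwise comparison with a heavier model.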

contradiction, sparsecl, retrieval, (15 more...)

arXiv.org Artificial Intelligence

2406.10746

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Middle East > UAE (0.05)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(4 more...)

Genre:

Research Report (0.70)
Overview (0.48)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
