AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages

Haq, Saiful, Sharma, Ashutosh, Bhattacharyya, Pushpak

arXiv.org Artificial IntelligenceDec-14-2023

In this paper, we introduce Neural Information Retrieval resources for 11 widely spoken Indian Languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) from two major Indian language families (Indo-Aryan and Dravidian). These resources include (a) INDIC-MARCO, a multilingual version of the MSMARCO dataset in 11 Indian Languages created using Machine Translation, and (b) Indic-ColBERT, a collection of 11 distinct Monolingual Neural Information Retrieval models, each trained on one of the 11 languages in the INDIC-MARCO dataset. To the best of our knowledge, IndicIRSuite is the first attempt at building large-scale Neural Information Retrieval resources for a large number of Indian languages, and we hope that it will help accelerate research in Neural IR for Indian Languages. Experiments demonstrate that Indic-ColBERT achieves 47.47% improvement in the MRR@10 score averaged over the INDIC-MARCO baselines for all 11 Indian languages except Oriya, 12.26% improvement in the NDCG@10 score averaged over the MIRACL Bengali and Hindi Language baselines, and 20% improvement in the MRR@100 Score over the Mr.Tydi Bengali Language baseline. IndicIRSuite is available at https://github.com/saifulhaq95/IndicIRSuite

dataset, indian language, indic-colbert, (13 more...)

arXiv.org Artificial Intelligence

2312.09508

Country:

North America > United States > Illinois (0.04)
Asia > India > Gujarat > Gandhinagar (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

An Anchor Learning Approach for Citation Field Learning

Yuan, Zilin, Chen, Borun, Dai, Yimeng, Li, Yinghui, Zheng, Hai-Tao, Zhang, Rui

arXiv.org Artificial IntelligenceDec-14-2023

Citation field learning is to segment a citation string into fields of interest such as author, title, and venue. Extracting such fields from citations is crucial for citation indexing, researcher profile analysis, etc. User-generated resources like academic homepages and Curriculum Vitae, provide rich citation field information. However, extracting fields from these resources is challenging due to inconsistent citation styles, incomplete sentence syntax, and insufficient training data. To address these challenges, we propose a novel algorithm, CIFAL (citation field learning by anchor learning), to boost the citation field learning performance. CIFAL leverages the anchor learning, which is model-agnostic for any Pre-trained Language Model, to help capture citation patterns from the data of different citation styles. The experiments demonstrate that CIFAL outperforms state-of-the-art methods in citation field learning, achieving a 2.68% improvement in field-level F1-scores. Extensive analysis of the results further confirms the effectiveness of CIFAL quantitatively and qualitatively.

anchor, citation field, venue field, (13 more...)

arXiv.org Artificial Intelligence

2309.03559

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Asia > Japan (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report > Promising Solution (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.36)

Add feedback

PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search

Yu, Shangdi, Engels, Joshua, Huang, Yihao, Shun, Julian

arXiv.org Artificial IntelligenceDec-13-2023

This paper studies density-based clustering of point sets. These methods use dense regions of points to detect clusters of arbitrary shapes. In particular, we study variants of density peaks clustering, a popular type of algorithm that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential, and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN, by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45x-734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, we show that PECANN is two orders of magnitude faster. As far as we know, our work is the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.

algorithm, dataset, neighbor, (13 more...)

arXiv.org Artificial Intelligence

2312.0394

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Finland > North Karelia > Joensuu (0.04)

Genre:

Overview (0.88)
Research Report (0.84)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps

Hurtado, Joan Figuerola

arXiv.org Artificial IntelligenceDec-12-2023

The paper presents a methodology for uncovering knowledge gaps on the internet using the Retrieval Augmented Generation (RAG) model. By simulating user search behaviour, the RAG system identifies and addresses gaps in information retrieval systems. The study demonstrates the effectiveness of the RAG system in generating relevant suggestions with a consistent accuracy of 93%. The methodology can be applied in various fields such as scientific discovery, educational enhancement, research development, market analysis, search engine optimisation, and content development. The results highlight the value of identifying and understanding knowledge gaps to guide future endeavours.

knowledge gap, query, retrieval-augmented generation, (12 more...)

arXiv.org Artificial Intelligence

2312.07796

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Unsupervised Temporal Action Localization via Self-paced Incremental Learning

Tang, Haoyu, Jiang, Han, Xu, Mingzhu, Hu, Yupeng, Zhu, Jihua, Nie, Liqiang

arXiv.org Artificial IntelligenceDec-12-2023

Recently, temporal action localization (TAL) has garnered significant interest in information retrieval community. However, existing supervised/weakly supervised methods are heavily dependent on extensive labeled temporal boundaries and action categories, which is labor-intensive and time-consuming. Although some unsupervised methods have utilized the ``iteratively clustering and localization'' paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudolabels for model training. To address these limitations, we present a novel self-paced incremental learning model to enhance clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve the clustering confidence through exploring the contextual feature-robust visual information. Thereafter, we design two (constant- and variable- speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of these video pseudolabels and further improving overall localization performance. Extensive experiments on two public datasets have substantiated the superiority of our model over several state-of-the-art competitors.

iteration, localization, video, (15 more...)

arXiv.org Artificial Intelligence

2312.07384

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

Add feedback

Dense X Retrieval: What Retrieval Granularity Should We Use?

Chen, Tong, Wang, Hongwei, Chen, Sihao, Yu, Wenhao, Ma, Kaixin, Zhao, Xinran, Zhang, Hongming, Yu, Dong

arXiv.org Artificial IntelligenceDec-11-2023

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.

computational linguistic, proposition, retrieval, (14 more...)

arXiv.org Artificial Intelligence

2312.06648

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(12 more...)

Genre: Research Report > New Finding (0.88)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

FOSS: A Self-Learned Doctor for Query Optimizer

Zhong, Kai, Sun, Luming, Ji, Tao, Li, Cuiping, Chen, Hong

arXiv.org Artificial IntelligenceDec-11-2023

Various works have utilized deep reinforcement learning (DRL) to address the query optimization problem in database system. They either learn to construct plans from scratch in a bottom-up manner or guide the plan generation behavior of traditional optimizer using hints. While these methods have achieved some success, they face challenges in either low training efficiency or limited plan search space. To address these challenges, we introduce FOSS, a novel DRL-based framework for query optimization. FOSS initiates optimization from the original plan generated by a traditional optimizer and incrementally refines suboptimal nodes of the plan through a sequence of actions. Additionally, we devise an asymmetric advantage model to evaluate the advantage between two plans. We integrate it with a traditional optimizer to form a simulated environment. Leveraging this simulated environment, FOSS can bootstrap itself to rapidly generate a large amount of high-quality simulated experiences. FOSS then learns and improves its optimization capability from these simulated experiences. We evaluate the performance of FOSS on Join Order Benchmark, TPC-DS, and Stack Overflow. The experimental results demonstrate that FOSS outperforms the state-of-the-art methods in terms of latency performance and optimization time. Compared to PostgreSQL, FOSS achieves savings ranging from 15% to 83% in total latency across different benchmarks.

foss, optimizer, query optimizer, (16 more...)

arXiv.org Artificial Intelligence

2312.06357

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre:

Workflow (1.00)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.86)

Add feedback

Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering

McDonald, Tavish, Tsan, Brian, Saini, Amar, Ordonez, Juanita, Gutierrez, Luis, Nguyen, Phan, Mason, Blake, Ng, Brenda

arXiv.org Artificial IntelligenceDec-11-2023

Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.

extraction, pdfminer, qasper, (15 more...)

arXiv.org Artificial Intelligence

2210.01959

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Merced County > Merced (0.04)
North America > Dominican Republic (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)

Add feedback

Enhancing Medical Specialty Assignment to Patients using NLP Techniques

Solomou, Chris

arXiv.org Artificial IntelligenceDec-9-2023

The introduction of Large Language Models (LLMs), and the vast volume of publicly available medical data, amplified the application of NLP to the medical domain. However, LLMs are pretrained on data that are not explicitly relevant to the domain that are applied to and are often biased towards the original data they were pretrained upon. Even when pretrained on domainspecific data, these models typically require time-consuming fine-tuning to achieve good performance for a specific task. To address these limitations, we propose an alternative approach that achieves superior performance while being computationally efficient. Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text. Our proposal does not require pretraining nor fine-tuning and can be applied directly to a specific setting for performing multi-label classification. Our objective is to automatically assign a new patient to the specialty of the medical professional they require, using a dataset that contains medical transcriptions and relevant keywords. To this end, we fine-tune the PubMedBERT model on this dataset, which serves as the baseline for our experiments. We then twice train/fine-tune a DNN and the RoBERTa language model, using both the keywords and the full transcriptions as input. We compare the performance of these approaches using relevant metrics. Our results demonstrate that utilizing keywords for text classification significantly improves classification performance, for both a basic DL architecture and a large language model. Our approach represents a promising and efficient alternative to traditional methods for finetuning language models on domain-specific data and has potential applications in various medical domains

dataset, keyword, language model, (15 more...)

arXiv.org Artificial Intelligence

2312.05585

Country: Europe > United Kingdom > England > North Yorkshire > York (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Relevance Feedback with Brain Signals

Ye, Ziyi, Xie, Xiaohui, Ai, Qingyao, Liu, Yiqun, Wang, Zhihong, Su, Weihang, Zhang, Min

arXiv.org Artificial IntelligenceDec-9-2023

The Relevance Feedback (RF) process relies on accurate and real-time relevance estimation of feedback documents to improve retrieval performance. Since collecting explicit relevance annotations imposes an extra burden on the user, extensive studies have explored using pseudo-relevance signals and implicit feedback signals as substitutes. However, such signals are indirect indicators of relevance and suffer from complex search scenarios where user interactions are absent or biased. Recently, the advances in portable and high-precision brain-computer interface (BCI) devices have shown the possibility to monitor user's brain activities during search process. Brain signals can directly reflect user's psychological responses to search results and thus it can act as additional and unbiased RF signals. To explore the effectiveness of brain signals in the context of RF, we propose a novel RF framework that combines BCI-based relevance feedback with pseudo-relevance signals and implicit signals to improve the performance of document re-ranking. The experimental results on the user study dataset show that incorporating brain signals leads to significant performance improvement in our RF framework. Besides, we observe that brain signals perform particularly well in several hard search scenarios, especially when implicit signals as feedback are missing or noisy. This reveals when and how to exploit brain signals in the context of RF.

brain signal, participant, relevance score, (14 more...)

arXiv.org Artificial Intelligence

2312.05669

Country:

Asia > China > Beijing > Beijing (0.04)
Oceania > Australia (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
(5 more...)

Add feedback