AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

N-Best Hypotheses Reranking for Text-To-SQL Systems

Zeng, Lu, Parthasarathi, Sree Hari Krishnan, Hakkani-Tur, Dilek

arXiv.org Artificial Intelligence, Oct-19-2022

The Text-to-SQL task maps natural language utterances to structured queries that can be issued to a database. State-of-the-art (SOTA) systems rely on finetuning large, pre-trained language models in conjunction with constrained decoding that applies a SQL parser. On the well-established Spider dataset, we begin with Oracle studies: specifically, choosing an Oracle hypothesis from a SOTA model's 10-best list yields a $7.7\%$ absolute improvement in both exact match (EM) and execution (EX) accuracy, showing significant potential improvements with reranking. Identifying coherence and correctness as reranking approaches, we design a model generating a query plan and propose a heuristic schema linking algorithm. Combining both approaches, with T5-Large, we obtain a consistent $1\%$ improvement in EM accuracy and a $\sim 2.5\%$ improvement in EX, establishing a new SOTA for this task. Our comprehensive error studies on DEV data show the underlying difficulty in making progress on this task.
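The Oracle study described above can be illustrated with a minimal sketch. It compares top-1 accuracy with the accuracy obtained if the best hypothesis in each n-best list were always chosen; the gap is the headroom available to a reranker. The function names and toy queries are hypothetical, and plain string equality stands in for the benchmark's exact-match metric, which is an assumption made for brevity.

```python
from typing import List


def top1_accuracy(nbest_lists: List[List[str]], gold: List[str]) -> float:
    """Fraction of examples whose first-ranked hypothesis equals the gold query."""
    hits = sum(1 for hyps, g in zip(nbest_lists, gold) if hyps and hyps[0] == g)
    return hits / len(gold)


def oracle_accuracy(nbest_lists: List[List[str]], gold: List[str]) -> float:
    """Fraction of examples where the gold query appears anywhere in the n-best list."""
    hits = sum(1 for hyps, g in zip(nbest_lists, gold) if g in hyps)
    return hits / len(gold)


# Toy 2-best lists (hypothetical data, not taken from Spider).
nbest = [
    ["SELECT name FROM singer", "SELECT name FROM singers"],
    ["SELECT count(*) FROM cars", "SELECT count(*) FROM car"],
    ["SELECT id FROM t", "SELECT id FROM u"],
]
gold = [
    "SELECT name FROM singer",
    "SELECT count(*) FROM car",
    "SELECT x FROM t",
]

# Headroom: how much a perfect reranker could add over the top-1 choice.
headroom = oracle_accuracy(nbest, gold) - top1_accuracy(nbest, gold)
print(f"top-1={top1_accuracy(nbest, gold):.3f} "
      f"oracle={oracle_accuracy(nbest, gold):.3f} headroom={headroom:.3f}")
```

In this toy set the second example is recoverable by reranking and the third is not, which is why the oracle score stays below 1.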

information retrieval, machine learning, natural language, (20 more...)

2210.10668

Country:

North America > United States (0.14)
Europe > Netherlands > Gelderland (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.32)

Whole Page Unbiased Learning to Rank

Mao, Haitao, Zou, Lixin, Zheng, Yujia, Tang, Jiliang, Chu, Xiaokai, Zhao, Jiashu, Yin, Dawei

arXiv.org Artificial Intelligence, Oct-19-2022

Page presentation bias in information retrieval systems, especially its effect on click behavior, is a well-known challenge that hinders improving ranking models' performance with implicit user feedback. Unbiased Learning to Rank (ULTR) algorithms have been proposed to learn an unbiased ranking model from biased click data. However, most existing algorithms are specifically designed to mitigate position-related bias, e.g., trust bias, without considering biases induced by other features in search result page presentation (SERP). For example, the multimedia type may generate attractive bias. Unfortunately, those biases widely exist in industrial systems and may lead to an unsatisfactory search experience. Therefore, we introduce a new problem, i.e., whole-page Unbiased Learning to Rank (WP-ULTR), aiming to handle biases induced by whole-page SERP features simultaneously. It presents tremendous challenges. For example, a suitable user behavior model (user behavior hypothesis) can be hard to find, and complex biases cannot be handled by existing algorithms. To address the above challenges, we propose a Bias Agnostic whole-page unbiased Learning to rank algorithm, BAL, to automatically discover and mitigate the biases from multiple SERP features with no specific design. Experimental results on a real-world dataset verify the effectiveness of BAL.

information retrieval, machine learning, natural language, (17 more...)

2210.10718

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Myanmar > Tanintharyi Region > Dawei (0.05)
Asia > China (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Towards Proactive Information Retrieval in Noisy Text with Wikipedia Concepts

Ahmed, Tabish, Bulathwela, Sahan

arXiv.org Artificial Intelligence, Oct-18-2022

Extracting useful information from the user history to clearly understand informational needs is a crucial feature of a proactive information retrieval system. Regarding understanding information and relevance, Wikipedia can provide the background knowledge that an intelligent system needs. This work explores how exploiting the context of a query using Wikipedia concepts can improve proactive information retrieval on noisy text. We formulate two models that use entity linking to associate Wikipedia topics with the relevance model. Our experiments on a podcast segment retrieval task demonstrate that there is a clear signal of relevance in Wikipedia concepts and that a ranking model can improve precision by incorporating them. We also find that Wikifying the background context of a query can help disambiguate the meaning of the query, further helping proactive information retrieval.

artificial intelligence, information retrieval, natural language, (14 more...)

2210.09877

Country:

North America > United States > Maryland (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > Dominican Republic (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.94)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

Zhang, Xinyu, Thakur, Nandan, Ogundepo, Odunayo, Kamalloo, Ehsan, Alfonso-Hermelo, David, Li, Xiaoguang, Liu, Qun, Rezagholizadeh, Mehdi, Lin, Jimmy

arXiv.org Artificial Intelligence, Oct-18-2022

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources -- including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.

information retrieval, machine learning, natural language, (18 more...)

2210.09984

Country:

North America > Canada > Alberta (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > Dominican Republic (0.04)
(7 more...)

Genre:

Research Report (0.40)
Overview (0.34)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.86)

RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

Xiao, Shitao, Liu, Zheng, Shao, Yingxia, Cao, Zhao

arXiv.org Artificial Intelligence, Oct-17-2022

Despite pre-training's progress in many important NLP tasks, effective pre-training strategies for dense retrieval remain to be explored. In this paper, we propose RetroMAE, a new retrieval-oriented pre-training paradigm based on Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted with different masks for the encoder and the decoder. The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling. 2) An asymmetric model structure, with a full-scale BERT-like transformer as the encoder and a one-layer transformer as the decoder. 3) Asymmetric masking ratios, with a moderate ratio for the encoder (15-30%) and an aggressive ratio for the decoder (50-70%). Our framework is simple to realize and empirically competitive: the pre-trained models dramatically improve the SOTA performances on a wide range of dense retrieval benchmarks, like BEIR and MS MARCO. The source code and pre-trained models are made publicly available at https://github.com/staoxiao/RetroMAE so as to inspire more interesting research.
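The asymmetric masking idea can be sketched in a few lines. The example below produces two differently masked views of the same token sequence, one at a moderate ratio for the encoder and one at an aggressive ratio for the decoder. It is an illustration only: the helper names, the whitespace tokens, and the choice to sample the two masks independently are assumptions, and the released RetroMAE code may differ in all of these.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], ratio: float, rng: random.Random) -> List[str]:
    """Replace round(len(tokens) * ratio) randomly chosen positions with MASK."""
    n_mask = round(len(tokens) * ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def asymmetric_views(
    tokens: List[str],
    enc_ratio: float = 0.3,  # moderate ratio, within the 15-30% range from the abstract
    dec_ratio: float = 0.5,  # aggressive ratio, within the 50-70% range from the abstract
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (encoder_input, decoder_input), masked independently (an assumption)."""
    rng = random.Random(seed)
    return mask_tokens(tokens, enc_ratio, rng), mask_tokens(tokens, dec_ratio, rng)


tokens = "dense retrieval maps queries and documents into one shared vector space".split()
tokens = tokens[:10]  # keep exactly ten tokens so the mask counts are easy to check
enc_view, dec_view = asymmetric_views(tokens)
print(enc_view)
print(dec_view)
```

The decoder view hides more of the sentence than the encoder view, so reconstruction has to lean on the sentence embedding rather than on visible context.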

information retrieval, machine learning, natural language, (18 more...)

2205.12035

Country:

Europe > Northern Europe (0.04)
Asia > China > Beijing > Beijing (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.84)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)

Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer

Pęzik, Piotr, Mikołajczyk-Bareła, Agnieszka, Wawrzyński, Adam, Nitoń, Bartłomiej, Ogrodniczuk, Maciej

arXiv.org Artificial Intelligence, Oct-17-2022

The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), which is released with this paper: a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project. We compare the results obtained by four different methods, namely plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on the POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.

artificial intelligence, information retrieval, natural language, (13 more...)

2209.14008

Country:

Europe > Germany (0.14)
Europe > Poland > Świętokrzyskie Province > Kielce (0.04)
Europe > Italy > Sicily (0.04)
(10 more...)

Genre: Research Report (1.00)

Industry: Media > News (0.35)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)

Overview of BioASQ 2022: The tenth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

Nentidis, Anastasios, Katsimpras, Georgios, Vandorou, Eirini, Krithara, Anastasia, Miranda-Escalada, Antonio, Gasco, Luis, Krallinger, Martin, Paliouras, Georgios

arXiv.org Artificial Intelligence, Oct-13-2022

This paper presents an overview of the tenth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2022. BioASQ is an ongoing series of challenges that promotes advances in the domain of large-scale biomedical semantic indexing and question answering. In this edition, the challenge was composed of the three established tasks a, b, and Synergy, and a new task named DisTEMIST for automatic semantic annotation and grounding of diseases from clinical content in Spanish, a key concept for semantic indexing and search engines of literature and clinical records. This year, BioASQ received more than 170 distinct systems from 38 teams in total for the four different tasks of the challenge. As in previous years, the majority of the competing systems outperformed the strong baselines, indicating the continuous advancement of the state-of-the-art in this domain.

information retrieval, machine learning, question answering, (23 more...)

doi: 10.1007/978-3-031-13643-6_22

2210.06852

Country:

Europe > Portugal > Aveiro > Aveiro (0.04)
Europe > Greece > Central Macedonia > Thessaloniki (0.04)
South America > Argentina (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview (0.86)

Industry: Health & Medicine > Health Care Technology > Medical Record (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Query Expansion Using Contextual Clue Sampling with Language Models

Liu, Linqing, Li, Minghan, Lin, Jimmy, Riedel, Sebastian, Stenetorp, Pontus

arXiv.org Artificial Intelligence, Oct-13-2022

Query expansion is an effective approach for mitigating vocabulary mismatch between queries and documents in information retrieval. One recent line of research uses language models to generate query-related contexts for expansion. Along this line, we argue that expansion terms from these contexts should balance two key aspects: diversity and relevance. The obvious way to increase diversity is to sample multiple contexts from the language model. However, this comes at the cost of relevance, because there is a well-known tendency of models to hallucinate incorrect or irrelevant contexts. To balance these two considerations, we propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context. Our lexical matching based approach achieves a similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR, while reducing the index size by more than 96%. For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
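One way to picture fusion based on generation probability is a weighted sum of retrieval scores, sketched below. Each sampled context retrieves its own ranked list, and each document's fused score is the sum of its per-context scores weighted by that context's normalized generation probability. The weighted-sum rule, the function name, and all numbers are assumptions for illustration; the abstract does not specify the exact fusion formula.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fuse_by_generation_probability(
    results: Dict[str, List[Tuple[str, float]]],
    probs: Dict[str, float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Fuse per-context ranked lists into one list.

    results maps a context id to [(doc_id, retrieval_score), ...];
    probs maps the same context id to its generation probability.
    """
    total = sum(probs.values())
    fused: Dict[str, float] = defaultdict(float)
    for ctx, ranked in results.items():
        weight = probs[ctx] / total  # normalize so the weights sum to 1
        for doc_id, score in ranked:
            fused[doc_id] += weight * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical contexts, probabilities, and retrieval scores.
probs = {"c1": 0.6, "c2": 0.3, "c3": 0.1}
results = {
    "c1": [("d1", 2.0), ("d2", 1.0)],
    "c2": [("d2", 3.0), ("d3", 1.0)],
    "c3": [("d3", 5.0)],
}
fused = fuse_by_generation_probability(results, probs)
print([doc_id for doc_id, _ in fused])
```

Here document d3 has the highest single retrieval score, but it comes mostly from a low-probability context, so it ends up ranked below documents supported by likelier contexts.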

artificial intelligence, contextual clue, natural language, (14 more...)

2210.07093

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Dominican Republic (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.61)

PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction

Schopf, Tim, Klimek, Simon, Matthes, Florian

arXiv.org Artificial Intelligence, Oct-12-2022

Keyphrase extraction is the process of automatically selecting a small set of most relevant phrases from a given text. Supervised keyphrase extraction approaches need large amounts of labeled training data and perform poorly outside the domain of the training data. In this paper, we present PatternRank, which leverages pretrained language models and part-of-speech for unsupervised keyphrase extraction from single documents. Our experiments show PatternRank achieves higher precision, recall and F1-scores than previous state-of-the-art approaches. In addition, we present the KeyphraseVectorizers package, which allows easy modification of part-of-speech patterns for candidate keyphrase selection, and hence adaptation of our approach to any domain.
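Candidate selection by part-of-speech pattern can be sketched without any NLP library. The example below takes tokens that are already tagged and extracts maximal spans of zero or more adjectives followed by one or more nouns. The pattern, the tag names, and the function name are assumptions for illustration; this is not the KeyphraseVectorizers API, and the ranking of candidates with a pretrained language model is left out.

```python
from typing import List, Tuple


def candidate_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Extract maximal spans matching ADJ* NOUN+ from (word, tag) pairs."""
    phrases: List[str] = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] == "ADJ":  # optional leading adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # one or more nouns required
            k += 1
        if k > j:
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            # No noun followed: skip past the adjectives, or past one other token.
            i = max(j, i + 1)
    return phrases


# Hand-tagged toy sentence (hypothetical; a real pipeline would use a POS tagger).
tagged_sentence = [
    ("unsupervised", "ADJ"), ("keyphrase", "NOUN"), ("extraction", "NOUN"),
    ("uses", "VERB"),
    ("pretrained", "ADJ"), ("language", "NOUN"), ("models", "NOUN"),
]
candidates = candidate_phrases(tagged_sentence)
print(candidates)
```

Changing the tag checks in the two inner loops is the only edit needed to try a different pattern, which mirrors the adaptability the abstract attributes to pattern-based candidate selection.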

information retrieval, machine learning, natural language, (17 more...)

doi: 10.5220/0011546600003335

2210.05245

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Step out of KG: Knowledge Graph Completion via Knowledgeable Retrieval and Reading Comprehension

Lv, Xin, Lin, Yankai, Yao, Zijun, Zeng, Kaisheng, Zhang, Jiajie, Hou, Lei, Li, Juanzi

arXiv.org Artificial Intelligence, Oct-12-2022

Knowledge graphs, as the cornerstone of many AI applications, usually face serious incompleteness problems. In recent years, there have been many efforts to study automatic knowledge graph completion (KGC), most of which use existing knowledge to infer new knowledge. However, in our experiments, we find that not all relations can be obtained by inference, which constrains the performance of existing models. To alleviate this problem, we propose a new model based on information retrieval and reading comprehension, namely IR4KGC. Specifically, we pre-train a knowledge-based information retrieval module that can retrieve documents related to the triples to be completed. Then, the retrieved documents are handed over to the reading comprehension module to generate the predicted answers. In experiments, we find that our model handles relations that cannot be inferred from existing knowledge well and achieves good results on KGC datasets.

information retrieval, natural language, relation, (15 more...)

2210.05921

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > China > Beijing > Beijing (0.04)
(5 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Leisure & Entertainment > Sports (1.00)
Education > Assessment & Standards > Student Performance (0.81)
Media > Film (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.83)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.55)
