AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

Jeronymo, Vitor, Nascimento, Mauricio, Lotufo, Roberto, Nogueira, Rodrigo

arXiv.org Artificial Intelligence, Sep-27-2022

Robust 2004 is an information retrieval benchmark whose large number of judgments per query makes it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 that was translated to 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset.

artificial intelligence, information retrieval, natural language, (17 more...)

arXiv.org Artificial Intelligence

2209.13738

Country:

South America > Brazil > São Paulo (0.05)
North America > United States > Maryland > Montgomery County > Gaithersburg (0.05)

Genre: Research Report (0.41)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.54)

What is a search engine result page?

#artificialintelligence, Sep-26-2022, 08:32:41 GMT

With the development of the internet, the world is rapidly moving toward digitalization. Search engine optimization (SEO) has become an important digital skill. Our website, Digital Skills PK, continues to share articles on the basics of digital skills that beginners can benefit from. In this article, we will talk about Google search engine results pages (SERPs).

artificial intelligence, information retrieval, natural language, (18 more...)

#artificialintelligence

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

Shetty, Pranav, Rajan, Arunkumar Chitteth, Kuenneth, Christopher, Gupta, Sonakshi, Panchumarti, Lakshmi Prerana, Holm, Lauren, Zhang, Chao, Ramprasad, Rampi

arXiv.org Artificial Intelligence, Sep-26-2022

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets when used as the encoder for text. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org which can be used to locate material property data recorded in abstracts conveniently. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.

information retrieval, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1038/s41524-023-01003-w

2209.13136

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry:

Energy > Renewable (0.71)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > Polymers & Plastics (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Access to care: analysis of the geographical distribution of healthcare using Linked Open Data

Santamaria, Selene Baez, Manousogiannis, Emmanouil, Boomgaard, Guusje, Tran, Linh P., Szlavik, Zoltan, Sips, Robert-Jan

arXiv.org Artificial Intelligence, Sep-26-2022

Background: Access to medical care is strongly dependent on resource allocation, such as the geographical distribution of medical facilities. Nevertheless, this data is usually restricted to country official documentation, not available to the public. While some medical facilities' data is accessible as semantic resources on the Web, it is not consistent in its modeling and has yet to be integrated into a complete, open, and specialized repository. This work focuses on generating a comprehensive semantic dataset of medical facilities worldwide containing extensive information about such facilities' geo-location. Results: For this purpose, we collect, align, and link various open-source databases where medical facilities' information may be present. This work allows us to evaluate each data source along various dimensions, such as completeness, correctness, and interlinking with other sources, all critical aspects of current knowledge representation technologies. Conclusions: Our contributions directly benefit stakeholders in the biomedical and health domain (patients, healthcare professionals, companies, regulatory authorities, and researchers), who will now have a better overview of the access to and distribution of medical facilities.

artificial intelligence, information retrieval, natural language, (19 more...)

arXiv.org Artificial Intelligence

2204.05206

Country:

Europe > Netherlands > North Holland > Amsterdam (0.05)
North America > United States > Michigan (0.04)
Europe > United Kingdom (0.04)
(7 more...)

Genre: Research Report > Experimental Study (0.52)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Health Care Providers & Services (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Promptagator: Few-shot Dense Retrieval From 8 Examples

Dai, Zhuyun, Zhao, Vincent Y., Ma, Ji, Luan, Yi, Ni, Jianmo, Lu, Jing, Bakalov, Anton, Guu, Kelvin, Hall, Keith B., Chang, Ming-Wei

arXiv.org Artificial Intelligence, Sep-23-2022

Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we propose working on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO like ColBERT v2 (Santhanam et al., 2022) by more than 1.2 nDCG on average on 11 retrieval sets. Further training standard-size re-rankers using the same generated data yields another 5.0 point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.

Recently, major progress has been made on neural retrieval models such as dual encoders, which can retrieve knowledge from a large collection of documents containing millions to billions of passages (Yih et al., 2011; Lee et al., 2019; Karpukhin et al., 2020). However, Thakur et al. (2021) recently proposed the BEIR heterogeneous retrieval benchmark, and showed that it is still difficult for neural retrievers to perform well on a wide variety of retrieval tasks that lack dedicated training data. Thus, previous approaches focus on transferring knowledge from question answering (QA) datasets such as MS MARCO (Nguyen et al., 2016). To best transfer from QA datasets, expressive retrievers are developed that allow fine-grained token-level interaction, such as ColBERT (Khattab & Zaharia, 2020; Santhanam et al., 2022) and SPLADE (Formal et al., 2021), but with higher inference cost.
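The nDCG figures quoted above can be made concrete with a minimal sketch of the metric. This is not the paper's evaluation code: it assumes linear gain (the relevance grade itself rather than 2^rel - 1), computes the ideal ranking from the retrieved list only, and uses made-up relevance labels. Benchmark papers usually report nDCG on a 0-100 scale, so a "1.2 point" gain corresponds to 0.012 on the 0-1 scale used here.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked results (linear gain)."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same labels sorted in descending order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels, in the order a retriever ranked the documents.
ranked_relevances = [3, 2, 0, 1]
print(round(ndcg_at_k(ranked_relevances, k=4), 4))
```

A perfectly ordered list scores 1.0, and a list with no relevant documents is defined here to score 0.0.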

large language model, machine learning, question answering, (19 more...)

arXiv.org Artificial Intelligence

2209.11755

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > Dominican Republic (0.04)
(9 more...)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.66)

AIR-JPMC@SMM4H'22: Classifying Self-Reported Intimate Partner Violence in Tweets with Multiple BERT-based Models

Candidato, Alec, Gupta, Akshat, Liu, Xiaomo, Shah, Sameena

arXiv.org Artificial Intelligence, Sep-21-2022

This paper presents our submission for the SMM4H 2022 Shared Task on the classification of self-reported intimate partner violence on Twitter (in English). The goal of this task was to accurately determine if the contents of a given tweet demonstrated someone reporting their own experience with intimate partner violence. The submitted system is an ensemble of five RoBERTa models, each weighted by its F1-score on the validation dataset. This system performed 13% better than the baseline and was the best performing system overall for this shared task.
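One way to read "an ensemble weighted by F1-scores" is a weighted average of each model's positive-class probability. The sketch below illustrates that reading only. The abstract does not say whether the authors combine probabilities, logits, or votes, so the combination rule, the 0.5 threshold, and all numbers are assumptions.

```python
from typing import List, Sequence


def weighted_ensemble(
    probabilities: Sequence[Sequence[float]], f1_scores: Sequence[float]
) -> List[float]:
    """Average per-model positive-class probabilities, weighting each model by its validation F1."""
    if len(probabilities) != len(f1_scores):
        raise ValueError("need exactly one F1 score per model")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1 scores must sum to a positive value")
    n_examples = len(probabilities[0])
    return [
        sum(w * model_probs[i] for w, model_probs in zip(f1_scores, probabilities)) / total
        for i in range(n_examples)
    ]


def classify(
    probabilities: Sequence[Sequence[float]],
    f1_scores: Sequence[float],
    threshold: float = 0.5,  # assumed decision threshold
) -> List[int]:
    """Label an example 1 when the weighted probability reaches the threshold."""
    return [int(p >= threshold) for p in weighted_ensemble(probabilities, f1_scores)]


# Hypothetical outputs of five models on three tweets, plus hypothetical validation F1 scores.
model_probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.5],
    [0.9, 0.2, 0.3],
    [0.6, 0.4, 0.7],
]
validation_f1 = [0.82, 0.80, 0.79, 0.83, 0.78]
print(classify(model_probs, validation_f1))
```

Weighting by validation F1 lets stronger models count for more without discarding weaker ones, which is the design choice the abstract describes at a high level.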

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2209.10763

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.15)
North America > Mexico > Mexico City > Mexico City (0.05)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.30)

Semantic Structure based Query Graph Prediction for Question Answering over Knowledge Graph

Li, Mingchen, Ji, Shihao

arXiv.org Artificial Intelligence, Sep-21-2022

Building query graphs from natural language questions is an important step in complex question answering over knowledge graphs (Complex KGQA). In general, a question can be correctly answered if its query graph is built correctly and the right answer is then retrieved by issuing the query graph against the KG. Therefore, this paper focuses on query graph generation from natural language questions. Existing approaches for query graph generation ignore the semantic structure of a question, resulting in a large number of noisy query graph candidates that undermine prediction accuracy. In this paper, we define six semantic structures from common questions in KGQA and develop a novel Structure-BERT to predict the semantic structure of a question. By doing so, we can first filter out noisy candidate query graphs, and then rank the remaining candidates with a BERT-based ranking model. Extensive experiments on two popular benchmarks, MetaQA and WebQuestionsSP (WSP), demonstrate the effectiveness of our method compared to state-of-the-art approaches.
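The filter-then-rank idea described above can be sketched in a few lines. Everything concrete here is hypothetical: the `Candidate` type, the structure labels, and the scores stand in for the outputs of the paper's Structure-BERT and BERT-based ranker, which are not reproduced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A candidate query graph (hypothetical representation)."""
    query: str          # placeholder for the graph itself
    structure: str      # semantic-structure label of this candidate
    ranker_score: float  # score a ranking model would assign (made up here)


def select_query_graph(
    candidates: List[Candidate], predicted_structure: str
) -> Optional[Candidate]:
    """Drop candidates whose structure differs from the prediction, then pick the top-ranked one."""
    kept = [c for c in candidates if c.structure == predicted_structure]
    if not kept:
        return None  # assumption: no fallback when the filter removes every candidate
    return max(kept, key=lambda c: c.ranker_score)


candidates = [
    Candidate("film -> director", "one-hop", 0.91),
    Candidate("film -> actor -> birthplace", "two-hop", 0.97),
    Candidate("film -> genre", "one-hop", 0.40),
]
best = select_query_graph(candidates, predicted_structure="one-hop")
print(best.query if best else "no candidate")
```

Note that the two-hop candidate is discarded despite having the highest ranker score, which is the point of filtering by predicted structure before ranking.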

artificial intelligence, natural language, query graph, (17 more...)

arXiv.org Artificial Intelligence

2204.10194

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
North America > United States > Colorado > Weld County > Greeley (0.04)

Genre: Research Report (0.50)

Industry:

Media > Film (0.69)
Leisure & Entertainment (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)

Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases

Aydin, Gizem, Tabatabaei, Seyed Amin, Tsatsaronis, Giorgios, Hasibi, Faegheh

arXiv.org Artificial Intelligence, Sep-20-2022

Automatic extraction of funding information from academic articles adds significant value to industry and research communities, such as tracking research outcomes by funding organizations, profiling researchers and universities based on the received funding, and supporting open access policies. Two major challenges of identifying and linking funding entities are: (i) sparse graph structure of the Knowledge Base (KB), which makes the commonly used graph-based entity linking approaches suboptimal for the funding domain, (ii) missing entities in KB, which (unlike recent zero-shot approaches) requires marking entity mentions without KB entries as NIL. We propose an entity linking model that can perform NIL prediction and overcome data scarcity issues in a time and data-efficient manner. Our model builds on a transformer-based mention detection and bi-encoder model to perform entity linking. We show that our model outperforms strong existing baselines.
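To illustrate what NIL prediction means in a bi-encoder setting, the sketch below scores a mention embedding against entity embeddings with a dot product and returns NIL (`None`) when the best score falls below a threshold. The thresholding rule, the vectors, and the entity identifiers are assumptions for illustration; the abstract does not describe how the authors' model decides on NIL.

```python
from typing import Dict, Optional, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def link_mention(
    mention_vec: Sequence[float],
    entity_vecs: Dict[str, Sequence[float]],
    nil_threshold: float,  # assumed cut-off below which the mention is treated as NIL
) -> Tuple[Optional[str], float]:
    """Return (entity_id, score) for the best match, or (None, score) when the mention is NIL."""
    best_id: Optional[str] = None
    best_score = float("-inf")
    for entity_id, vec in entity_vecs.items():
        score = dot(mention_vec, vec)
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_id is None or best_score < nil_threshold:
        return None, best_score
    return best_id, best_score


# Hypothetical two-dimensional embeddings for two funders in the knowledge base.
kb = {"funder_a": [1.0, 0.0], "funder_b": [0.0, 1.0]}
print(link_mention([0.9, 0.1], kb, nil_threshold=0.5))
print(link_mention([0.2, 0.1], kb, nil_threshold=0.5))
```

In practice the threshold would be tuned on held-out data, since setting it too high marks known funders as NIL and setting it too low links unknown funders to the wrong entry.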

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2209.00351

Country:

North America > United States (0.14)
Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.62)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.48)
(2 more...)

Debiasing Gender Bias in Information Retrieval Models

Sundararaman, Dhanasekar, Subramanian, Vivek

arXiv.org Artificial Intelligence, Sep-20-2022

Biases in culture, gender, ethnicity, etc. have existed for decades and have affected many areas of human social interaction. These biases have been shown to impact machine learning (ML) models, and for natural language processing (NLP), this can have severe consequences for downstream tasks. Mitigating gender bias in information retrieval (IR) is important to avoid propagating stereotypes. In this work, we employ a dataset consisting of two components: (1) relevance of a document to a query and (2) "gender" of a document, in which pronouns are replaced by male, female, and neutral conjugations. We definitively show that pre-trained models for IR do not perform well in zero-shot retrieval tasks when full fine-tuning of a large pre-trained BERT encoder is performed, and that lightweight fine-tuning performed with adapter networks improves zero-shot retrieval performance by almost 20% over the baseline. We also illustrate that pre-trained models have gender biases that result in retrieved articles tending to be more often male than female. We overcome this by introducing a debiasing technique that penalizes the model when it prefers males over females, resulting in an effective model that retrieves articles in a balanced fashion across genders.

artificial intelligence, information retrieval, natural language, (16 more...)

arXiv.org Artificial Intelligence

2208.01755

Country:

North America > United States > North Carolina > Durham County > Durham (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging

Rosenbaum, Andy, Soltan, Saleh, Hamza, Wael, Versley, Yannick, Boese, Markus

arXiv.org Artificial Intelligence, Sep-20-2022

We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvement for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST outperforms a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.

information retrieval, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2209.099

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Aachen (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(15 more...)

Genre: Research Report > Promising Solution (0.47)

Industry:

Leisure & Entertainment > Sports > Hockey (1.00)
Leisure & Entertainment > Sports > Baseball (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
(2 more...)
