Goto

Collaborating Authors

 Information Retrieval


mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

arXiv.org Artificial Intelligence

Robust 2004 is an information retrieval benchmark whose large number of judgments per query make it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 that was translated to 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset.


What is a search engine result page?

#artificialintelligence

With the development of the internet, the world is rapidly going towards digitalization. Search engine optimization (SEO) has become an important digital skill in the present day. Our website Digital Skills PK is continuing to share articles about basic knowledge of digital skills. Beginners can benefit from our articles. In this article, we will talk about the Google search engine results page(SERPs).


A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

arXiv.org Artificial Intelligence

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets when used as the encoder for text. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org which can be used to locate material property data recorded in abstracts conveniently. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.


Access to care: analysis of the geographical distribution of healthcare using Linked Open Data

arXiv.org Artificial Intelligence

Background: Access to medical care is strongly dependent on resource allocation, such as the geographical distribution of medical facilities. Nevertheless, this data is usually restricted to country official documentation, not available to the public. While some medical facilities' data is accessible as semantic resources on the Web, it is not consistent in its modeling and has yet to be integrated into a complete, open, and specialized repository. This work focuses on generating a comprehensive semantic dataset of medical facilities worldwide containing extensive information about such facilities' geo-location. Results: For this purpose, we collect, align, and link various open-source databases where medical facilities' information may be present. This work allows us to evaluate each data source along various dimensions, such as completeness, correctness, and interlinking with other sources, all critical aspects of current knowledge representation technologies. Conclusions: Our contributions directly benefit stakeholders in the biomedical and health domain (patients, healthcare professionals, companies, regulatory authorities, and researchers), who will now have a better overview of the access to and distribution of medical facilities.


Promptagator: Few-shot Dense Retrieval From 8 Examples

arXiv.org Artificial Intelligence

Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we suggest to work on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO like ColBERT v2 (Santhanam et al., 2022) by more than 1.2 nDCG on average on 11 retrieval sets. Further training standard-size re-rankers using the same generated data yields another 5.0 point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given. Recently, major progress has been made on neural retrieval models such as dual encoders, which can retrieve knowledge from a large collection of documents containing millions to billions of passages (Yih et al., 2011; Lee et al., 2019; Karpukhin et al., 2020). However, Thakur et al. (2021) recently proposed the BEIR heterogeneous retrieval benchmark, and showed that it is still difficult for neural retrievers to perform well on a wide variety of retrieval tasks that lack dedicated training data. Thus, previous approaches focus on transferring knowledge from question answering (QA) datasets such as MS MARCO (Nguyen et al., 2016). To best transfer from QA datasets, expressive retrievers are developed that allow fine-grained token-level interaction such as ColBERT (Khattab & Zaharia, 2020; Santhanam et al., 2022) and SPLADE (Formal et al., 2021) but with higher inference cost.


AIR-JPMC@SMM4H'22: Classifying Self-Reported Intimate Partner Violence in Tweets with Multiple BERT-based Models

arXiv.org Artificial Intelligence

This paper presents our submission for the SMM4H 2022-Shared Task on the classification of self-reported intimate partner violence on Twitter (in English). The goal of this task was to accurately determine if the contents of a given tweet demonstrated someone reporting their own experience with intimate partner violence. The submitted system is an ensemble of five RoBERTa models each weighted by their respective F1-scores on the validation data-set. This system performed 13% better than the baseline and was the best performing system overall for this shared task.


Semantic Structure based Query Graph Prediction for Question Answering over Knowledge Graph

arXiv.org Artificial Intelligence

Building query graphs from natural language questions is an important step in complex question answering over knowledge graph (Complex KGQA). In general, a question can be correctly answered if its query graph is built correctly and the right answer is then retrieved by issuing the query graph against the KG. Therefore, this paper focuses on query graph generation from natural language questions. Existing approaches for query graph generation ignore the semantic structure of a question, resulting in a large number of noisy query graph candidates that undermine prediction accuracies. In this paper, we define six semantic structures from common questions in KGQA and develop a novel Structure-BERT to predict the semantic structure of a question. By doing so, we can first filter out noisy candidate query graphs, and then rank the remaining candidates with a BERT-based ranking model. Extensive experiments on two popular benchmarks MetaQA and WebQuestionsSP (WSP) demonstrate the effectiveness of our method as compared to state-of-the-arts.


Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases

arXiv.org Artificial Intelligence

Automatic extraction of funding information from academic articles adds significant value to industry and research communities, such as tracking research outcomes by funding organizations, profiling researchers and universities based on the received funding, and supporting open access policies. Two major challenges of identifying and linking funding entities are: (i) sparse graph structure of the Knowledge Base (KB), which makes the commonly used graph-based entity linking approaches suboptimal for the funding domain, (ii) missing entities in KB, which (unlike recent zero-shot approaches) requires marking entity mentions without KB entries as NIL. We propose an entity linking model that can perform NIL prediction and overcome data scarcity issues in a time and data-efficient manner. Our model builds on a transformer-based mention detection and bi-encoder model to perform entity linking. We show that our model outperforms strong existing baselines.


Debiasing Gender Bias in Information Retrieval Models

arXiv.org Artificial Intelligence

Biases in culture, gender, ethnicity, etc. have existed for decades and have affected many areas of human social interaction. These biases have been shown to impact machine learning (ML) models, and for natural language processing (NLP), this can have severe consequences for downstream tasks. Mitigating gender bias in information retrieval (IR) is important to avoid propagating stereotypes. In this work, we employ a dataset consisting of two components: (1) relevance of a document to a query and (2) "gender" of a document, in which pronouns are replaced by male, female, and neutral conjugations. We definitively show that pre-trained models for IR do not perform well in zero-shot retrieval tasks when full fine-tuning of a large pre-trained BERT encoder is performed and that lightweight fine-tuning performed with adapter networks improves zero-shot retrieval performance almost by 20% over baseline. We also illustrate that pre-trained models have gender biases that result in retrieved articles tending to be more often male than female. We overcome this by introducing a debiasing technique that penalizes the model when it prefers males over females, resulting in an effective model that retrieves articles in a balanced fashion across genders.


LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging

arXiv.org Artificial Intelligence

We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvement for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST out-performs a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.