AITopics | Information Retrieval


Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.


Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension

Maharana, Adyasha, Bansal, Mohit

arXiv.org Artificial Intelligence, May-1-2020

Reading comprehension models often overfit to nuances of training datasets and fail at adversarial evaluation. Training with an adversarially augmented dataset improves robustness against those adversarial attacks but hurts generalization of the models. In this work, we present several effective adversaries and automated data augmentation policy search methods with the goal of making reading comprehension models more robust to adversarial evaluation, while also improving generalization to the source domain as well as new domains and languages. We first propose three new methods for generating QA adversaries that introduce multiple points of confusion within the context, show dependence on insertion location of the distractor, and reveal the compounding effect of mixing adversarial strategies with syntactic and semantic paraphrasing methods. Next, we find that augmenting the training datasets with uniformly sampled adversaries improves robustness to the adversarial attacks but leads to a decline in performance on the original unaugmented dataset. We address this issue via RL and more efficient Bayesian policy search methods for automatically learning the best augmentation policy combinations of the transformation probability for each adversary in a large search space. Using these learned policies, we show that adversarial training can lead to significant improvements in in-domain, out-of-domain, and cross-lingual (German, Russian, Turkish) generalization without any use of training data from the target domain or language.
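To make the idea of "a transformation probability for each adversary" concrete, here is a minimal sketch of applying an augmentation policy to one context passage. The two transformations (`append_distractor`, `uppercase_first_word`) and the `augment` helper are hypothetical stand-ins, not the adversaries or search procedure from the paper; a policy search method would tune the probabilities, which are fixed by hand here.

```python
import random

# Hypothetical toy transformations; the paper's adversaries are more elaborate.
def append_distractor(context):
    """Append a placeholder distractor sentence to the context."""
    return context + " [distractor sentence]"

def uppercase_first_word(context):
    """Upper-case the first word, standing in for a surface-level perturbation."""
    head, _, tail = context.partition(" ")
    return head.upper() + (" " + tail if tail else "")

def augment(context, policy, rng):
    """Apply each transformation independently with its policy probability."""
    for transform, prob in policy:
        if rng.random() < prob:
            context = transform(context)
    return context

# A policy is a list of (transformation, probability) pairs.
# Probabilities of 1.0 and 0.0 are used so the result is deterministic.
policy = [(append_distractor, 1.0), (uppercase_first_word, 0.0)]
augmented = augment("Paris is the capital of France.", policy, random.Random(0))
```

With intermediate probabilities, each training example would receive a different random mix of transformations, which is the quantity a policy search would optimize.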

adversary, dataset, roberta base, (16 more...)

arXiv.org Artificial Intelligence

2004.06076

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(14 more...)

Genre: Research Report (1.00)

Industry:

Education > Assessment & Standards > Student Performance (0.82)
Information Technology > Security & Privacy (0.69)
Government > Military (0.55)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.89)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)


Rapidly Bootstrapping a Question Answering Dataset for COVID-19

Tang, Raphael, Nogueira, Rodrigo, Zhang, Edwin, Gupta, Nikhil, Cam, Phuong, Cho, Kyunghyun, Lin, Jimmy

arXiv.org Artificial Intelligence, Apr-23-2020

We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge. To our knowledge, this is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available. While this dataset, comprising 124 question-article pairs as of the present version 0.1 release, does not have sufficient examples for supervised machine learning, we believe that it can be helpful for evaluating the zero-shot or transfer capabilities of existing models on topics specifically related to COVID-19. This paper describes our methodology for constructing the dataset and presents the effectiveness of a number of baselines, including term-based techniques and various transformer-based models. The dataset is available at http://covidqa.ai/

dataset, effectiveness, natural language question, (14 more...)

arXiv.org Artificial Intelligence

2004.11339

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
(8 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.54)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)


A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation

Deriu, Jan, Mlynchyk, Katsiaryna, Schläpfer, Philippe, Rodrigo, Alvaro, von Grünigen, Dirk, Kaiser, Nicolas, Stockinger, Kurt, Agirre, Eneko, Cieliebak, Mark

arXiv.org Artificial Intelligence, Apr-16-2020

In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database called Operation Trees (OT). This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of query tokens to OT operations. In our method, we randomly generate OTs from a context-free grammar. Afterwards, annotators have to write the appropriate natural language question that is represented by the OT. Finally, the annotators assign the tokens to the OT operations. We apply the method to create a new corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.
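The step "randomly generate OTs from a context-free grammar" can be illustrated with a small sketch. The grammar below is a hypothetical toy with three operations (`PROJECT`, `FILTER`, `JOIN`) and it emits a flat token sequence; it is not the grammar or the tree representation used for OTTA.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternative expansions.
# The first alternative of SOURCE is non-recursive so generation can always terminate.
GRAMMAR = {
    "QUERY": [["PROJECT", "(", "SOURCE", ")"]],
    "SOURCE": [
        ["TABLE"],
        ["FILTER", "(", "SOURCE", ")"],
        ["JOIN", "(", "SOURCE", ",", "SOURCE", ")"],
    ],
}

def generate(symbol, rng, depth=0, max_depth=4):
    """Expand a symbol into a list of terminal tokens by random rule choice."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol: emit as-is
    alternatives = GRAMMAR[symbol]
    # Past max_depth, force the first alternative to bound the recursion.
    expansion = alternatives[0] if depth >= max_depth else rng.choice(alternatives)
    tokens = []
    for child in expansion:
        tokens.extend(generate(child, rng, depth + 1, max_depth))
    return tokens

rng = random.Random(0)  # seeded so repeated runs give the same sample
tree_tokens = generate("QUERY", rng)
```

In the inverted annotation workflow described above, a generated structure like this would be shown to an annotator, who writes the matching natural language question.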

database, operation, query, (16 more...)

arXiv.org Artificial Intelligence

2004.07633

Country:

Europe > France (0.04)
South America > Argentina (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(10 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.49)


Complaint-driven Training Data Debugging for Query 2.0

Wu, Weiyuan, Flokas, Lampros, Wu, Eugene, Wang, Jiannan

arXiv.org Artificial Intelligence, Apr-12-2020

As the need for machine learning (ML) increases rapidly across all industry sectors, there is significant interest among commercial database providers in supporting "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging since an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query's intermediate or final output, and aims to return a minimum set of training examples so that if they were removed, the complaints would be resolved. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions which both require linear retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness using four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.

query, query 2, training record, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3318464.3389696

2004.05722

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Oregon > Multnomah County > Portland (0.05)
(22 more...)

Genre: Research Report > New Finding (0.88)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.88)


The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews

Tutubalina, Elena, Alimova, Ilseyar, Miftahutdinov, Zulfat, Sakhovskiy, Andrey, Malykh, Valentin, Nikolenko, Sergey

arXiv.org Artificial Intelligence, Apr-7-2020

The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. The corpus itself consists of two parts, the raw one and the labelled one. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Labels for sentences indicate health-related issues or their absence. Sentences that contain a health-related issue are additionally labelled at the expression level for identification of fine-grained subtypes such as drug classes and drug forms, drug indications, and drug reactions. Further, we present a baseline model for named entity recognition (NER) and multi-label sentence classification tasks on this corpus. A macro F1 score of 74.85% in the NER task was achieved by our RuDR-BERT model. For the sentence classification task, our model achieves a macro F1 score of 68.82%, gaining 7.47% over the score of a BERT model trained on Russian data. We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC
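For readers unfamiliar with the macro F1 metric reported above, the following sketch computes it for single-label predictions as the unweighted mean of per-class F1. The labels and the `macro_f1` helper are illustrative assumptions; this is not the authors' evaluation script, and NER is usually scored at the entity level rather than per item as done here.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # true class g was missed
    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for illustration only.
gold = ["drug", "drug", "reaction", "none"]
pred = ["drug", "reaction", "reaction", "none"]
score = macro_f1(gold, pred)  # per-class F1: drug 2/3, none 1, reaction 2/3
```

Because every class contributes equally, macro F1 penalizes poor performance on rare classes more than accuracy or micro-averaged F1 would.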

annotator, corpus, russian drug reaction corpus, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1093/bioinformatics/btaa675

2004.03659

Country:

North America > United States (0.28)
Asia > Russia (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
(2 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.91)


Answering Complex Queries in Knowledge Graphs with Bidirectional Sequence Encoders

Kotnis, Bhushan, Lawrence, Carolin, Niepert, Mathias

arXiv.org Artificial Intelligence, Apr-6-2020

Representation learning for knowledge graphs (KGs) has focused on the problem of answering simple link prediction queries. In this work we address the more ambitious challenge of predicting the answers of conjunctive queries with multiple missing entities. We propose Bi-Directional Query Embedding (BiQE), a method that embeds conjunctive queries with models based on bi-directional attention mechanisms. Contrary to prior work, bidirectional self-attention can capture interactions among all the elements of a query graph. We introduce a new dataset for predicting the answers of conjunctive queries and conduct experiments that show BiQE significantly outperforming state-of-the-art baselines.

dataset, graph, query, (15 more...)

arXiv.org Artificial Intelligence

2004.02596

Country: Europe (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.65)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.36)


Knowledge Fusion and Semantic Knowledge Ranking for Open Domain Question Answering

Banerjee, Pratyay, Baral, Chitta

arXiv.org Artificial Intelligence, Apr-6-2020

Open Domain Question Answering requires systems to retrieve external knowledge and perform multi-hop reasoning by composing knowledge spread over multiple sentences. In the recently introduced open domain question answering challenge datasets, QASC and OpenBookQA, we need to perform retrieval of facts and compose facts to correctly answer questions. In our work, we learn a semantic knowledge ranking model to re-rank knowledge retrieved through Lucene-based information retrieval systems. We further propose a "knowledge fusion model" which leverages knowledge in BERT-based language models with externally retrieved knowledge and improves the knowledge understanding of the BERT-based language models. On both OpenBookQA and QASC datasets, the knowledge fusion model with semantically re-ranked knowledge outperforms previous attempts.

dataset, knowledge, semantic knowledge ranking, (12 more...)

arXiv.org Artificial Intelligence

2004.03101

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Arizona (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Genre: Research Report (0.40)

Industry: Education (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.86)
(2 more...)


How to provide relevant Search Results - Paperless Lab Academy

#artificialintelligence, Mar-31-2020, 15:56:20 GMT

The relevance of search results is essential for finding information. Indeed, a user will almost never look further than the first few results of a search engine. It is therefore necessary that the relevant information is ranked as high as possible so that the information sought by the user is found in the first results. The order, or "ranking", of search results is essential for search engines, which therefore use more or less complex algorithms to display first the results that users will find most relevant. The algorithms used by popular search engines are usually not publicly available.
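As an illustration of what a simple ranking algorithm can look like, the sketch below orders documents by a basic TF-IDF score for the query terms. This is a textbook technique chosen for illustration, with made-up documents and a hypothetical `rank` helper; it is not the algorithm of any particular search engine.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return document indices sorted by descending TF-IDF score for the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency; rarer terms weigh more.
        score = sum(
            tf[term] * math.log((n + 1) / (df[term] + 1))
            for term in query.lower().split()
        )
        scores.append(score)
    # sorted() is stable, so tied documents keep their original order.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "laboratory sample tracking software",
    "search engine ranking algorithms for laboratory documents",
    "cooking at home",
]
order = rank("ranking algorithms", docs)  # the second document matches both terms
```

Production systems combine many more signals, but the principle is the same: compute a relevance score per document and present the highest-scoring results first.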

algorithm, search engine, search result, (12 more...)

#artificialintelligence

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.30)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.87)


ExpertFile COVID-19 Search Engine Connects Journalists, Experts

#artificialintelligence, Mar-31-2020, 15:46:55 GMT

Curated Online Resource Puts Journalists a Click Away From Hundreds of Healthcare, Economic, Industry and Social Science Experts for Quick and Reliable Sources on the Current Coronavirus Pandemic. In response to unprecedented demand for expert sources and fact-based insights during the COVID-19 pandemic, ExpertFile has launched the COVID-19 Experts Search Engine, a specialized online resource designed to help newsrooms around the world access reliable experts to speak on a variety of topics related to the coronavirus. With millions affected worldwide by the COVID-19 pandemic, the dangers of misinformation and factual inaccuracy pose a potentially devastating impact on society. As the largest curated, open-access search engine of international expert sources, ExpertFile worked quickly and in close consultation with its members -- including healthcare professionals, university academics, NGOs, corporations, industry associations and journalists -- to build the COVID-19 Experts Search Engine. "Facts matter more than opinions when real lives are at stake. We understand that journalists need evidence-based information, and they need it quickly," said Peter Evans, Co-Founder & CEO of ExpertFile.

covid-19 expert search engine, expertfile, journalist, (7 more...)

#artificialintelligence

Country:

North America > United States > Colorado > Weld County > Evans (0.26)
North America > United States > New York (0.06)
North America > United States > California > Los Angeles County > Los Angeles (0.06)
(2 more...)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)


COVID-Consumers: Pessimistic, but spending more online - Search Engine Land

#artificialintelligence, Mar-26-2020, 02:37:50 GMT

Consumer sentiment has turned sharply negative as the virus has disrupted every aspect of daily American life. According to a consumer survey from Engine, 88% of consumers in the U.S. are now concerned about the pandemic. And according to another survey of roughly 2,600 U.S. adults from L.E.K. Consulting and Civis (.pdf), between 80% and 90% of adults expect a recession next year. In addition to measuring consumer sentiment, the survey explored how the coronavirus has shifted buying patterns across industries. Generally, the survey finds "significant increases in at-home activities, particularly cooking at home, watching television, browsing social media and exercising at home."

consumer, online, search engine land, (11 more...)

#artificialintelligence

Country: North America > United States (0.26)

Genre: Questionnaire & Opinion Survey (0.52)

Industry:

Health & Medicine (0.85)
Retail (0.53)

Technology:

Information Technology > Information Management > Search (0.85)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)
