AITopics | Information Retrieval


Information Retrieval

Our accustomed systems for retrieving particular pieces of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where the information being sought is spread across widely different formats, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.


Ambiverse - an amazing open-source suite for natural language understanding

#artificialintelligence | Jun-5-2019, 10:57:57 GMT

While doing performance benchmarks for Named Entity Linking solutions for our AI/FinTech start-up Risklio, I stumbled upon a very powerful, only recently open-sourced framework called AmbiverseNLU. It was developed by Ambiverse and is based on work previously done at the Max Planck Institute¹. The components it uses are better known: entity recognition from KnowNER², open information extraction using ClausIE³, and AIDA, an entity detection and disambiguation tool⁴. You can have a look at the demo here. For the former one you can choose whether to use Apache Cassandra or PostgreSQL as a backend, while the last one uses Neo4j.

artificial intelligence, information retrieval, natural language, (17 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.42)
Information Technology > Artificial Intelligence > Natural Language > Understanding (0.40)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)


Terminology-based Text Embedding for Computing Document Similarities on Technical Content

Mirisaee, Hamid, Gaussier, Eric, Lagnier, Cedric, Guerraz, Agnes

arXiv.org Machine Learning | Jun-5-2019

We propose in this paper a new, hybrid document embedding approach to address the problem of document similarity with respect to technical content. To do so, we employ state-of-the-art graph techniques to first extract the keyphrases (composite keywords) of documents and then use them to score the sentences. Using the ranked sentences, we propose two approaches to embed documents and show their performance with respect to two baselines. With domain expert annotations, we illustrate that the proposed methods can find more relevant documents and outperform the baselines by up to 27% in terms of NDCG.
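For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of NDCG (normalized discounted cumulative gain). It assumes the common linear-gain formulation with a log2 rank discount; the paper's exact variant is not stated here, and the relevance grades below are hypothetical, not taken from the paper.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical expert grades (0 = irrelevant, 3 = highly relevant) for the
# documents returned by two systems, listed in ranked order.
baseline = [1, 0, 2, 0, 3]
proposed = [3, 2, 1, 0, 0]

print(round(ndcg(baseline, 5), 3), round(ndcg(proposed, 5), 3))
```

A ranking that places the most relevant documents first scores 1.0; pushing relevant documents further down the list lowers the score.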

information retrieval, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

1906.01874

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
North America > United States > Hawaii (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)


Distant Learning for Entity Linking with Automatic Noise Detection

Le, Phong, Titov, Ivan

arXiv.org Artificial Intelligence | Jun-4-2019

Accurate entity linkers have been produced for domains and languages where annotated data (i.e., texts linked to a knowledge base) is available. However, little progress has been made for the settings where no or very limited amounts of labeled data are present (e.g., legal or most scientific domains). In this work, we show how we can learn to link mentions without having any labeled examples, only a knowledge base and a collection of unannotated texts from the corresponding domain. In order to achieve this, we frame the task as a multi-instance learning problem and rely on surface matching to create initial noisy labels. As the learning signal is weak and our surrogate labels are noisy, we introduce a noise detection component in our model: it lets the model detect and disregard examples which are likely to be noisy. Our method, jointly learning to detect noise and link entities, greatly outperforms the surface matching baseline. For a subset of entity categories, it even approaches the performance of supervised learning.
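The idea of using surface matching to create initial noisy labels can be illustrated with a small sketch. It assumes exact, case-insensitive matching of a mention against entity names and aliases; the paper's actual matching heuristics are not described in this abstract, and the knowledge base ids and aliases below are hypothetical. Each mention receives a bag of candidate entities, which is the multi-instance view: the bag may contain the correct entity, but nothing says which one it is.

```python
from collections import defaultdict

def build_name_index(kb):
    """Map each lowercased entity name or alias to the set of entity ids using it."""
    index = defaultdict(set)
    for entity_id, names in kb.items():
        for name in names:
            index[name.lower()].add(entity_id)
    return index

def noisy_candidates(mention, index):
    """Return the bag of entities whose name matches the mention string exactly."""
    return sorted(index.get(mention.lower(), set()))

# Toy knowledge base (hypothetical ids and aliases)
kb = {
    "Q1": ["Paris", "City of Light"],
    "Q2": ["Paris", "Paris Hilton"],
    "Q3": ["Grenoble"],
}
index = build_name_index(kb)
bags = {m: noisy_candidates(m, index) for m in ["paris", "Grenoble", "Lyon"]}
print(bags)
```

An ambiguous mention yields a bag with several candidates, and a mention with no match yields an empty bag; both cases are sources of the label noise that the abstract says the model learns to detect.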

classifier, information retrieval, machine learning, (17 more...)

arXiv.org Artificial Intelligence

1905.07189

Country:

Europe (1.00)
North America > United States (0.95)
Asia (0.93)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)


Content Word-based Sentence Decoding and Evaluating for Open-domain Neural Response Generation

Zhao, Tianyu, Kawahara, Tatsuya

arXiv.org Artificial Intelligence | May-31-2019

Various encoder-decoder models have been applied to response generation in open-domain dialogs, but a majority of conventional models directly learn a mapping from lexical input to lexical output without explicitly modeling intermediate representations. Utilizing language hierarchy and modeling intermediate information have been shown to benefit many language understanding and generation tasks. Motivated by Broca's aphasia, we propose to use a content word sequence as an intermediate representation for open-domain response generation. Experimental results show that the proposed method improves content relatedness of produced responses, and our models can often choose correct grammar for generated content words. Meanwhile, instead of evaluating complete sentences, we propose to compute conventional metrics on content word sequences, which is a better indicator of content relevance.

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

1905.13438

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension

Talmor, Alon, Berant, Jonathan

arXiv.org Artificial Intelligence | May-31-2019

A large number of reading comprehension (RC) datasets has been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to generalization, and show that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019). We also find that training on multiple source RC datasets leads to robust generalization and transfer, and can reduce the cost of example collection for a new RC dataset. Following our analysis, we propose MultiQA, a BERT-based model, trained on multiple RC datasets, which leads to state-of-the-art performance on five RC datasets. We share our infrastructure for the benefit of the research community.

dataset, generalization, rc dataset, (16 more...)

arXiv.org Artificial Intelligence

1905.13453

Country: Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report (1.00)

Industry: Education > Assessment & Standards > Student Performance (0.62)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)


DiffQue: Estimating Relative Difficulty of Questions in Community Question Answering Services

Thukral, Deepak, Pandey, Adesh, Gupta, Rishabh, Goyal, Vikram, Chakraborty, Tanmoy

arXiv.org Machine Learning | May-31-2019

Automatic estimation of the relative difficulty of a pair of questions is an important and challenging problem in community question answering (CQA) services. Few studies have addressed this problem. Past studies mostly leveraged the expertise of users answering the questions and barely considered other properties of CQA services such as metadata of users and posts, temporal information and textual content. In this paper, we propose DiffQue, a novel system that maps this problem to a network-aided edge directionality prediction problem. DiffQue starts by constructing a novel network structure that captures different notions of difficulty between a pair of questions. It then measures the relative difficulty of two questions by predicting the direction of a (virtual) edge connecting these two questions in the network. It leverages features extracted from the network structure, metadata of users/posts and textual description of questions and answers. Experiments on datasets obtained from two CQA sites (further divided into four datasets) with human-annotated ground truth show that DiffQue outperforms four state-of-the-art methods by a significant margin (28.77% higher F1 score and 28.72% higher AUC than the best baseline). As opposed to the other baselines, (i) DiffQue appropriately responds to the training noise, (ii) DiffQue is capable of adapting to multiple domains (CQA datasets), and (iii) DiffQue can efficiently handle the 'cold start' problem which may arise due to the lack of information for newly posted questions or newly arrived users.

information retrieval, machine learning, question answering, (22 more...)

arXiv.org Machine Learning

1906.00145

Country: North America > United States (0.28)

Genre: Research Report > Promising Solution (0.48)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
(2 more...)


Learning to Route in Similarity Graphs

Baranchuk, Dmitry, Persiyanov, Dmitry, Sinitsin, Anton, Babenko, Artem

arXiv.org Machine Learning | May-27-2019

Recently similarity graphs became the leading paradigm for efficient nearest neighbor search, outperforming traditional tree-based and LSH-based methods. Similarity graphs perform the search via greedy routing: a query traverses the graph and in each vertex moves to the adjacent vertex that is the closest to this query. In practice, similarity graphs are often susceptible to local minima, when queries do not reach their nearest neighbors, getting stuck in suboptimal vertices. In this paper we propose to learn the routing function that overcomes local minima via incorporating information about the graph global structure. In particular, ...

The current approaches for efficient NNS mostly belong to three separate lines of research. The first family of methods, based on partition trees (Bentley, 1975; Sproull, 1991; McCartin-Lim et al., 2012; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2013), hierarchically split the search space into a large number of regions, corresponding to tree leaves, and a query visits only a limited number of promising regions when searching. The second, locality-sensitive hashing methods (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2008; Andoni et al., 2015) map the database points into a number of buckets using several hash functions such that the probability of collision is much higher for nearby points than for points that are further apart. At the search stage, a query is also hashed, and distances to all the points from the corresponding buckets are evaluated.
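The greedy routing procedure mentioned in this entry, and the local-minimum failure it can suffer from, can be sketched as follows. This is only an illustration of plain greedy routing with Euclidean distance, not the learned routing function the paper proposes; the toy graph, points and query are hypothetical.

```python
import math

def greedy_route(graph, points, query, start):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops when no neighbor is closer than the current vertex, which may be
    a local minimum rather than the true nearest neighbor.
    """
    current = start
    current_dist = math.dist(points[current], query)
    path = [current]
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(points[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, path
        current, current_dist = best, best_dist
        path.append(current)

# Hypothetical toy data: vertex 3 is the true nearest neighbor of the query,
# but it is only reachable from vertex 0 by first moving away from the query.
points = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (0.0, 3.0), 3: (3.0, 1.5)}
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
query = (3.0, 1.0)

print(greedy_route(graph, points, query, start=0))  # gets stuck at vertex 1
print(greedy_route(graph, points, query, start=2))  # reaches vertex 3
```

Starting from vertex 0, the walk moves to vertex 1 and stops there because its only neighbor is farther from the query, even though vertex 3 is closer overall.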

information retrieval, machine learning, vertex, (18 more...)

arXiv.org Machine Learning

1905.10987

Country:

Asia > Russia (0.05)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
(7 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)


TACAM: Topic And Context Aware Argument Mining

Fromm, Michael, Faerman, Evgeniy, Seidl, Thomas

arXiv.org Machine Learning | May-26-2019

In this work we address the problem of argument search. The purpose of argument search is the distillation of pro and contra arguments for requested topics from large text corpora. In previous works, the usual approach is to use a standard search engine to extract text parts which are relevant to the given topic and subsequently use an argument recognition algorithm to select arguments from them. The main challenge in the argument recognition task, which is also known as argument mining, is that often sentences containing arguments are structurally similar to purely informative sentences without any stance about the topic. In fact, they only differ semantically. Most approaches use topic or search term information only for the first search step and therefore assume that arguments can be classified independently of a topic. We argue that topic information is crucial for argument mining, since the topic defines the semantic context of an argument. Precisely, we propose different models for the classification of arguments, which take information about a topic of an argument into account. Moreover, to enrich the context of a topic and to let models understand the context of the potential argument better, we integrate information from different external sources such as Knowledge Graphs or pre-trained NLP models. Our evaluation shows that considering topic information, especially in connection with external information, provides a significant performance boost for the argument mining task.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

1906.00923

Country:

Europe (1.00)
North America > United States (0.46)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Government (0.94)
Education (0.68)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)


Derived Codebooks for High-Accuracy Nearest Neighbor Search

André, Fabien, Kermarrec, Anne-Marie, Scouarnec, Nicolas Le

arXiv.org Artificial Intelligence | May-16-2019

High-dimensional Nearest Neighbor (NN) search is central in multimedia search systems. Product Quantization (PQ) is a widespread NN search technique which has a high performance and good scalability. PQ compresses high-dimensional vectors into compact codes thanks to a combination of quantizers. Large databases can, therefore, be stored entirely in RAM, enabling fast responses to NN queries. In almost all cases, PQ uses 8-bit quantizers as they offer low response times. In this paper, we advocate the use of 16-bit quantizers. Compared to 8-bit quantizers, 16-bit quantizers boost accuracy but they increase response time by a factor of 3 to 10. We propose a novel approach that allows 16-bit quantizers to offer the same response time as 8-bit quantizers, while still providing a boost of accuracy. Our approach builds on two key ideas: (i) the construction of derived codebooks that allow a fast and approximate distance evaluation, and (ii) a two-pass NN search procedure which builds a candidate set using the derived codebooks, and then refines it using 16-bit quantizers. On 1 billion SIFT vectors, with an inverted index, our approach offers a Recall@100 of 0.85 in 5.2 ms. By contrast, 16-bit quantizers alone offer a Recall@100 of 0.85 in 39 ms, and 8-bit quantizers a Recall@100 of 0.82 in 3.8 ms.
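As background for this abstract, the sketch below shows plain product quantization: a vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and query distances are computed from small per-sub-quantizer lookup tables. It does not implement the paper's derived codebooks or two-pass search. The codebooks are hand-picked for illustration (in practice they would be trained, typically with k-means), and each holds 4 centroids (2 bits) rather than the 256 or 65,536 centroids of 8-bit and 16-bit quantizers.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector, m):
    """Cut a vector into m contiguous sub-vectors of equal length."""
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]

def encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda j: sq_dist(sub, book[j]))
        for sub, book in zip(split(vector, len(codebooks)), codebooks)
    ]

def decode(code, codebooks):
    """Rebuild the approximate vector by concatenating the chosen centroids."""
    return [x for j, book in zip(code, codebooks) for x in book[j]]

def adc_distance(query, code, codebooks):
    """Asymmetric distance: exact query against a quantized database vector."""
    tables = [
        [sq_dist(sub, centroid) for centroid in book]
        for sub, book in zip(split(query, len(codebooks)), codebooks)
    ]
    return sum(table[j] for table, j in zip(tables, code))

# Hypothetical hand-picked codebooks: 2 sub-quantizers, 4 centroids each.
codebooks = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(0, 0), (0, 2), (2, 0), (2, 2)],
]
vector = [0.9, 0.2, 1.8, 1.9]
code = encode(vector, codebooks)
query = [0, 0, 0, 0]
print(code, decode(code, codebooks), adc_distance(query, code, codebooks))
```

The table-based distance equals the squared distance between the query and the decoded vector, which is why a database can be searched from its compact codes alone. More centroids per sub-quantizer make the decoded vector closer to the original but make the lookup tables larger, which is the accuracy versus response time trade-off the abstract discusses.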

information retrieval, machine learning, quantizer, (20 more...)

arXiv.org Artificial Intelligence

1905.069

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.89)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.61)


Opening Up the Black Box: Auditing Google's Top Stories Algorithm

Lurie, Emma (Wellesley College) | Mustafaraj, Eni (Wellesley College)

AAAI Conferences | May-15-2019

Auditing algorithms has emerged as a methodology for holding algorithms accountable by testing whether they are fair. This process often relies on the repeated use of a platform to record inputs and their corresponding outputs. For example, to audit Google search, one repeatedly inputs queries and captures the received search pages. The goal is then to discover, in the collected data, patterns that will reveal the "secrets" of algorithmic decision making. This knowledge discovery process makes some algorithm auditing tasks great applications for data mining techniques. In this paper, we introduce one particular algorithm audit, that of Google's Top stories. We describe the process of data collection, exploration, and analysis for this application and share some of the gleaned insights. Concretely, our analysis suggests that Google might be trying to burst the famous "filter bubble" by choosing less known publishers for the 3rd position in the Top stories.

information retrieval, machine learning, natural language, (19 more...)

AAAI Conferences

The Thirty-Second International FLAIRS Conference

Country:

North America > United States > Nevada > Clark County > Las Vegas (0.04)
Africa > Middle East > Egypt (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Norfolk County > Wellesley (0.04)

Industry:

Media > News (1.00)
Government > Regional Government > North America Government > United States Government (0.94)
Information Technology (0.90)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
(2 more...)
