Information Retrieval
Ambiverse - an amazing open-source suite for natural language understanding
While doing performance benchmarks for Named Entity Linking solutions for our AI/FinTech start-up Risklio, I stumbled upon a very powerful, only just open-sourced framework called AmbiverseNLU. It was developed by Ambiverse and is based on work previously done at the Max Planck Institute¹. The components it uses are more well-known: entity recognition from KnowNER², open information extraction using ClausIE³ and AIDA, an entity detection and disambiguation tool⁴. You can have a look at the demo here. For the former one you can choose whether to use Apache Cassandra or PostgreSQL as a backend, while the last one uses Neo4j.
Terminology-based Text Embedding for Computing Document Similarities on Technical Content
Mirisaee, Hamid, Gaussier, Eric, Lagnier, Cedric, Guerraz, Agnes
We propose in this paper a new, hybrid document embedding approach in order to address the problem of document similarities with respect to the technical content. To do so, we employ state-of-the-art graph techniques to first extract the keyphrases (composite keywords) of documents and, then, use them to score the sentences. Using the ranked sentences, we propose two approaches to embed documents and show their performances with respect to two baselines. With domain expert annotations, we illustrate that the proposed methods can find more relevant documents and outperform the baselines by up to 27% in terms of NDCG.
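For readers unfamiliar with the metric quoted above, NDCG (normalized discounted cumulative gain) can be sketched as follows. This is a minimal illustration using the common linear-gain, log2-discount form; the abstract does not say which NDCG variant or cutoff the authors used, so the function names and the `k` parameter here are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each graded relevance is divided by
    log2(rank + 1), with ranks starting at 1."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the ranking as given, divided by the DCG of the ideal
    (descending) ordering of the same relevance scores.

    `relevances` holds the graded relevance of each retrieved document,
    in the order the system ranked them. `k` is an optional cutoff.
    """
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that already lists documents in descending relevance scores 1.0; any other ordering of the same documents scores lower.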
Distant Learning for Entity Linking with Automatic Noise Detection
Accurate entity linkers have been produced for domains and languages where annotated data (i.e., texts linked to a knowledge base) is available. However, little progress has been made for the settings where no or very limited amounts of labeled data are present (e.g., legal or most scientific domains). In this work, we show how we can learn to link mentions without having any labeled examples, only a knowledge base and a collection of unannotated texts from the corresponding domain. In order to achieve this, we frame the task as a multi-instance learning problem and rely on surface matching to create initial noisy labels. As the learning signal is weak and our surrogate labels are noisy, we introduce a noise detection component in our model: it lets the model detect and disregard examples which are likely to be noisy. Our method, jointly learning to detect noise and link entities, greatly outperforms the surface matching baseline. For a subset of entity categories, it even approaches the performance of supervised learning.
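The surface-matching step that produces the initial noisy labels might look roughly like the sketch below. It is only an assumed, simplified version: case-insensitive exact matching against knowledge-base names, returning a bag of candidate entities per mention in the multi-instance spirit described above. The knowledge-base layout, the entity IDs, and the function names are hypothetical, and the noise-detection model itself is not shown.

```python
from collections import defaultdict


def build_alias_index(kb):
    """Map each lower-cased entity name to the set of entity IDs using it.

    `kb` is a hypothetical dict: entity ID -> list of names for that entity.
    """
    index = defaultdict(set)
    for entity_id, names in kb.items():
        for name in names:
            index[name.lower()].add(entity_id)
    return index


def candidate_bag(mention, index):
    """Return every entity whose name matches the mention string.

    The bag is a noisy label: it may hold several entities (ambiguity)
    or none at all, and the correct entity is not guaranteed to be in it.
    """
    return set(index.get(mention.lower(), set()))


# Toy knowledge base with made-up IDs.
kb = {"E1": ["Paris"], "E2": ["Paris", "Paris Hilton"], "E3": ["Berlin"]}
index = build_alias_index(kb)
print(candidate_bag("paris", index))
```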
Content Word-based Sentence Decoding and Evaluating for Open-domain Neural Response Generation
Zhao, Tianyu, Kawahara, Tatsuya
Various encoder-decoder models have been applied to response generation in open-domain dialogs, but a majority of conventional models directly learn a mapping from lexical input to lexical output without explicitly modeling intermediate representations. Utilizing language hierarchy and modeling intermediate information have been shown to benefit many language understanding and generation tasks. Motivated by Broca's aphasia, we propose to use a content word sequence as an intermediate representation for open-domain response generation. Experimental results show that the proposed method improves content relatedness of produced responses, and our models can often choose correct grammar for generated content words. Meanwhile, instead of evaluating complete sentences, we propose to compute conventional metrics on content word sequences, which is a better indicator of content relevance.
MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension
Talmor, Alon, Berant, Jonathan
A large number of reading comprehension (RC) datasets has been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to generalization, and show that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019). We also find that training on multiple source RC datasets leads to robust generalization and transfer, and can reduce the cost of example collection for a new RC dataset. Following our analysis, we propose MultiQA, a BERT-based model, trained on multiple RC datasets, which leads to state-of-the-art performance on five RC datasets. We share our infrastructure for the benefit of the research community.
DiffQue: Estimating Relative Difficulty of Questions in Community Question Answering Services
Thukral, Deepak, Pandey, Adesh, Gupta, Rishabh, Goyal, Vikram, Chakraborty, Tanmoy
Automatic estimation of relative difficulty of a pair of questions is an important and challenging problem in community question answering (CQA) services. There are limited studies which addressed this problem. Past studies mostly leveraged expertise of users answering the questions and barely considered other properties of CQA services such as metadata of users and posts, temporal information and textual content. In this paper, we propose DiffQue, a novel system that maps this problem to a network-aided edge directionality prediction problem. DiffQue starts by constructing a novel network structure that captures different notions of difficulties among a pair of questions. It then measures the relative difficulty of two questions by predicting the direction of a (virtual) edge connecting these two questions in the network. It leverages features extracted from the network structure, metadata of users/posts and textual description of questions and answers. Experiments on datasets obtained from two CQA sites (further divided into four datasets) with human annotated ground-truth show that DiffQue outperforms four state-of-the-art methods by a significant margin (28.77% higher F1 score and 28.72% higher AUC than the best baseline). As opposed to the other baselines, (i) DiffQue appropriately responds to the training noise, (ii) DiffQue is capable of adapting to multiple domains (CQA datasets), and (iii) DiffQue can efficiently handle the 'cold start' problem which may arise due to the lack of information for newly posted questions or newly arrived users.
Learning to Route in Similarity Graphs
Baranchuk, Dmitry, Persiyanov, Dmitry, Sinitsin, Anton, Babenko, Artem
Recently similarity graphs became the leading paradigm for efficient nearest neighbor search, outperforming traditional tree-based and LSH-based methods. Similarity graphs perform the search via greedy routing: a query traverses the graph and in each vertex moves to the adjacent vertex that is the closest to this query. In practice, similarity graphs are often susceptible to local minima, when queries do not reach its nearest neighbors, getting stuck in suboptimal vertices. In this paper we propose to learn the routing function that overcomes local minima via incorporating information about the graph global structure. In particular,
The current approaches for efficient NNS mostly belong to three separate lines of research. The first family of methods, based on partition trees (Bentley, 1975; Sproull, 1991; McCartin-Lim et al., 2012; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2013), hierarchically split the search space into a large number of regions, corresponding to tree leaves, and query visits only a limited number of promising regions when searching. The second, locality-sensitive hashing methods (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2008; Andoni et al., 2015) map the database points into a number of buckets using several hash functions such that the probability of collision is much higher for nearby points than for points that are further apart. At the search stage, a query is also hashed, and distances to all the points from the corresponding buckets are evaluated.
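The greedy routing procedure and the local-minimum failure it suffers from can be illustrated with a small sketch. This shows plain greedy routing only, not the learned routing function the paper proposes; the toy graph, the coordinates, and the function name are assumptions made for illustration.

```python
import math


def greedy_route(graph, points, query, start):
    """Greedy routing on a similarity graph.

    From `start`, repeatedly move to the adjacent vertex closest to
    `query`. Stop when no neighbor is closer than the current vertex,
    which may be the true nearest neighbor or only a local minimum.
    Returns the final vertex and the path taken.
    """
    current = start
    current_dist = math.dist(points[current], query)
    path = [current]
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(points[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, path
        current, current_dist = best, best_dist
        path.append(current)


# Toy graph (made up): vertex 3 is the true nearest neighbor of the query,
# but it is reachable only through vertex 2, which lies far from the query.
points = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (1.0, 3.0), 3: (4.0, 0.5)}
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(greedy_route(graph, points, (4.0, 0.0), start=0))
```

In this toy graph the walk stops at vertex 1, because its only neighbor is farther from the query, even though vertex 3 is much closer. This is the kind of suboptimal vertex the abstract refers to.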
TACAM: Topic And Context Aware Argument Mining
Fromm, Michael, Faerman, Evgeniy, Seidl, Thomas
In this work we address the problem of argument search. The purpose of argument search is the distillation of pro and contra arguments for requested topics from large text corpora. In previous works, the usual approach is to use a standard search engine to extract text parts which are relevant to the given topic and subsequently use an argument recognition algorithm to select arguments from them. The main challenge in the argument recognition task, which is also known as argument mining, is that often sentences containing arguments are structurally similar to purely informative sentences without any stance about the topic. In fact, they only differ semantically. Most approaches use topic or search term information only for the first search step and therefore assume that arguments can be classified independently of a topic. We argue that topic information is crucial for argument mining, since the topic defines the semantic context of an argument. Precisely, we propose different models for the classification of arguments, which take information about a topic of an argument into account. Moreover, to enrich the context of a topic and to let models understand the context of the potential argument better, we integrate information from different external sources such as Knowledge Graphs or pre-trained NLP models. Our evaluation shows that considering topic information, especially in connection with external information, provides a significant performance boost for the argument mining task.
Derived Codebooks for High-Accuracy Nearest Neighbor Search
André, Fabien, Kermarrec, Anne-Marie, Scouarnec, Nicolas Le
High-dimensional Nearest Neighbor (NN) search is central in multimedia search systems. Product Quantization (PQ) is a widespread NN search technique which has a high performance and good scalability. PQ compresses high-dimensional vectors into compact codes thanks to a combination of quantizers. Large databases can, therefore, be stored entirely in RAM, enabling fast responses to NN queries. In almost all cases, PQ uses 8-bit quantizers as they offer low response times. In this paper, we advocate the use of 16-bit quantizers. Compared to 8-bit quantizers, 16-bit quantizers boost accuracy but they increase response time by a factor of 3 to 10. We propose a novel approach that allows 16-bit quantizers to offer the same response time as 8-bit quantizers, while still providing a boost of accuracy. Our approach builds on two key ideas: (i) the construction of derived codebooks that allow a fast and approximate distance evaluation, and (ii) a two-pass NN search procedure which builds a candidate set using the derived codebooks, and then refines it using 16-bit quantizers. On 1 billion SIFT vectors, with an inverted index, our approach offers a Recall@100 of 0.85 in 5.2 ms. By contrast, 16-bit quantizers alone offer a Recall@100 of 0.85 in 39 ms, and 8-bit quantizers a Recall@100 of 0.82 in 3.8 ms.
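As background for the abstract above, the basic Product Quantization scheme can be sketched as follows: each vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and query-to-code distances are then read from small lookup tables. An 8-bit quantizer has 2^8 = 256 centroids per codebook and a 16-bit one has 2^16 = 65,536; the toy codebooks below have only two centroids each and are made up. This sketch covers plain PQ only, not the derived codebooks or the two-pass search proposed in the paper, and codebook training (typically k-means) is omitted.

```python
def split(vector, m):
    """Split a vector into m equal-length sub-vectors."""
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vector, codebooks):
    """Compress a vector into one centroid index per sub-vector."""
    subs = split(vector, len(codebooks))
    return [min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
            for sub, cb in zip(subs, codebooks)]


def distance_tables(query, codebooks):
    """For each sub-quantizer, precompute the squared distance from the
    query's sub-vector to every centroid."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, c) for c in cb] for sub, cb in zip(subs, codebooks)]


def approx_dist(code, tables):
    """Approximate squared distance between the query and an encoded
    vector: one table lookup per sub-quantizer, then a sum."""
    return sum(table[idx] for table, idx in zip(tables, code))


# Toy setup: 4-dimensional vectors, 2 sub-quantizers, 2 centroids each.
codebooks = [[(0, 0), (1, 1)], [(0, 0), (2, 2)]]
code = encode((0.9, 1.1, 0.1, -0.1), codebooks)
tables = distance_tables((1, 1, 2, 2), codebooks)
print(code, approx_dist(code, tables))
```

The lookup tables are built once per query, so scanning the database costs only a few table reads per stored code. Tables for 16-bit quantizers are 256 times larger than for 8-bit ones, which fits the slower response times the abstract reports for them.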
Opening Up the Black Box: Auditing Google's Top Stories Algorithm
Lurie, Emma (Wellesley College) | Mustafaraj, Eni (Wellesley College)
Auditing algorithms has emerged as a methodology for holding algorithms accountable by testing whether they are fair. This process often relies on the repeated use of a platform to record inputs and their corresponding outputs. For example, to audit Google search, one repeatedly inputs queries and captures the received search pages. The goal is then to discover, in the collected data, patterns that will reveal the "secrets" of algorithmic decision making. This knowledge discovery process makes some algorithm auditing tasks great applications for data mining techniques. In this paper, we introduce one particular algorithm audit, that of Google's Top stories. We describe the process of data collection, exploration, and analysis for this application and share some of the gleaned insights. Concretely, our analysis suggests that Google might be trying to burst the famous "filter bubble" by choosing less known publishers for the 3rd position in the Top stories.