 Information Retrieval


Blender Bot -- Part 3: The Many Architectures

#artificialintelligence

We have been looking into Facebook's open-sourced conversational offering, Blender Bot. In Part 1 we covered in detail the datasets used in its pre-training and fine-tuning, along with Blender's failure cases and limitations. In Part 2 we studied the more generic problem setting of "Multi-Sentence Scoring" and the Transformer architectures used for such a task, and learnt about Poly-Encoders in particular, which will be used to provide the encoder representations in Blender. In this third and final part, we return from our respite with Poly-Encoders back to Blender. We shall go over the different model architectures, their respective training objectives, the evaluation methods, and the performance of Blender in comparison to Meena.


DuckDuckGo down in India: Private browser mysteriously stops working

The Independent - Tech

Privacy-focused search engine DuckDuckGo has reported that its service is not working in India. "To our users in India: We've received many reports our search engine is unreachable by much of India right now and have confirmed it is not due to us," the company tweeted. "We're actively talking to Internet providers to get to the bottom of it ASAP. Thank you for your patience." It is unclear why DuckDuckGo would be unavailable in the country.


Answering Questions on COVID-19 in Real-Time

arXiv.org Artificial Intelligence

The recent outbreak of the novel coronavirus is wreaking havoc on the world and researchers are struggling to effectively combat it. One reason the fight is difficult is the lack of information and knowledge. In this work, we outline our effort to contribute to shrinking this knowledge vacuum by creating covidAsk, a question answering (QA) system that combines biomedical text mining and QA techniques to provide answers to questions in real-time. Our system leverages both supervised and unsupervised approaches to provide informative answers using DenSPI (Seo et al., 2019) and BEST (Lee et al., 2016). Evaluation of covidAsk is carried out using a manually created dataset called COVID-19 Questions, which is based on facts about COVID-19. We hope our system will be able to aid researchers in their search for knowledge and information not only for COVID-19 but for future pandemics as well.


Twitter Will Check if Articles Are Read Before Sharing - Search Engine Journal

#artificialintelligence

Twitter announced a new feature that encourages Android Twitter users to read an article before sharing it. This raised suspicions that Twitter was tracking user clicks. The move is part of Twitter's stated goal to encourage "informed discussion." Often people share a link without reading the article. That results in clickbait titles getting widely promoted regardless of the content.


A Flexible Framework for Entity Resolution

#artificialintelligence

A critical component of data management and enrichment pipelines is connecting large datasets from various sources to form a holistic view; to make connections between entities across data sources. Oftentimes, these entities -- such as individuals, organizations, or addresses -- may not have a unique identifier that can be used as a key to detect duplicates or to merge datasets on. ThinkData has developed a scalable entity resolution engine to solve these problems. After experimenting with both deep learning and traditional NLP techniques, the team has found the best balance of accuracy and performance. Specifically, we have achieved near-parity in accuracy compared to Magellan (the leading entity resolution project in research), albeit with much better performance metrics and greater scalability.


Pre-training via Paraphrasing

arXiv.org Machine Learning

We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an alternative to the dominant masked language modeling paradigm, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize the likelihood of generating the original. We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization. The objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance on several tasks. For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation. We further show that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.


DSC Data Science Search Engine

#artificialintelligence

Productive, Self-Service Data Science - June 30 Data science is a core part of an organization's digital transformation strategy. In this latest DSC webinar, discover how American Family Insurance's use of the Alation Data Catalog is enabling more productive data science outcomes with trusted, curated data.


Evaluating Your Learning to Rank Model: Dos and Don'ts in Offline/Onl…

#artificialintelligence

Learning to rank (LTR from now on) is the application of machine learning techniques, typically supervised, in the formulation of ranking models for information retrieval systems. With LTR becoming more and more popular (Apache Solr has supported it since January 2017, and Elasticsearch has an open-source plugin released in 2018), organizations struggle with the problem of how to evaluate the quality of the models they train. This talk explores all the major points in both Offline and Online evaluation. Setting up correct infrastructure and processes for a fair and effective evaluation of the trained models is vital for measuring the improvements and regressions of an LTR system.

The talk is intended for:
- Product Owners, Search Managers, Business Owners
- Software Engineers, Data Scientists, and Machine Learning Enthusiasts

Expect to learn:
- the importance of Offline testing from a business perspective
- how Offline testing can be done with Open Source libraries
- how to build a realistic test set from the original input data set, avoiding common mistakes in the process
- the importance of Online testing from a business perspective
- A/B testing and Interleaving approaches: details and pros/cons
- common mistakes and how they can skew the obtained results

Join us as we explore real-world scenarios and dos and don'ts from the e-commerce industry!
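As a rough illustration of what an offline check for a ranking model can look like, the sketch below computes NDCG@k for a single query. The abstract does not say which metrics the talk covers, so the choice of NDCG, the linear-gain variant, and the function names `dcg_at_k` and `ndcg_at_k` are all assumptions made for this example. It also assumes the input list holds the graded relevance labels of every judged document for the query, in the order the model ranked them.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k labels (linear gain variant)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    # A query with no relevant documents is scored 0.0 here (an assumption;
    # some evaluations skip such queries instead).
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Labels of the documents in the order the (hypothetical) model returned them.
model_order = [3, 2, 3, 0, 1]
score = ndcg_at_k(model_order, k=5)
```

Averaging this value over all queries in a held-out test set gives one number to compare two models on. It does not replace the online checks the talk mentions, such as A/B testing or interleaving.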


Semantic Linking Maps for Active Visual Object Search

arXiv.org Artificial Intelligence

We aim for mobile robots to function in a variety of common human environments. Such robots need to be able to reason about the locations of previously unseen target objects. Landmark objects can help this reasoning by narrowing down the search space significantly. More specifically, we can exploit background knowledge about common spatial relations between landmark and target objects. For example, seeing a table and knowing that cups can often be found on tables aids the discovery of a cup. Such correlations can be expressed as distributions over possible pairing relationships of objects. In this paper, we introduce the Semantic Linking Maps (SLiM) model and, through it, propose an active visual object search strategy. SLiM simultaneously maintains the belief over a target object's location as well as landmark objects' locations, while accounting for probabilistic inter-object spatial relations. Based on SLiM, we describe a hybrid search strategy that selects the next best view pose for searching for the target object based on the maintained belief. We demonstrate the efficiency of our SLiM-based search strategy through comparative experiments in simulated environments. We further demonstrate the real-world applicability of SLiM-based search in scenarios with a Fetch mobile manipulation robot.


CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization

arXiv.org Artificial Intelligence

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of May 2020, 128,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge [23]. Here we present CO-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers during a time of crisis. The retriever is built from a Siamese-BERT [18] encoder that is linearly composed with a TF-IDF vectorizer [19], and reciprocal-rank fused [5] with a BM25 vectorizer. The ranker is composed of a multi-hop question-answering module [1] that, together with a multi-paragraph abstractive summarizer, adjusts retriever scores. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations, creating 1.3 million (citation title, paragraph) tuples for training the encoder. We evaluate our system on the data of the TREC-COVID [22] information retrieval challenge. CO-Search obtains top performance on the datasets of the first and second rounds, across several key metrics: normalized discounted cumulative gain, precision, mean average precision, and binary preference.
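The reciprocal-rank fusion step named in the abstract can be sketched in a few lines. This is a minimal illustration of the general technique, not CO-Search's implementation: the function name `reciprocal_rank_fusion`, the example document ids, and the smoothing constant `k=60` (a commonly used default) are assumptions. Each document receives the sum of `1 / (k + rank)` over every ranking in which it appears.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed default; tune for your data).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of two retrievers for the same query.
embedding_ranking = ["a", "b", "c"]   # e.g. from a dense encoder
bm25_ranking = ["b", "c", "a"]        # e.g. from a lexical scorer
fused = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale, which is the usual reason to prefer this kind of fusion over adding scores directly.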