Information Retrieval
Amazon Comprehend adds five new languages to Custom Entity Recognition
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to analyze text documents and identify insights such as sentiment, entities, and topics from text. You can use Custom Entity Recognition to identify terms that are specific to your domain. For example, you can instantly extract product names, financial entities or any term relevant to you from unstructured text documents. Starting today, Amazon Comprehend is adding support for the following five new languages to Custom Entity Recognition: French, German, Italian, Portuguese, and Spanish.
Researchers claim bias in AI named entity recognition models
Twitter researchers claim to have found evidence of demographic bias in named entity recognition, the first step toward generating automated knowledge bases, or the repositories leveraged by services like search engines. They say their analysis reveals AI performs better at identifying names from specific groups and that the biases manifest in syntax, semantics, and how word uses vary across linguistic contexts. Knowledge bases are essentially databases containing information about entities -- people, places, and things. In 2012, Google launched a knowledge base -- the Knowledge Graph -- to enhance Google search results with hundreds of billions of facts gathered from sources including Wikipedia, Wikidata, and the CIA World Factbook. Microsoft provides a knowledge base with over 150,000 articles created by support professionals who've resolved issues for its customers. But while the usefulness of knowledge bases is not in dispute, the researchers assert the embeddings used to represent entities in them exhibit bias against certain groups of people.
(Almost) All of Entity Resolution
Binette, Olivier, Steorts, Rebecca C.
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and that has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.
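As an illustration of the probabilistic record linkage mentioned above, the following is a minimal sketch of Fellegi-Sunter style match weights. It is not taken from the article: the field names and the m/u probabilities (the chance that a field agrees for a true match and for a non-match, respectively) are hypothetical values chosen only for the example.

```python
import math

# Hypothetical per-field probabilities (assumptions, not from the article):
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "city": (0.80, 0.10),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the log-likelihood-ratio weights over all compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            # Agreement is evidence for a match when m > u.
            total += math.log2(m / u)
        else:
            # Disagreement is evidence against a match.
            total += math.log2((1 - m) / (1 - u))
    return total


a = {"last_name": "steorts", "birth_year": 1980, "city": "durham"}
b = {"last_name": "steorts", "birth_year": 1980, "city": "raleigh"}
c = {"last_name": "smith", "birth_year": 1975, "city": "boston"}

print(match_weight(a, a))  # all fields agree: strongly positive
print(match_weight(a, b))  # one disagreement: still positive
print(match_weight(a, c))  # all fields disagree: negative
```

In practice a pair would be declared a link, a non-link, or sent to clerical review by comparing this weight against two thresholds, and the m/u values would be estimated from data rather than fixed by hand.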
Extracting Keywords from Open-Ended Business Survey Questions
McGillivray, Barbara, Jenset, Gard, Heil, Dominik
Open-ended survey data constitute an important basis in research as well as for making business decisions. Collecting and manually analysing free-text survey data is generally more costly than collecting and analysing survey data consisting of answers to multiple-choice questions. Yet free-text data allow for new content to be expressed beyond predefined categories and are a very valuable source of new insights into people's opinions. At the same time, surveys always make ontological assumptions about the nature of the entities that are researched, and this has vital ethical consequences. Human interpretations and opinions can only be properly ascertained in their richness using textual data sources; if these sources are analyzed appropriately, the essential linguistic nature of humans and social entities is safeguarded. Natural Language Processing (NLP) offers possibilities for meeting this ethical business challenge by automating the analysis of natural language and thus allowing for insightful investigations of human judgements. We present a computational pipeline for analysing large amounts of responses to open-ended questions in surveys and extracting keywords that appropriately represent people's opinions. This pipeline addresses the need to perform such tasks outside the scope of both commercial software and bespoke analysis, exceeds the performance of state-of-the-art systems, and performs this task in a transparent way that allows for scrutinising and exposing potential biases in the analysis. Following the principle of Open Data Science, our code is open-source and generalizable to other datasets.
I CONTEXT AND MOTIVATION
Leaders, managers, and decision-makers critically rely on information and feedback. Decision-makers first need information about the current set of circumstances which provide the context of the decision, and then need feedback on how the decision could play out.
Getting such information in a format that allows them to properly understand the entity they are seeking to comprehend is critical to reaching a high-quality decision. Often only qualitative insight into the opinions, interpretations and assumptions of large numbers of people allows us to understand a set of circumstances properly; such insight is therefore required for high-quality decisions and, consequently, high-quality outcomes.
Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering
Liu, Ye, Chowdhury, Shaika, Zhang, Chenwei, Caragea, Cornelia, Yu, Philip S.
Healthcare question answering assistance aims to provide customers with healthcare information, and it appears widely on both the Web and the mobile Internet. The questions usually require the assistance to have proficient healthcare background knowledge as well as the ability to reason over that knowledge. Recently a challenge involving complex healthcare reasoning, the HeadQA dataset, has been proposed, which contains multiple-choice questions authorized for the public healthcare specialization exam. Unlike most other QA tasks that focus on linguistic understanding, HeadQA requires deeper reasoning involving not only knowledge extraction, but also complex reasoning with healthcare knowledge. These questions are the most challenging for current QA systems, and the current performance of the state-of-the-art method is only slightly better than a random guess. In order to solve this challenging task, we present a Multi-step reasoning with Knowledge extraction framework (MurKe). The proposed framework first extracts the healthcare knowledge as supporting documents from the large corpus. In order to find the reasoning chain and choose the correct answer, MurKe iterates between selecting the supporting documents, reformulating the query representation using the supporting documents, and computing an entailment score for each choice using the entailment model. The reformulation module leverages selected documents for missing evidence, which maintains interpretability. Moreover, we strive to make full use of off-the-shelf pre-trained models. With fewer trainable weights, the pre-trained model can easily adapt to healthcare tasks with limited training samples. The experimental results and ablation study show that our system outperforms several strong baselines on the HeadQA dataset.
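The select, reformulate, score loop described in this abstract can be sketched with a toy example. This is not the MurKe implementation: the function names are hypothetical, and plain token overlap stands in, as a loudly labeled assumption, for the paper's neural document selector, query reformulation module and entailment model.

```python
def tokens(text):
    """Lower-cased bag of words; a crude stand-in for a learned representation."""
    return set(text.lower().split())


def multi_step_answer(question, choices, corpus, steps=2):
    """Toy multi-step reasoning loop (hypothetical, overlap-based scoring)."""
    query = tokens(question)
    chain = []   # selected supporting documents, kept for interpretability
    scores = {}
    for _ in range(steps):
        remaining = [doc for doc in corpus if doc not in chain]
        if not remaining:
            break
        # 1. Select the supporting document that best matches the current query.
        doc = max(remaining, key=lambda d: len(tokens(d) & query))
        chain.append(doc)
        # 2. Reformulate the query by folding in the document's terms.
        query |= tokens(doc)
        # 3. Score each choice (overlap used in place of an entailment model).
        scores = {c: len(tokens(c) & query) for c in choices}
    best = max(scores, key=scores.get)
    return best, chain


corpus = [
    "scurvy is caused by lack of ascorbic acid",
    "ascorbic acid is also called vitamin c",
    "rickets is linked to vitamin d deficiency",
]
answer, chain = multi_step_answer(
    "which vitamin prevents scurvy", ["vitamin c", "vitamin d"], corpus
)
print(answer)
print(chain)
```

The returned chain of documents is what makes the toy version inspectable: the second document is only selected because the first one introduced the terms "ascorbic acid" into the query.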
Solving One of the Biggest Challenges for AI-Based Search Engines: Relevance
Let's learn how to implement ClickModels in order to extract Relevance from clickstream data. These steps tend to be what is already needed to implement a reasonably effective search engine for a given application. Eventually, the requirement to upgrade the system to deliver customized results may arise. Doing so should be simple: one could choose from a set of machine learning ranking algorithms, train some selected models, prepare them for production and observe the results.
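To make the idea of extracting relevance from clickstream data concrete, here is a minimal sketch of one classic click model, the cascade model. It is not the article's implementation: the session format and the function name are assumptions for illustration. The model assumes a user scans results from the top, clicks the first attractive one and stops, so a document counts as examined only if it sits at or above the clicked position (or anywhere in the list if nothing was clicked).

```python
from collections import defaultdict


def cascade_relevance(sessions):
    """Estimate per-document attractiveness as clicks / examinations.

    sessions: list of (ranked_doc_ids, clicked_rank), where clicked_rank is
    the 0-based position of the single click, or None if nothing was clicked.
    """
    examined = defaultdict(int)
    clicked = defaultdict(int)
    for docs, click_rank in sessions:
        # Without a click, the cascade assumption is that every result was seen.
        last = click_rank if click_rank is not None else len(docs) - 1
        for rank, doc in enumerate(docs[: last + 1]):
            examined[doc] += 1
            if rank == click_rank:
                clicked[doc] += 1
    return {doc: clicked[doc] / examined[doc] for doc in examined}


sessions = [
    (["a", "b", "c"], 1),     # skipped "a", clicked "b"
    (["a", "b", "c"], 0),     # clicked "a" immediately
    (["b", "a", "c"], None),  # examined everything, clicked nothing
]
relevance = cascade_relevance(sessions)
print(relevance)
```

Compared with a raw click-through rate, this estimate does not penalize documents that were shown below the click and so were probably never seen. The resulting scores could then serve as relevance labels when training a ranking model.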
The Art of SEO: Mastering Search Engine Optimization, 3rd Edition - Programmer Books
Three acknowledged experts in search engine optimization share guidelines and innovative techniques that will help you plan and execute a comprehensive SEO strategy. Novices will receive a thorough SEO education, while experienced SEO practitioners get an extensive reference to support ongoing engagements. Comprehend SEO's many intricacies and complexities. Explore the underlying theory and inner workings of search engines. Understand the role of social media, user data, and links. Discover tools to track results and measure success. Examine the effects of Google's Panda and Penguin algorithms. Consider opportunities in mobile, local, and vertical SEO. Build a competent SEO team with defined roles. Glimpse the future of search and the SEO industry.
Content Clustering: 50 Tips for Content Planning with Topic Clustering.
Have you decided to tune your business to the next evolution of SEO? There is a buzz around content clusters on social media platforms and across the internet. SEO content managers and specialists are always struggling to balance search engine optimization against content quality. Here are the tips you need to know about content clustering, an approach in which content planning is organized around topic clusters. Content clustering concentrates on a single central topic: you create a cluster of related content and interlink the pieces through hyperlinks.
NeuralQA: A Usable Library for Question Answering (Contextual Query Expansion + BERT) on Large Datasets
Existing tools for Question Answering (QA) have challenges that limit their use in practice. They can be complex to set up or integrate with existing infrastructure, do not offer configurable interactive interfaces, and do not cover the full set of subtasks that frequently comprise the QA pipeline (query expansion, retrieval, reading, and explanation/sensemaking). To help address these issues, we introduce NeuralQA - a usable library for QA on large datasets. NeuralQA integrates well with existing infrastructure (e.g., ElasticSearch instances and reader models trained with the HuggingFace Transformers API) and offers helpful defaults for QA subtasks. It introduces and implements contextual query expansion (CQE) using a masked language model (MLM) as well as relevant snippets (RelSnip) - a method for condensing large documents into smaller passages that can be speedily processed by a document reader model. Finally, it offers a flexible user interface to support workflows for research explorations (e.g., visualization of gradient-based explanations to support qualitative inspection of model behaviour) and large scale search deployment. Code and documentation for NeuralQA are available as open source on GitHub.
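The idea of condensing a large document into a few smaller passages before handing it to a reader model can be sketched as follows. This is not NeuralQA's RelSnip implementation or API: the function name, the fixed word window and the query-term overlap score are all assumptions made for illustration.

```python
def relevant_snippets(document, query, window=40, top_k=2):
    """Split a document into word windows and keep the top_k best-matching ones.

    Scoring is a plain count of query-term occurrences (an assumption; a real
    system would more likely use a retrieval score such as BM25).
    """
    words = document.split()
    query_terms = {term.lower() for term in query.split()}
    passages = [words[i:i + window] for i in range(0, len(words), window)]
    scored = []
    for idx, passage in enumerate(passages):
        overlap = sum(
            1 for w in passage if w.lower().strip(".,;:!?") in query_terms
        )
        scored.append((overlap, idx))
    # Highest overlap first; ties are broken by earlier position.
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    # Return the kept passages in their original document order.
    return [" ".join(passages[idx]) for _, idx in sorted(best, key=lambda s: s[1])]


doc = "one two three four five six seven eight"
snippets = relevant_snippets(doc, "seven eight", window=4, top_k=1)
print(snippets)
```

Only the selected passages would then be passed to the reader model, which keeps its input short regardless of the original document length.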