Text Processing


AI and the Case of the Disappearing Textbooks

#artificialintelligence

Artificial Intelligence is making the transition to electronic-only publishing a necessity for textbook publishers. In a recent story, the BBC reported on how Pearson, one of the largest textbook publishing companies in the world, is getting out of the print business. This is very much along the lines of Ford Motor Company's recent announcement that it will stop producing cars. While the jury is still out on whether the latter is a good idea, in many respects it is a matter of economics.


Unsupervised Question Answering by Cloze Translation

arXiv.org Artificial Intelligence

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we explore to what extent high-quality training data is actually required for Extractive QA, and investigate the possibility of unsupervised Extractive QA. We approach this problem by first learning to generate context, question and answer triples in an unsupervised manner, which we then use to synthesize Extractive QA training data automatically. To generate such triples, we first sample random context paragraphs from a large corpus of documents and then random noun phrases or named entity mentions from these paragraphs as answers. Next, we convert answers in context to "fill-in-the-blank" cloze questions and finally translate them into natural questions. We propose and compare various unsupervised ways to perform cloze-to-natural question translation, including training an unsupervised NMT model using non-aligned corpora of natural questions and cloze questions, as well as a rule-based approach. We find that modern QA models can learn to answer human questions surprisingly well using only synthetic training data. We demonstrate that, without using the SQuAD training data at all, our approach achieves 56.4 F1 on SQuAD v1 (64.5 F1 when the answer is a named entity mention), outperforming early supervised models.
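
As a rough sketch of the pipeline described above, the snippet below samples an answer span with spaCy, blanks it out to form a cloze question, and applies a crude rule-based cloze-to-question step. The model name, the wh-word table, and the example paragraph are illustrative assumptions, not the authors' code.

# Minimal sketch of the synthetic-data generation step, assuming spaCy
# ("en_core_web_sm") for noun phrases and named entities; the wh-word mapping
# is a hypothetical stand-in for the paper's rule-based translation variant.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

WH_BY_ENT_TYPE = {  # hypothetical rule table for cloze-to-question translation
    "PERSON": "Who", "ORG": "What organization", "GPE": "Where",
    "DATE": "When", "MONEY": "How much",
}

def make_synthetic_qa(paragraph):
    doc = nlp(paragraph)
    candidates = list(doc.ents) or list(doc.noun_chunks)
    if not candidates:
        return None
    answer = random.choice(candidates)
    # Step 1: cloze question -- blank out the chosen answer span in its sentence.
    sentence = answer.sent.text
    cloze = sentence.replace(answer.text, "_____", 1)
    # Step 2: crude rule-based translation of the cloze into a natural question.
    wh = WH_BY_ENT_TYPE.get(answer.label_, "What")
    question = f"{wh} {cloze.replace('_____', '').strip().rstrip('.')}?"
    return {"context": paragraph, "question": question, "answer": answer.text}

print(make_synthetic_qa("Marie Curie won the Nobel Prize in Physics in 1903."))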


Question Answering as Global Reasoning over Semantic Abstractions

arXiv.org Artificial Intelligence

We propose a novel method for exploiting the semantic structure of text to answer multiple-choice questions. The approach is especially suitable for domains that require reasoning over a diverse set of linguistic constructs but have limited training data. To address these challenges, we present the first system, to the best of our knowledge, that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pre-trained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, we translate question answering (QA) into a search for an optimal subgraph that satisfies certain global and local properties. This formulation generalizes several prior structured QA systems. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.
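
The subgraph-search idea can be illustrated at toy scale. The sketch below uses only spaCy dependency parses as the semantic abstraction and a greedy path-based score in place of the paper's ILP over semantic role labels, coreference, and other graphs; every name in it is an assumption, not the SEMANTICILP implementation.

# Toy proxy for "QA as search over a linguistic graph": nodes are tokens, edges
# follow dependency arcs, and an answer option scores higher when its terms are
# connected to the question terms by short paths in the paragraph's graph.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def build_graph(text):
    g = nx.Graph()
    for tok in nlp(text):
        g.add_edge(tok.lower_, tok.head.lower_)
    return g

def score_option(graph, question, option):
    q_terms = [t.lower_ for t in nlp(question) if t.is_alpha and not t.is_stop]
    o_terms = [t.lower_ for t in nlp(option) if t.is_alpha]
    score = 0.0
    for q in q_terms:
        for o in o_terms:
            if q in graph and o in graph and nx.has_path(graph, q, o):
                score += 1.0 / (1 + nx.shortest_path_length(graph, q, o))
    return score

paragraph = "Plants use sunlight to make food through photosynthesis."
question = "How do plants make their food?"
options = ["through photosynthesis", "by absorbing soil"]
g = build_graph(paragraph)
print(max(options, key=lambda o: score_option(g, question, o)))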


Cognitive Services Text Analytics' Named Entity Recognition is now available | Azure updates | Microsoft Azure

#artificialintelligence

We are happy to announce the general availability of Named Entity Recognition for English and Spanish as part of the Azure Cognitive Services Text Analytics API. Named Entity Recognition (NER) is the ability to take free-form text and identify the occurrences of entities such as people, locations, organizations, and more. With a simple API call, NER in Text Analytics uses robust machine learning models to find and categorize more than twenty types of named entities in any text document.
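
For readers who want to try the service, a minimal call might look like the following, assuming the azure-ai-textanalytics Python SDK (a newer client than the API version announced here); the endpoint and key are placeholders you supply.

# Hedged example: assumes the azure-ai-textanalytics package is installed
# (pip install azure-ai-textanalytics); endpoint and key are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

docs = ["Microsoft was founded by Bill Gates and Paul Allen in Albuquerque."]
for doc in client.recognize_entities(documents=docs):
    if not doc.is_error:
        for entity in doc.entities:
            # Print each detected entity with its category and confidence.
            print(entity.text, entity.category, round(entity.confidence_score, 2))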


Ambiverse - an amazing open-source suite for natural language understanding

#artificialintelligence

While doing performance benchmarks of Named Entity Linking solutions for our AI/FinTech start-up Risklio, I stumbled upon a very powerful, recently open-sourced framework called AmbiverseNLU. It was developed by Ambiverse and is based on work previously done at the Max Planck Institute¹. The components it builds on are better known: entity recognition from KnowNER², open information extraction using ClausIE³, and AIDA, an entity detection and disambiguation tool⁴. You can have a look at the demo here. For the former you can choose whether to use Apache Cassandra or PostgreSQL as a backend, while the last one uses Neo4j.


Coherent Comment Generation for Chinese Articles with a Graph-to-Sequence Model

arXiv.org Artificial Intelligence

Automatic article commenting is helpful in encouraging user engagement and interaction on online news platforms. However, news documents are usually too long for traditional encoder-decoder based models, which often results in general and irrelevant comments. In this paper, we propose to generate comments with a graph-to-sequence model that models the input news as a topic interaction graph. By organizing the article into a graph structure, our model can better capture the internal structure of the article and the connections between topics, which makes it better able to understand the story. We collect and release a large-scale news-comment corpus from Tencent Kuaibao, a popular Chinese online news platform. Extensive experimental results show that our model generates much more coherent and informative comments than several strong baseline models.
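
To make the idea of a topic interaction graph concrete, here is a small sketch that treats frequent noun chunks as topic vertices and sentence-level co-occurrence as weighted edges. It illustrates the graph-construction step only, not the paper's model or decoder, and all names and parameters are assumptions.

# Illustrative topic interaction graph: vertices are frequent noun-chunk lemmas,
# an edge links two topics that appear in the same sentence, and the edge weight
# counts how often they co-occur. Not the paper's implementation.
from collections import Counter
from itertools import combinations
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def topic_interaction_graph(article, max_topics=10):
    doc = nlp(article)
    topics = [c.root.lemma_.lower() for c in doc.noun_chunks if not c.root.is_stop]
    keep = {t for t, _ in Counter(topics).most_common(max_topics)}
    g = nx.Graph()
    for sent in doc.sents:
        present = {c.root.lemma_.lower() for c in sent.noun_chunks} & keep
        for a, b in combinations(sorted(present), 2):
            weight = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=weight + 1)
    return g

g = topic_interaction_graph(
    "The team released a new model. The model reads long articles. "
    "Readers leave comments on articles, and comments mention the model."
)
print(g.edges(data=True))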


Biomedical Named Entity Recognition via Reference-Set Augmented Bootstrapping

arXiv.org Machine Learning

We present a weakly-supervised data augmentation approach to improve Named Entity Recognition (NER) in a challenging domain: extracting biomedical entities (e.g., proteins) from the scientific literature. First, we train a neural NER (NNER) model over a small seed of fully-labeled examples. Second, we use a reference set of entity names (e.g., proteins in UniProt) to identify entity mentions with high precision, but low recall, on an unlabeled corpus. Third, we use the NNER model to assign weak labels to the corpus. Finally, we retrain our NNER model iteratively over the augmented training set, including the seed, the reference-set examples, and the weakly-labeled examples, which improves model performance. We show empirically that this augmented bootstrapping process significantly improves NER performance, and discuss the factors impacting the efficacy of the approach.
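
A schematic of this bootstrapping loop might look like the sketch below. The callables train_fn and predict_fn are hypothetical stand-ins for whatever neural NER library is used; only the high-precision reference-set matcher is spelled out, and none of this is the authors' code.

# Schematic of the four steps above: train on the seed, add reference-set matches
# and model-predicted weak labels from the unlabeled corpus, then retrain.
def reference_set_matches(tokens, reference_set):
    """High-precision, low-recall weak labels: exact matches against known names."""
    labels = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in reference_set:
            labels[i] = "B-PROTEIN"
    return labels

def bootstrap_ner(train_fn, predict_fn, seed, unlabeled, reference_set, rounds=3):
    """seed: list of (tokens, labels); unlabeled: list of token lists.
    train_fn and predict_fn are user-supplied (hypothetical) NER helpers."""
    model = train_fn(seed)
    for _ in range(rounds):
        ref_examples = [(toks, reference_set_matches(toks, reference_set))
                        for toks in unlabeled]
        weak_examples = [(toks, predict_fn(model, toks)) for toks in unlabeled]
        # Retrain over the augmented set: seed + reference-set + weak labels.
        model = train_fn(seed + ref_examples + weak_examples)
    return model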


NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields

#artificialintelligence

In the world of Natural Language Processing (NLP), the most basic models are based on Bag of Words. But such models fail to capture the syntactic relations between words. For example, suppose we build a sentiment analyser based only on Bag of Words. Such a model will not be able to capture the difference between "I like you", where "like" is a verb with a positive sentiment, and "I am like you", where "like" is a preposition with a neutral sentiment. So this leaves us with a question -- how do we improve on the Bag of Words technique?
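
One quick way to see what a part-of-speech layer adds is to tag both example sentences, for instance with NLTK's default tagger. The snippet below only illustrates the distinction; it is not the CRF tagger the guide goes on to build, and the exact tags may vary by tagger version.

# Tag the two sentences from the example above with NLTK's default POS tagger.
import nltk
# One-time downloads (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

for sentence in ["I like you", "I am like you"]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Expected: "like" is tagged as a verb (VBP) in the first sentence and as a
# preposition (IN) in the second -- exactly the distinction Bag of Words misses.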


Learning Multilingual Word Embeddings Using Image-Text Data

arXiv.org Artificial Intelligence

There has been significant interest recently in learning multilingual word embeddings -- embeddings in which semantically similar words across languages lie close together. State-of-the-art approaches have relied on expensive labeled data, which is unavailable for low-resource languages, or have involved post-hoc unification of monolingual embeddings. In the present paper, we investigate the efficacy of multilingual embeddings learned from weakly-supervised image-text data. In particular, we propose methods for learning multilingual embeddings using image-text data, by enforcing similarity between the representation of the image and that of the text. Our experiments reveal that even without using any expensive labeled data, a bag-of-words-based embedding model trained on image-text data achieves performance comparable to the state of the art on cross-lingual semantic similarity tasks.
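
A minimal sketch of "enforce similarity between the image representation and the text representation" could look like the following, assuming PyTorch; the bag-of-words encoder, the dimensions, and the contrastive loss are illustrative choices, not the paper's exact architecture.

# Sketch: map bag-of-words captions and image feature vectors into a shared
# embedding space, and push matched image-text pairs to be more similar than
# mismatched ones via a simple contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextEmbedder(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, emb_dim=256):
        super().__init__()
        self.word_emb = nn.EmbeddingBag(vocab_size, emb_dim)  # bag-of-words text encoder
        self.img_proj = nn.Linear(img_dim, emb_dim)           # project image features

    def forward(self, token_ids, offsets, img_feats):
        txt = F.normalize(self.word_emb(token_ids, offsets), dim=-1)
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        return txt, img

def contrastive_loss(txt, img, temperature=0.1):
    """Each caption should be closest to its own image within the batch."""
    logits = txt @ img.t() / temperature
    targets = torch.arange(txt.size(0))
    return F.cross_entropy(logits, targets)

# Toy batch: 4 captions (in any language) paired with 4 image feature vectors.
model = ImageTextEmbedder(vocab_size=1000)
token_ids = torch.randint(0, 1000, (12,))   # concatenated caption token ids
offsets = torch.tensor([0, 3, 6, 9])        # start index of each caption
img_feats = torch.randn(4, 2048)
txt, img = model(token_ids, offsets, img_feats)
print(contrastive_loss(txt, img).item())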