Text Mining

PhD Position Human-Centered Information Extraction from City Archival Data


The Knowledge and Intelligence Design section in the Department of Sustainable Design Engineering of the Faculty of Industrial Design Engineering (IDE) offers a PhD position for a duration of four years. The PhD candidate will be supervised by Prof. Alessandro Bozzon. The research work will be conducted in the context of a collaboration between TU Delft, the Amsterdam City Archive, and the CTO Office of the Municipality of Amsterdam. The goal is to investigate human-centered artificial intelligence methods for the preservation of large collections of archival documents, which are a valuable source of knowledge for cultural, social and urban research on a given city. To fully unlock the knowledge contained in the archives and facilitate the exploration and exploitation of the collections, there is a need for techniques to digitize the archives; to extract structured data, namely Named Entities (NEs) such as persons, locations, and events, from unstructured archival documents; and to link the extracted entities to knowledge bases.

8 Open-source/Free Text Mining and Text Analysis solutions


Ever wanted to analyze text documents or articles? Several tools and web services offer this, but what about desktop programs? In this article, we have collected several tools to help you do that, and better still, they are free and open-source. We will try to list the specific and unique features of each item to make it easy for our readers to pick what they need. Orange is an open-source platform for machine learning, data analysis, text mining and data visualization.

HubofML - Newsletter #8


A comprehensive overview of techniques for structured key-value pair information extraction from invoices. The post reviews research papers that explore data extraction and touch upon how to get started implementing the methods.
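As a rough illustration of what key-value extraction from an invoice means, here is a minimal rule-based sketch. It is not one of the methods the post reviews; the field names, the regular expressions, and the sample invoice text are all assumptions made up for this example, and real invoices vary far more in layout and wording.

```python
import re

# Hypothetical field patterns (assumptions for illustration only).
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(
        r"Total\s*(?:Due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return a dict mapping each field name to its first match, or None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Made-up invoice text, already converted from an image or PDF to plain text.
SAMPLE = """ACME Corp
Invoice No: INV-2041
Date: 2020-03-15
Total Due: $1,250.00
"""

FIELDS = extract_fields(SAMPLE)
```

Hand-written patterns like these break as soon as a vendor changes the layout, which is one reason learned extraction models are of interest for this task.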

Digitization, Digital Transformations and Humans in the Loop Workflows


This article will take you through what digital transformations are, what drives them, how to aid successful digital transformations, how AI and deep learning can help, the challenges you might face in implementation and how to work around them. We will also talk about what the current pace of technological growth means for the future of work and what we can do about the paranoia that goes along with increasing automation. While talking about singularity or Skynet taking over is not the point of this blog, it would be a little apathetic to not acknowledge the risks that come with acceleration in technological advancement. Have a data extraction problem in mind? Head over to Nanonets and start building models for free!

Text Mining with R: The Free eBook - KDnuggets


I readily admit that I'm biased toward Python. This isn't intentional -- such is the case with many biases -- but coming from a computer science background and having been programming since a very young age, I have naturally tended towards general purpose programming languages (Java, C, C++, Python, etc.). This is the major reason that Python books and resources are at the forefront of my radar, recommendations, and reviews. Obviously, however, not all data scientists are in this same position, given that there are innumerable paths to data science. Given that, and since R is a powerful and popular programming language for a large swath of data scientists, today let's take a look at a book which uses R as a tool to implement solutions to data science problems.

Text Mining with R


This is the website for Text Mining with R! Visit the GitHub repository for this site, find the book at O'Reilly, or buy it on Amazon.

Explaining black-box text classifiers for disease-treatment information extraction

Artificial Intelligence

Deep neural networks and other intricate Artificial Intelligence (AI) models have reached high levels of accuracy on many biomedical natural language processing tasks. However, their applicability in real-world use cases may be limited due to their vague inner working and decision logic. A post-hoc explanation method can approximate the behavior of a black-box AI model by extracting relationships between feature values and outcomes. In this paper, we introduce a post-hoc explanation method that utilizes confident itemsets to approximate the behavior of black-box classifiers for medical information extraction. Incorporating medical concepts and semantics into the explanation process, our explanator finds semantic relations between inputs and outputs in different parts of the decision space of a black-box classifier. The experimental results show that our explanation method can outperform perturbation and decision set based explanators in terms of fidelity and interpretability of explanations produced for predictions on a disease-treatment information extraction task.
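To make the idea of approximating a black box by relationships between feature values and outcomes more concrete, here is a toy sketch. It is not the confident itemsets method of the paper: it only considers single tokens, ignores medical concepts and semantics, and the `black_box` function, the thresholds, and the example sentences are all assumptions invented for illustration.

```python
from collections import Counter, defaultdict


def black_box(text):
    """Stand-in classifier (hypothetical); a real model would be opaque."""
    tokens = text.lower().split()
    return "treats" if "relieves" in tokens or "cures" in tokens else "other"


def confident_rules(texts, predict, min_support=2, min_confidence=0.9):
    """Find tokens whose presence goes with one predicted label with high confidence."""
    token_count = Counter()
    token_label_count = defaultdict(Counter)
    for text in texts:
        label = predict(text)  # query the black box; only its output is used
        for token in set(text.lower().split()):
            token_count[token] += 1
            token_label_count[token][label] += 1

    rules = []
    for token, count in token_count.items():
        if count < min_support:
            continue  # too rare to say anything about
        label, hits = token_label_count[token].most_common(1)[0]
        confidence = hits / count
        if confidence >= min_confidence:
            rules.append((token, label, confidence))
    return sorted(rules)


# Made-up sentences, not taken from any dataset.
TEXTS = [
    "aspirin relieves headache",
    "ibuprofen relieves fever",
    "aspirin causes nausea",
    "fever follows infection",
]

RULES = confident_rules(TEXTS, black_box)
```

On these four sentences the only rule that passes both thresholds is that the token "relieves" goes with the "treats" prediction, which happens to match how the stand-in classifier was written.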

MITA: An Information-Extraction Approach to the Analysis of Free-Form Text in Life Insurance Applications

AI Magazine

MetLife processes over 260,000 life insurance applications a year. Underwriting of these applications is labor intensive. Automation is difficult because the applications include many free-form text fields. MetLife's intelligent text analyzer (MITA) uses the information-extraction technique of natural language processing to structure the extensive textual fields on a life insurance application. Knowledge engineering, with the help of underwriters as domain experts, was performed to elicit significant concepts for both medical and occupational textual fields.

A survey on natural language processing (NLP) and applications in insurance

Machine Learning

Text is the most widely used means of communication today. This data is abundant but nevertheless complex to exploit within algorithms. For years, scientists have been trying to implement different techniques that enable computers to replicate some mechanisms of human reading. During the past five years, research has transformed the capacity of algorithms to unleash the value of text data. Today this brings many opportunities for the insurance industry. Understanding those methods and, above all, knowing how to apply them is a major challenge and the key to unleashing the value of text data that has been stored for many years. Processing language with computers brings many new opportunities, especially in the insurance sector, where reports are central to the information used by insurers. SCOR's Data Analytics team has been working on the implementation of innovative tools or products that enable the use of the latest research on text analysis. Understanding text mining techniques in insurance enhances the monitoring of underwritten risks and many processes that ultimately benefit policyholders. This article explains the opportunities that Natural Language Processing (NLP) provides to insurance. It details the different methods used in practice today and traces their history. We also illustrate the implementation of certain methods using open source libraries and Python code that we have developed to facilitate the use of these techniques. After giving a general overview of the evolution of text mining during the past few years, we describe how to conduct a full study with text mining and share some examples of deploying those models in insurance products or services. Finally, we explain in more detail every step of a Natural Language Processing study so that the reader can gain a deep understanding of the implementation.
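For readers new to the topic, the first steps of such a study usually involve turning raw text into countable terms. The sketch below is not the authors' code; it uses only the Python standard library, and the stop-word list and the sample report sentence are assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "was", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(text):
    """Count the tokens that remain after removing stop words."""
    return Counter(token for token in tokenize(text) if token not in STOP_WORDS)


# Made-up claim report, not real insurance data.
REPORT = (
    "The claim was filed after water damage to the kitchen. "
    "Water damage was confirmed by the adjuster."
)

COUNTS = term_counts(REPORT)
```

Counts like these are the raw material for simple bag-of-words models; more recent methods replace them with learned representations.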

Tag and Correct: Question aware Open Information Extraction with Two-stage Decoding

Artificial Intelligence

Question Aware Open Information Extraction (Question aware Open IE) takes a question and a passage as inputs and outputs an answer tuple which contains a subject, a predicate, and one or more arguments. Each field of the answer is a natural language word sequence extracted from the passage. Compared to a span answer, the semi-structured answer has two advantages: it is more readable and more falsifiable. There are two approaches to this problem. One is an extractive method, which extracts candidate answers from the passage with an Open IE model and ranks them by matching them against the question. It fully uses the passage information at the extraction step, but the extraction is independent of the question. The other is a generative method, which uses a sequence-to-sequence model to generate answers directly. It takes the question and passage together as input, but it generates the answer from scratch and does not exploit the fact that most of the answer words come from the passage. To guide the generation by the passage, we present a two-stage decoding model which contains a tagging decoder and a correction decoder. In the first stage, the tagging decoder tags keywords from the passage. In the second stage, the correction decoder generates answers based on the tagged keywords. Our model can be trained end-to-end although it has two stages. Compared to previous generative models, we generate better answers by generating from coarse to fine. We evaluate our model on WebAssertions (Yan et al., 2018), which is a Question aware Open IE dataset. Our model achieves a BLEU score of 59.32, which is better than previous generative methods.
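The answer tuple and the extractive baseline described above can be pictured with a small sketch. This is not the paper's model: the candidate tuples are hand-written stand-ins for the output of an Open IE system, the `AnswerTuple` class and `rank_candidates` function are hypothetical names, and word-overlap (Jaccard) scoring is just one simple assumption about how candidates might be matched against the question.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnswerTuple:
    """Semi-structured answer: a subject, a predicate, and one or more arguments."""
    subject: str
    predicate: str
    arguments: Tuple[str, ...]

    def words(self):
        text = " ".join((self.subject, self.predicate) + self.arguments)
        return set(re.findall(r"\w+", text.lower()))


def rank_candidates(question, candidates):
    """Sort candidates by Jaccard word overlap with the question, best first."""
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(candidate):
        c_words = candidate.words()
        union = q_words | c_words
        return len(q_words & c_words) / len(union) if union else 0.0

    return sorted(candidates, key=overlap, reverse=True)


# Made-up candidates, as if already extracted from a passage without seeing the question.
CANDIDATES = [
    AnswerTuple("Delft", "is located in", ("the Netherlands",)),
    AnswerTuple("The archive", "was founded in", ("1848",)),
]

BEST = rank_candidates("When was the archive founded?", CANDIDATES)[0]
```

Note that the candidates are fixed before the question is seen, which is the limitation of the extractive approach that the abstract points out.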