The Use of NLP to Extract Unstructured Medical Data From Text - insideBIGDATA


When working in healthcare, much of the information needed to make accurate predictions and recommendations is available only in free-text clinical notes, where it is trapped in unstructured form. Because healthcare decisions depend on this data, it is important to extract it in a way that allows the resulting information to be analyzed and used. State-of-the-art NLP algorithms can extract clinical data from text using deep learning techniques such as healthcare-specific word embeddings, named entity recognition models, and entity resolution models.

Latent Dirichlet Allocation Uncovers Spectral Characteristics of Drought Stressed Plants

Machine Learning

Understanding the adaptation process of plants to drought stress is essential for improving management practices and breeding strategies, as well as for engineering viable crops for sustainable agriculture in the coming decades. Hyper-spectral imaging provides a particularly promising approach to gaining such understanding, since it allows the non-destructive discovery of spectral characteristics of plants, which are governed primarily by the scattering and absorption characteristics of the leaf's internal structure and biochemical constituents. Several drought stress indices have been derived using hyper-spectral imaging. However, they are typically based on only a few hyper-spectral images, rely on expert interpretation, and consider only a few wavelengths. In this study, we present the first data-driven approach to discovering spectral drought stress indices, treating it as an unsupervised labeling problem at massive scale. To make use of short-range dependencies between spectral wavelengths, we develop an online variational Bayes algorithm for latent Dirichlet allocation with a convolved Dirichlet regularizer. This approach scales to massive datasets and, hence, provides a more objective complement to plant physiological practices. The spectral topics found conform to plant physiological knowledge and can be computed in a fraction of the time required by existing LDA approaches.

Knowledge Acquisition for Question Answering

AAAI Conferences

Questions can be classified based on their degree of difficulty. As the level of difficulty increases, question answering systems need to rely on richer semantic ontologies and larger knowledge bases. This paper is concerned with questions whose answers are spread across several documents and thus require answer fusion. To find such answers, the system needs to develop domain-specific ontologies. A method is presented for online acquisition of ontological information from the document collection.

A Large Margin Approach to Anaphora Resolution for Neuroscience Knowledge Discovery

AAAI Conferences

A discriminative large margin classifier based approach to anaphora resolution for neuroscience abstracts is presented. The system employs both syntactic and semantic features. A support vector machine based word sense disambiguation method, which combines evidence from three methods that use WordNet and Wikipedia, is also introduced and used for semantic features. The support vector machine anaphora resolution classifier with probabilistic outputs achieved an almost four-fold improvement in accuracy over the baseline method.

Analyzing NIH Funding Patterns over Time with Statistical Text Analysis

AAAI Conferences

In the past few years, various government funding organizations such as the U.S. National Institutes of Health and the U.S. National Science Foundation have provided access to large, publicly available online databases documenting the grants that they have funded over the past few decades. These databases provide an excellent opportunity for the application of statistical text analysis techniques to infer useful quantitative information about how funding patterns have changed over time. In this paper we analyze data from the National Cancer Institute (part of the National Institutes of Health) and show how text classification techniques provide a useful starting point for analyzing how funding for cancer research has evolved over the past 20 years in the United States.
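
The idea of classifying grant text and then tracking categories over time can be illustrated with a minimal sketch. The abstract does not say which classifier the authors used, so the example below swaps in a plain multinomial naive Bayes model with add-one smoothing as a stand-in. The category names, training snippets, years, and helper functions (`train_naive_bayes`, `classify`, `topic_counts_by_year`) are all invented for illustration and are not taken from the paper or from any NIH database.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization; real grant text would need more care.
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior under naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def topic_counts_by_year(grants, model):
    """Tally predicted categories per year from (year, abstract) pairs."""
    counts = defaultdict(Counter)
    for year, abstract in grants:
        counts[year][classify(abstract, model)] += 1
    return counts


# Hypothetical toy data, invented for this sketch.
TRAINING = [
    ("tumor genome sequencing mutation", "genomics"),
    ("gene expression mutation profiling", "genomics"),
    ("clinical trial chemotherapy patients", "treatment"),
    ("radiation therapy patients outcomes", "treatment"),
]
GRANTS = [
    (1995, "chemotherapy outcomes in patients"),
    (1995, "radiation therapy trial"),
    (2005, "mutation profiling of tumor genome"),
    (2005, "gene expression sequencing"),
    (2005, "clinical trial of chemotherapy"),
]

model = train_naive_bayes(TRAINING)
trend = topic_counts_by_year(GRANTS, model)
for year in sorted(trend):
    print(year, dict(trend[year]))
```

Counting predicted categories per year gives only a rough picture of how the mix of funded topics shifts. An analysis of actual funding would presumably weight each grant by its award amount, which this sketch leaves out.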