BioASQ: A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Tsatsaronis, George (Technische Universität Dresden) | Schroeder, Michael (Technische Universität Dresden) | Paliouras, Georgios (NCSR Demokritos, Athens) | Almirantis, Yannis (NCSR Demokritos, Athens) | Androutsopoulos, Ion (Athens University of Economics and Business) | Gaussier, Eric (Université Joseph Fourier) | Gallinari, Patrick (Université Pierre et Marie Curie LIP6) | Artieres, Thierry (Université Pierre et Marie Curie LIP6) | Alvers, Michael R. (Transinsight GmbH) | Zschunke, Matthias (Transinsight GmbH) | Ngomo, Axel-Cyrille Ngonga (University of Leipzig)
This article provides an overview of BioASQ, a new competition on biomedical semantic indexing and question answering (QA). BioASQ aims to push towards systems that will allow biomedical workers to express their information needs in natural language and that will return concise and user-understandable answers by combining information from multiple sources of different kinds, including biomedical articles, databases, and ontologies. BioASQ encourages participants to adopt semantic indexing as a means to combine multiple information sources and to facilitate the matching of questions to answers. It also adopts a broad semantic indexing and QA architecture that subsumes current relevant approaches, even though no current system instantiates all of its components. Hence, the architecture can also be seen as our view of how relevant work from fields such as information retrieval, hierarchical classification, question answering, ontologies, and linked data can be combined, extended, and applied to biomedical question answering. BioASQ will develop publicly available benchmarks and it will adopt and possibly refine existing evaluation measures. The evaluation infrastructure of the competition will remain publicly available beyond the end of BioASQ.
Efficient Classification of Clinical Reports Utilizing Natural Language Processing
Sarioglu, Efsun (The George Washington University) | Yadav, Kabir (The George Washington University) | Choi, Hyeong-Ah (The George Washington University)
The recent emphasis on health information technology has highlighted the importance of leveraging the large amount of electronic clinical data to help guide medical decision-making. Developing such clinical decision aids requires manual review of many past patient reports in order to generate a good predictive model. In this research, we investigate classification of clinical reports using natural language processing (NLP). The proposed system uses NLP to generate structured output from computed tomography (CT) reports and then machine learning techniques to code for the presence of clinically important injuries for traumatic orbital fracture victims. Our results show that NLP improves upon raw text classification results.
Investigating Twitter as a Source for Studying Behavioral Responses to Epidemics
Lamb, Alex (Johns Hopkins University) | Paul, Michael J. (Johns Hopkins University) | Dredze, Mark (Johns Hopkins University)
Recent studies have shown an ability to track influenza rates from Twitter, since Twitter users tweet about their illnesses (“i am home sick with the flu”). However, users may also tweet concerned awareness of illness (“don’t want to get sick, need a flu shot”). Identifying these messages can support computational epidemic response models. We present preliminary results for mining tweets that express concerned awareness of influenza. We describe our data set construction and experiments with binary classification of data into influenza versus general messages and classification into concerned awareness and existing infection.
Global and Local Approach of Part-of-Speech Tagging for Large Corpora
Yu, Shi (University of Chicago) | Grossman, Robert (University of Chicago) | Rzhetsky, Andrey (University of Chicago)
We present Global-Local POS tagging, a framework to train generative stochastic Part-of-Speech models on large corpora. Global Taggers offer several advantages over their counterparts trained on small, curated corpora, including the ability to automatically extend and update their models to new text. Global Taggers also avoid a fundamental limitation of current models, whose performance heavily relies on curated text with manually assigned labels. We illustrate our approach by training several Global Taggers, implemented with generative stochastic models, on two large corpora using high performance computing architecture. We further demonstrate that Global Taggers can be improved by incorporating models trained on curated text, called Local Taggers, for better tagging performance on text from specific topics.
Towards Effective Representation of Clinical Documents for Search and Retrieval
Davis, Anthony R. (3M Health Information Systems) | Nossal, Michael (3M Health Information Systems) | Ober, N. Stephen (3M Health Information Systems)
Recent studies have demonstrated the advantages of structured search of PubMed abstracts when compared with unstructured keyword search. We explore whether search on clinical text is similarly enhanced by representing domain-specific structures, information, and knowledge. Examples include representations of document structure and sections, local context such as negation, and appropriate modeling of scalar quantities. We examine tasks ranging from recruitment of suitable patients for studies, to chronic disease prevention and management, to longitudinal studies of individual patients or groups, as well as comparative experiments performed on an NLP-enhanced clinical search tool that operates on large corpora of clinical text.
PROBE: Periodic Random Orbiter Algorithm for Machine Learning
Smith, Larry (National Institutes of Health) | Kim, Won (National Institutes of Health) | Wilbur, W. John
We present a new algorithm, which we call PROBE, to find the minimum of a convex function. Such a minimization is important in many machine learning methods, including Support Vector Machines (SVM). We show that PROBE is a viable alternative to published algorithms for SVM learning with several important advantages. PROBE is a simple and easily programmed algorithm, with a well-defined, parametrized stopping criterion; it is not limited to SVM, but can be applied to other convex loss functions, such as the Huber and Maximum Entropy models; and its time and memory requirements are consistently modest in handling very large training sets.
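As a point of reference for the kind of convex objective the abstract mentions, the following is a minimal sketch of a regularized hinge-loss objective for a linear SVM. It is not the PROBE algorithm, which the abstract does not describe; the function names, the regularization weight `lam`, and the toy data are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def hinge_loss(margin: float) -> float:
    """Hinge loss on a signed margin y * <w, x>; convex in the margin."""
    return max(0.0, 1.0 - margin)


def svm_objective(
    w: Sequence[float],
    data: List[Tuple[Sequence[float], int]],
    lam: float = 0.1,
) -> float:
    """Regularized average hinge loss for a linear classifier.

    `lam` is a hypothetical regularization weight; labels are +1 or -1.
    """
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    avg = sum(
        hinge_loss(y * sum(wi * xi for wi, xi in zip(w, x))) for x, y in data
    ) / len(data)
    return reg + avg


# Toy, made-up training set of (features, label) pairs.
toy_data = [([1.0, 2.0], 1), ([-1.0, 0.5], -1)]
value_at_zero = svm_objective([0.0, 0.0], toy_data)
```

Any minimizer of a convex function, whether PROBE or a standard solver, would search for the weight vector `w` that makes this value smallest.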
OCR-Based Image Features for Biomedical Image and Article Classification: Identifying Documents Relevant to Genomic Cis-Regulatory Elements
Shatkay, Hagit (University of Delaware) | Narayanaswamy, Ramya (University of Delaware) | Nagaral, Santosh S. (University of Delaware) | Harrington, Na (Queen's University) | MV, Rohith (University of Delaware) | Somanath, Gowri (University of Delaware) | Tarpine, Ryan (Brown University) | Schutter, Kyle (Brown University) | Johnstone, Tim (Brown University) | Blostein, Dorothea (Queen's University) | Istrail, Sorin (Brown University) | Kambhamettu, Chandra (University of Delaware)
Images form a significant, yet under-utilized, information source in published biomedical articles. Much current work on biomedical image retrieval and classification uses simple, standard image representation employing features such as edge direction or gray scale histograms. In our earlier work we also used such features to classify images, and the resulting image-class tags were used to represent and classify complete articles. Here we focus on a different literature classification task: identifying articles discussing cis-regulatory elements and modules, motivated by the need to understand complex gene-networks. Curators attempting to identify such articles use as a major cue a certain type of image in which the conserved cis-regulatory region on the DNA is shown. Our experiments show that automatically identifying such images using common image features (such as gray scale) is highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabet characters from images, calculating the character distribution, and using the distribution parameters as image features forms a novel image representation, which allows us to identify DNA-content in images with high precision and recall (over 0.9). Utilizing the occurrence of DNA-rich images within articles, we train a classifier to identify articles pertaining to cis-regulatory elements with a similarly high precision and recall. Using OCR-based image features has much potential beyond the current task, to identify other types of biomedical sequence-based images showing DNA, RNA and proteins. Moreover, automatically identifying such images is applicable beyond the current use-case, in other important biomedical document classification tasks.
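The character-distribution idea can be illustrated with a minimal sketch that starts from text already extracted by an OCR engine. It is not the authors' feature set or classifier: the function names, the `threshold` and `min_letters` values, and the simple cut-off rule (standing in for a trained classifier) are all assumptions for illustration.

```python
import string
from collections import Counter
from typing import Dict


def char_distribution_features(ocr_text: str) -> Dict[str, float]:
    """Relative frequency of each letter A-Z in OCR output, plus summary values."""
    letters = [c for c in ocr_text.upper() if c in string.ascii_uppercase]
    total = len(letters)
    counts = Counter(letters)
    feats = {
        ch: (counts[ch] / total if total else 0.0)
        for ch in string.ascii_uppercase
    }
    # Share of letters drawn from the DNA alphabet (hypothetical summary feature).
    feats["dna_fraction"] = sum(feats[ch] for ch in "ACGT")
    feats["n_letters"] = float(total)
    return feats


def looks_dna_rich(
    ocr_text: str, threshold: float = 0.9, min_letters: int = 20
) -> bool:
    """Toy decision rule; both cut-offs are illustrative assumptions."""
    feats = char_distribution_features(ocr_text)
    return feats["n_letters"] >= min_letters and feats["dna_fraction"] >= threshold


sequence_panel = "ACGTACGTTTGACCAGTACGATCG"
caption_text = "Figure 3. Expression levels across samples"
```

In this sketch, `looks_dna_rich(sequence_panel)` is true and `looks_dna_rich(caption_text)` is false; a real system would feed the full distribution to a learned model rather than apply a fixed threshold.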
Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions
Paul, Michael J. (Johns Hopkins University) | Dredze, Mark (Johns Hopkins University)
Clinical research of new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of multi-dimensional latent text models, such as factorial LDA, that capture orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results.
Subgraph Matching-Based Literature Mining for Biomedical Relations and Events
Liu, Haibin (University of Colorado School of Medicine) | Keselj, Vlado (Dalhousie University) | Blouin, Christian (Dalhousie University) | Verspoor, Karin (National ICT Australia)
Extracting important relations between biological components and semantic events involving genes or proteins from literature has become a focus for the biomedical text mining community. In this paper, we review a subgraph matching-based approach proposed in our previous work for mining relations and events in the biomedical literature. Our subgraph matching algorithm is formally presented, along with a detailed analysis of its complexity. We present three different relation/event extraction tasks in which our approach has been successfully applied. Our approach is of considerable value in extracting highly precise, binary relations when appropriate training data is available.
Integration of UMLS and MEDLINE in Unsupervised Word Sense Disambiguation
Yepes, Antonio Jimeno (National Library of Medicine) | Aronson, Alan R. (National Library of Medicine)
Scarcity of training data for word sense disambiguation argues for the use of knowledge-based disambiguation methods, which rely on information available in terminological resources. Unfortunately, these resources are not generally optimized to perform word sense disambiguation. On the other hand, there are many examples of ambiguous biomedical words with context in MEDLINE. However, these examples of ambiguity are not labeled with their proper sense. We propose the integration of the UMLS and MEDLINE to create concept profiles which are used to perform knowledge-based word sense disambiguation. Our results show an accuracy of 0.8770 on a biomedical word sense disambiguation data set; this represents a statistically significant improvement over other knowledge-based methods based on the UMLS on this data set.
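Knowledge-based disambiguation with concept profiles can be sketched as follows, assuming each candidate sense is represented by a bag of words and the sense most similar to the context wins. This is not the authors' method in detail: the cosine similarity measure, the sense labels, and the toy profiles are assumptions made for illustration, and real profiles would be built from UMLS and MEDLINE.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(
    context_tokens: Iterable[str], concept_profiles: Dict[str, Counter]
) -> str:
    """Return the sense whose profile is most similar to the context."""
    ctx = Counter(context_tokens)
    return max(concept_profiles, key=lambda c: cosine(ctx, concept_profiles[c]))


# Hypothetical profiles for two senses of the ambiguous word "cold".
profiles = {
    "common_cold": Counter({"virus": 3, "infection": 2, "symptoms": 2}),
    "cold_temperature": Counter({"temperature": 3, "exposure": 2, "water": 1}),
}
chosen = disambiguate(["patient", "exposure", "low", "temperature"], profiles)
```

Here `chosen` is `"cold_temperature"`, because only that profile shares words with the context.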