

Collaborative Biomedical Information Retrieval

AAAI Conferences

In the context of two related NIH projects supporting scientific collaboration, we seek to implement an environment for collaborative information retrieval and analysis based on utility theory.


OCR-Based Image Features for Biomedical Image and Article Classification: Identifying Documents Relevant to Genomic Cis-Regulatory Elements

AAAI Conferences

Images form a significant, yet under-utilized, information source in published biomedical articles. Much current work on biomedical image retrieval and classification uses simple, standard image representation employing features such as edge direction or gray scale histograms. In our earlier work we have used such features as well to classify images, where image-class-tags have been used to represent and classify complete articles. Here we focus on a different literature classification task: identifying articles discussing cis-regulatory elements and modules, motivated by the need to understand complex gene-networks. Curators attempting to identify such articles use as a major cue a certain type of image in which the conserved cis-regulatory region on the DNA is shown. Our experiments show that automatically identifying such images using common image features (such as gray scale) is highly error prone. However, using Optical Character Recognition (OCR) to extract alphabet characters from images, calculating character distribution and using the distribution parameters as image features, forms a novel image representation, which allows us to identify DNA-content in images with high precision and recall (over 0.9). Utilizing the occurrence of DNA-rich images within articles, we train a classifier to identify articles pertaining to cis-regulatory elements with a similarly high precision and recall. Using OCR-based image features has much potential beyond the current task, to identify other types of biomedical sequence-based images showing DNA, RNA and proteins. Moreover, automatically identifying such images is applicable beyond the current use-case, in other important biomedical document classification tasks.
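To make the idea of character-distribution features concrete, here is a minimal sketch. It assumes the OCR step has already produced a text string for each image; the specific features (relative frequencies of A, C, G, T and their combined share) and the threshold rule are illustrative assumptions, not the distribution parameters or the trained classifier used by the authors.

```python
from collections import Counter

DNA_LETTERS = "ACGT"


def char_distribution_features(ocr_text: str) -> dict:
    """Relative letter frequencies for text recovered from one image.

    Hypothetical feature set: the abstract does not list the exact
    distribution parameters, so these are stand-ins.
    """
    letters = [c.upper() for c in ocr_text if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {"n_letters": 0, "dna_fraction": 0.0}
    counts = Counter(letters)
    features = {f"freq_{ch}": counts[ch] / total for ch in DNA_LETTERS}
    features["n_letters"] = total
    features["dna_fraction"] = sum(counts[ch] for ch in DNA_LETTERS) / total
    return features


def looks_dna_rich(ocr_text: str, min_letters: int = 20, threshold: float = 0.9) -> bool:
    """Simple threshold rule standing in for a trained image classifier."""
    f = char_distribution_features(ocr_text)
    return f["n_letters"] >= min_letters and f["dna_fraction"] >= threshold
```

A figure showing an aligned sequence yields text dominated by four letters, whereas an ordinary caption spreads its mass over the whole alphabet, which is why even this crude rule separates the two cases.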


Notes about the OntoGene Pipeline

AAAI Conferences

In this paper we describe the architecture of the OntoGene relation mining pipeline and some of its recent applications. With this research overview paper we intend to contribute to the recently started discussion on standards for information extraction architectures in the biomedical domain. Our approach delivers domain entities mentioned in each input document, as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation.


Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions

AAAI Conferences

Clinical research on new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of multi-dimensional latent text models, such as factorial LDA, that capture orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results.


Subgraph Matching-Based Literature Mining for Biomedical Relations and Events

AAAI Conferences

Extracting important relations between biological components and semantic events involving genes or proteins from literature has become a focus for the biomedical text mining community. In this paper, we review a subgraph matching-based approach proposed in our previous work for mining relations and events in the biomedical literature. Our subgraph matching algorithm is formally presented, along with a detailed analysis of its complexity. We present three different relation/event extraction tasks in which our approach has been successfully applied. Our approach is of considerable value in extracting highly precise, binary relations when appropriate training data is available.


Integration of UMLS and MEDLINE in Unsupervised Word Sense Disambiguation

AAAI Conferences

Scarcity of training data for word sense disambiguation argues for the use of knowledge-based disambiguation methods, which rely on information available in terminological resources. Unfortunately, these resources are not generally optimized to perform word sense disambiguation. On the other hand, there are many examples of ambiguous biomedical words with context in MEDLINE. However, these examples of ambiguity are not labeled with their proper sense. We propose the integration of the UMLS and MEDLINE to create concept profiles which are used to perform knowledge-based word sense disambiguation. Our results show an accuracy of 0.8770 on a biomedical word sense disambiguation data set; this represents a statistically significant improvement over other knowledge-based methods based on the UMLS on this data set.
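One common way to use concept profiles for disambiguation is to compare the words around the ambiguous term with a bag-of-words profile for each candidate concept and pick the closest one. The sketch below assumes that setup with cosine similarity; the abstract does not state how profiles are built or compared, so the scoring function and the sense labels are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context_tokens, profiles):
    """Return the concept whose profile is most similar to the context.

    profiles: dict mapping a concept label to a Counter of profile terms
    (assumed to have been built beforehand from UMLS and MEDLINE text).
    """
    context = Counter(context_tokens)
    return max(profiles, key=lambda cid: cosine(context, profiles[cid]))


# Hypothetical profiles for two senses of the ambiguous word "cold".
profiles = {
    "sense_temperature": Counter({"temperature": 3, "exposure": 2, "water": 1}),
    "sense_common_cold": Counter({"virus": 3, "cough": 2, "infection": 1}),
}
print(disambiguate(["patient", "cough", "virus"], profiles))
```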


Automatic Formalization of Clinical Practice Guidelines

AAAI Conferences

Current efforts aim to incorporate knowledge from clinical practice guidelines (CPGs) into computer systems using sophisticated interchange formats. Due to their complexity, such formats require expensive manual formalization work. This paper presents a preliminary study of using natural language processing (NLP) to automatically formalize CPG recommendations. We developed a CPG representation using concepts from the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED–CT), and manually applied this representation to a sample of CPG recommendations that is representative of multiple medical domains and recommendation types. Using this resource, we trained and evaluated a supervised classification model that formalizes new CPG recommendations according to the SNOMED–CT representation, achieving a precision of 75% and recall of 42% (F1 = 54%). We have identified two important lines of future investigation: (1) feature engineering to address the unique linguistic properties of CPG recommendations, and (2) alternative model formulations that are more robust to processing errors. A third line of investigation – creating additional training data for the NLP model – is shown to be of little utility.
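The reported F1 follows from the stated precision and recall, since F1 is their harmonic mean. A short check, using only the two figures given in the abstract:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 75% and recall 42%, as reported in the abstract.
print(round(f1_score(0.75, 0.42), 2))  # 0.54
```

Because the harmonic mean is pulled toward the smaller value, the low recall dominates the combined score.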


An Inference Method for Disease Name Normalization

AAAI Conferences

PubMed® and other literature databases contain a wealth of information on diseases and their diagnosis/treatment in the form of scientific publications. In order to take advantage of such rich information, several text-mining tools have been developed for automatically detecting mentions of disease names in PubMed abstracts. The next important step is the normalization of the various disease names to standardized vocabulary entries and medical dictionaries. To this end, we present an automatic approach for mapping disease names in PubMed abstracts to their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). For developing our algorithm, we merged disease concept annotations from two existing corpora. In addition, we hand-annotated a separate test set of disease concepts for evaluating our method. Unlike others, we reformulate the disease name normalization task as an information retrieval task where input queries are disease names and search results are disease concepts. As such, our inference method builds on existing Lucene search and further improves it by taking into account the string similarity of query terms to the disease concept name and synonyms. Evaluation results show that our method compares favorably to other state-of-the-art approaches. In conclusion, we find that our approach is a simple and effective way of linking disease names to controlled vocabularies and that the merged disease corpus provides added value for the development of text mining tools for named entity recognition from biomedical text. Data is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html
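The retrieval-then-rerank idea can be sketched as follows. This assumes a search engine has already returned candidate concepts with scores scaled to [0, 1]; the linear combination, the `weight` parameter, the use of `difflib.SequenceMatcher` as the similarity measure, and the concept identifiers are all illustrative assumptions rather than the authors' scoring method.

```python
from difflib import SequenceMatcher


def best_similarity(query: str, names) -> float:
    """Highest string similarity between the query and any name or synonym."""
    q = query.lower()
    return max(
        (SequenceMatcher(None, q, n.lower()).ratio() for n in names),
        default=0.0,
    )


def rerank(query: str, candidates, weight: float = 0.5):
    """Reorder search hits by mixing search score with string similarity.

    candidates: list of (concept_id, search_score, [name, synonym, ...]).
    Returns concept ids, best first.
    """
    scored = []
    for concept_id, search_score, names in candidates:
        combined = (1 - weight) * search_score + weight * best_similarity(query, names)
        scored.append((combined, concept_id))
    scored.sort(reverse=True)
    return [concept_id for _, concept_id in scored]


# Placeholder identifiers, not real MeSH or OMIM ids.
hits = [
    ("C2", 0.9, ["Neoplasms", "cancer"]),
    ("C1", 0.8, ["Breast Neoplasms", "breast cancer"]),
]
print(rerank("breast cancer", hits))
```

In this toy case the exact synonym match lifts the second search hit above the first, which is the kind of correction a similarity term is meant to provide.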


Discovering Health Beliefs in Twitter

AAAI Conferences

Social networking websites such as Twitter have invigorated a wide range of studies in recent years, ranging from consumer opinions on products to tracking the spread of diseases. While sentiment analysis and opinion mining from tweets have been studied extensively, surveillance of beliefs, especially those related to public health, has received considerably less attention. In our previous work, we proposed a model for surveillance of health beliefs on Twitter relying on the use of hand-picked probe statements expressing various health-related propositions. In this work we extend our model to automatically discover various probes related to public health beliefs. We present a data-driven approach based on two distinct datasets and study the prevalence of public belief, disbelief or doubt for newly discovered probe statements.