OCR-Based Image Features for Biomedical Image and Article Classification: Identifying Documents Relevant to Genomic Cis-Regulatory Elements

Shatkay, Hagit (University of Delaware) | Narayanaswamy, Ramya (University of Delaware) | Nagaral, Santosh S. (University of Delaware) | Harrington, Na (Queen's University) | MV, Rohith (University of Delaware) | Somanath, Gowri (University of Delaware) | Tarpine, Ryan (Brown University) | Schutter, Kyle (Brown University) | Johnstone, Tim (Brown University) | Blostein, Dorothea (Queen's University) | Istrail, Sorin (Brown University) | Kambhamettu, Chandra (University of Delaware)

AAAI Conferences 

Images form a significant, yet under-utilized, information source in published biomedical articles. Much current work on biomedical image retrieval and classification uses simple, standard image representations, employing features such as edge direction or gray-scale histograms. In our earlier work we also used such features to classify images, and used the resulting image-class tags to represent and classify complete articles. Here we focus on a different literature classification task: identifying articles that discuss cis-regulatory elements and modules, motivated by the need to understand complex gene networks. Curators attempting to identify such articles rely, as a major cue, on a certain type of image in which the conserved cis-regulatory region on the DNA is shown. Our experiments show that automatically identifying such images using common image features (such as gray scale) is highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabetic characters from images, calculating the character distribution, and using the distribution parameters as image features forms a novel image representation, which allows us to identify DNA content in images with high precision and recall (over 0.9). Utilizing the occurrence of DNA-rich images within articles, we train a classifier to identify articles pertaining to cis-regulatory elements with similarly high precision and recall. OCR-based image features have much potential beyond the current task, for identifying other types of biomedical sequence-based images showing DNA, RNA, and proteins. Moreover, automatically identifying such images is applicable beyond the current use case, in other important biomedical document classification tasks.
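The idea of turning OCR output into character-distribution features can be illustrated with a minimal sketch. It assumes the OCR step has already produced a text string for an image. The function name `char_distribution_features` and the specific features chosen (per-letter relative frequencies and the fraction of letters in A, C, G, T) are hypothetical illustrations, not the distribution parameters actually used in the paper, which the abstract does not specify.

```python
from collections import Counter
import string


def char_distribution_features(ocr_text: str) -> dict:
    """Compute simple character-distribution features from OCR'd text.

    Hypothetical illustration: the feature set here is an assumption,
    not the one reported by the authors.
    """
    # Keep only ASCII letters, folded to upper case.
    letters = [c for c in ocr_text.upper() if c in string.ascii_uppercase]
    total = len(letters)
    counts = Counter(letters)

    # Relative frequency of each letter A..Z (0.0 if no letters were found).
    freqs = {
        c: (counts[c] / total if total else 0.0)
        for c in string.ascii_uppercase
    }

    # Fraction of recognized letters that are DNA nucleotide symbols.
    dna_count = sum(counts[c] for c in "ACGT")
    dna_fraction = dna_count / total if total else 0.0

    return {"n_letters": total, "freqs": freqs, "dna_fraction": dna_fraction}


if __name__ == "__main__":
    sequence_like = char_distribution_features("ACGT acgt TTGA")
    caption_like = char_distribution_features("Figure 1 shows the")
    print(sequence_like["dna_fraction"], caption_like["dna_fraction"])
```

Under these assumptions, text recognized from a sequence-alignment image would yield a `dna_fraction` close to 1, while ordinary caption or label text would yield a much lower value. A feature vector of this kind could then be passed to any standard classifier.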
