
Collaborating Authors

 Velupillai, Sumithra


Sample Size in Natural Language Processing within Healthcare Research

arXiv.org Artificial Intelligence

Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper addresses the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions, and were repeated for a comparatively less common diagnosis code in the database: diabetes mellitus without mention of complication. Smaller sample sizes yielded better results with a K-nearest neighbours classifier, whereas larger sample sizes yielded better results with support vector machines and BERT models. Overall, a sample size larger than 1000 was sufficient to provide acceptable performance. The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be adapted for sample size calculations with other datasets.
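The simulation described above can be sketched in miniature: train classifiers on samples of varying size and class proportion and record a performance metric for each combination. This is a minimal illustration with synthetic "clinical notes", not the paper's actual pipeline; MIMIC-III requires credentialed access, and the corpus generator and label scheme here are purely illustrative.

```python
# Minimal sketch of a sample-size simulation for text classification.
# The synthetic corpus below stands in for real clinical notes.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

random.seed(0)

def make_corpus(n, positive_fraction=0.5):
    """Generate toy notes; positives mention hypertension (label 1)."""
    docs, labels = [], []
    for _ in range(n):
        if random.random() < positive_fraction:
            docs.append("patient has essential hypertension elevated blood pressure")
            labels.append(1)
        else:
            docs.append("patient reports no cardiac complaints stable vitals")
            labels.append(0)
    return docs, labels

def evaluate(clf, sample_size, positive_fraction):
    """Train on a sample of the given size/proportion; return held-out F1."""
    docs, y = make_corpus(sample_size, positive_fraction)
    X = TfidfVectorizer().fit_transform(docs)  # fit on all docs for brevity
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))

# Sweep sample sizes and classifiers, as in the simulations described above.
for n in (100, 1000):
    for clf in (KNeighborsClassifier(), LinearSVC()):
        print(n, type(clf).__name__, round(evaluate(clf, n, 0.3), 2))
```

Repeating such a sweep over a grid of sizes and class proportions (and averaging over random seeds) yields the kind of curves from which sample-size recommendations can be read off.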


Development of a Knowledge Graph Embeddings Model for Pain

arXiv.org Artificial Intelligence

Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
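The idea of embedding (subject, relation, object) triples and scoring candidate links can be sketched with a toy translation-based (TransE-style) model, in which a plausible triple has a small distance between head + relation and tail. The entities, relations, and triples below are illustrative placeholders, not real SNOMED CT content, and this is not necessarily the model the paper evaluates.

```python
# Toy TransE-style knowledge graph embedding with numpy.
# Plausible triples should score higher (smaller ||h + r - t||).
import numpy as np

rng = np.random.default_rng(0)

entities = ["pain", "headache", "paracetamol", "migraine"]
relations = ["is_a", "relieved_by"]
triples = [("headache", "is_a", "pain"),
           ("migraine", "is_a", "headache"),
           ("headache", "relieved_by", "paracetamol")]

e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: i for i, r in enumerate(relations)}
dim = 8
E = rng.normal(scale=0.1, size=(len(entities), dim))
R = rng.normal(scale=0.1, size=(len(relations), dim))

def score(h, r, t):
    """TransE score: higher (closer to 0) means more plausible."""
    return -np.linalg.norm(E[e_idx[h]] + R[r_idx[r]] - E[e_idx[t]])

# SGD on a margin loss, using randomly corrupted tails as negatives.
lr, margin = 0.05, 1.0
for _ in range(500):
    for h, r, t in triples:
        t_neg = entities[rng.integers(len(entities))]
        if t_neg == t:
            continue
        d_pos = E[e_idx[h]] + R[r_idx[r]] - E[e_idx[t]]
        d_neg = E[e_idx[h]] + R[r_idx[r]] - E[e_idx[t_neg]]
        if margin + np.linalg.norm(d_pos) - np.linalg.norm(d_neg) > 0:
            g_pos = d_pos / (np.linalg.norm(d_pos) + 1e-9)
            g_neg = d_neg / (np.linalg.norm(d_neg) + 1e-9)
            E[e_idx[h]] -= lr * (g_pos - g_neg)
            R[r_idx[r]] -= lr * (g_pos - g_neg)
            E[e_idx[t]] += lr * g_pos
            E[e_idx[t_neg]] -= lr * g_neg

# Subject-object link prediction: rank candidate tails for a query triple.
ranking = sorted(entities, key=lambda t: -score("headache", "relieved_by", t))
print(ranking)
```

Link prediction is then evaluated by ranking all candidate objects for each held-out (subject, relation, ?) query and reporting metrics such as hits@k or mean reciprocal rank.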


Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach

arXiv.org Artificial Intelligence

Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source to study this overlap. However, much information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem due to their ambiguity. This project uses data from an anonymised mental health electronic health records database. The data are used to train a machine learning based classification algorithm to classify sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases, and the use of such outputs for further studies on pain and mental health. 1,985 documents were manually triple-annotated to create gold-standard training data, which was used to train three commonly used classification algorithms. The best performing model achieved an F1-score of 0.98 (95% CI 0.98-0.99).
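The workflow described above, resolving three annotators' labels by majority vote and then training a sentence classifier on the resulting gold standard, can be sketched as follows. The sentences, annotations, and choice of logistic regression here are illustrative assumptions, not the paper's actual data or models.

```python
# Sketch: majority vote over triple annotation, then a sentence classifier.
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical triple-annotated sentences (label 1 = discusses patient pain).
sentences = ["complains of severe back pain",
             "pain in left shoulder worsening",
             "reports chronic headache and pain",
             "no acute distress noted today",
             "mood stable sleeping well",
             "attended appointment with family"] * 5
annotations = [(1, 1, 1), (1, 1, 0), (1, 1, 1),
               (0, 0, 0), (0, 1, 0), (0, 0, 0)] * 5

def majority(votes):
    """Resolve disagreement between annotators by majority vote."""
    return Counter(votes).most_common(1)[0][0]

y = [majority(v) for v in annotations]
X = TfidfVectorizer().fit_transform(sentences)

clf = LogisticRegression().fit(X, y)
print(f1_score(y, clf.predict(X)))  # training-set F1 on toy data only
```

In practice the F1-score would of course be estimated on held-out data (with a confidence interval, as in the abstract), not on the training set.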


Is De-identification of Electronic Health Records Possible? OR Can We Use Health Record Corpora for Research?

AAAI Conferences

Today an immense volume of electronic health records (EHRs) is being produced. These health records contain abundant information, in the form of both structured and unstructured data. It is estimated that EHRs contain on average around 60 percent structured information, and 40 percent unstructured information that is mostly free text (Dalianis et al., 2009). A modern health record is very complex and contains a large and diverse amount of data, such as the patient’s chief complaints, diagnoses and treatment, and very often an epicrisis, or discharge letter, together with ICD-10 codes (ICD-10, 2009). Moreover, the health record also contains information about the patient’s gender, age, times of health care visits, medication, measured values, general condition as well as social situation, drinking and eating habits. Much of this information is written in natural language. All this information in a health record is currently almost never re-used, in particular the parts that are written in free text. We believe that the information contained in EHR data sets is an invaluable source for the development and evaluation of a number of applications, useful both for research purposes and for health practitioners. For instance, text mining tools for finding new or hidden relations between diagnoses/treatments and social situation, age and gender could be very useful for epidemiological or medical researchers. Moreover, information concerning the health process over time, per patient, clinic or hospital, can be extracted and used for further research. Another application is the use of this data as input for simulation of the health process and for future health needs. Also, the use of such huge health record databases as corpora for the generation of generalized synonyms from specialized medical terminology constitutes another exciting application.
We can also foresee a text summarization system applied to an individual patient’s health record, one that draws on knowledge from all text records and conveys the information in the health record at a level appropriate to the specific patient. The data can also be used for developing methods that give clinicians automatic assistance in their daily work, proposing ICD-10 codes for symptoms or diagnoses, or validating already manually assigned ICD-10 codes.