Bozkurt, Selen
HILGEN: Hierarchically-Informed Data Generation for Biomedical NER Using Knowledgebases and Large Language Models
Ge, Yao, Guo, Yuting, Das, Sudeshna, Rajwal, Swati, Bozkurt, Selen, Sarker, Abeed
We present HILGEN, a Hierarchically-Informed Data Generation approach that combines domain knowledge from the Unified Medical Language System (UMLS) with synthetic data generated by large language models (LLMs), specifically GPT-3.5. Our approach leverages UMLS's hierarchical structure to expand training data with related concepts, while incorporating contextual information from LLMs through targeted prompts aimed at automatically generating synthetic examples for sparsely occurring named entities. The performance of the HILGEN approach was evaluated across four biomedical NER datasets (MIMIC III, BC5CDR, NCBI-Disease, and Med-Mentions) using BERT-Large and DANN (Data Augmentation with Nearest Neighbor Classifier) models, applying various data generation strategies, including UMLS, GPT-3.5, and their best ensemble. For the BERT-Large model, incorporating UMLS led to an average F1 score improvement of 40.36%, while using GPT-3.5 resulted in a comparable average increase of 40.52%. The Best-Ensemble approach using BERT-Large achieved the highest improvement, with an average increase of 42.29%. For the DANN model, the F1 score improved by 22.74% on average with the UMLS-only approach and by 21.53% with the GPT-3.5-based method, while the Best-Ensemble DANN model showed a larger average increase of 25.03%. Our proposed HILGEN approach improves NER performance in few-shot settings without requiring additional manually annotated data. Our experiments demonstrate that an effective strategy for optimizing biomedical NER is to combine previously curated biomedical knowledge, such as the UMLS, with generative LLMs to create synthetic training instances. Our future research will explore additional synthetic data generation strategies for further improving NER performance.
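The hierarchy-based expansion described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: `TOY_HIERARCHY` is a hypothetical hand-written stand-in for UMLS parent-child relations, and all function names are assumptions made for illustration. The sketch swaps an annotated entity for its children and siblings in the hierarchy, producing new BIO-tagged training sentences.

```python
from typing import Dict, List, Tuple

# Hypothetical toy hierarchy (parent -> children); a stand-in for UMLS relations.
TOY_HIERARCHY: Dict[str, List[str]] = {
    "diabetes mellitus": ["type 1 diabetes", "type 2 diabetes"],
}


def related_concepts(term: str, hierarchy: Dict[str, List[str]]) -> List[str]:
    """Return the children of `term` plus its siblings (other children of its parent)."""
    related = list(hierarchy.get(term, []))
    for children in hierarchy.values():
        if term in children:
            related.extend(c for c in children if c != term)
    return related


def bio_tag(before: List[str], entity: str, after: List[str],
            label: str) -> Tuple[List[str], List[str]]:
    """Build tokens and BIO tags for one sentence containing one entity."""
    ent_tokens = entity.split()
    tokens = before + ent_tokens + after
    tags = (["O"] * len(before)
            + [f"B-{label}"] + [f"I-{label}"] * (len(ent_tokens) - 1)
            + ["O"] * len(after))
    return tokens, tags


def expand_example(before: List[str], entity: str, after: List[str], label: str,
                   hierarchy: Dict[str, List[str]]) -> List[Tuple[List[str], List[str]]]:
    """Create one synthetic sentence per concept related to `entity`."""
    return [bio_tag(before, rel, after, label)
            for rel in related_concepts(entity, hierarchy)]


if __name__ == "__main__":
    for tokens, tags in expand_example(["Patient", "has"], "type 1 diabetes", ["."],
                                       "Disease", TOY_HIERARCHY):
        print(list(zip(tokens, tags)))
```

In practice the related concepts would come from a knowledgebase lookup rather than a hard-coded dictionary, and LLM-generated sentences would supply new contexts in addition to new entity mentions.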
Cerebral microbleeds: Association with cognitive decline and pathology build-up
Rathore, Saima, Chaudhary, Jatin, Tong, Boning, Bozkurt, Selen
Cerebral microbleeds, markers of brain damage from vascular and amyloid pathologies, are linked to cognitive decline in aging, but their role in Alzheimer's disease (AD) onset and progression remains unclear. This study aimed to explore whether the presence and location of lobar microbleeds are associated with amyloid-$\beta$ (A$\beta$)-PET, tau tangle formation (tau-PET), and longitudinal cognitive decline. We analyzed 1,573 ADNI participants with MR imaging data and information on the number and location of microbleeds. Associations between lobar microbleeds and pathology, cerebrospinal fluid (CSF), genetics, and cognition were examined, focusing on regional microbleeds and domain-specific cognitive decline using ordinary least-squares regression while adjusting for covariates. Cognitive decline was assessed with ADAS-Cog11 and its domain-specific sub-scores. Participants underwent neuropsychological testing at least twice, with a minimum two-year interval between assessments. Among the 1,573 participants (692 women, mean age 71.23 years), 373 participants had microbleeds. The presence of microbleeds was linked to cognitive decline, particularly in the semantic, language, and praxis domains for those with temporal lobe microbleeds. Microbleeds in the overall cortex were associated with language decline. Pathologically, temporal lobe microbleeds were associated with increased tau in the overall cortex, while cortical microbleeds were linked to elevated A$\beta$ in the temporal, parietal, and frontal regions. In this mixed population, microbleeds were connected to longitudinal cognitive decline, especially in semantic and language domains, and were associated with higher baseline A$\beta$ and tau pathology. These findings suggest that lobar microbleeds should be included in AD diagnostic and prognostic evaluations.
Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data
Das, Sudeshna, Ge, Yao, Guo, Yuting, Rajwal, Swati, Hairston, JaMor, Powell, Jeanne, Walker, Drew, Peddireddy, Snigdha, Lakamana, Sahithi, Bozkurt, Selen, Reyna, Matthew, Sameni, Reza, Xiao, Yunyu, Kim, Sangmi, Chandler, Rasheeta, Hernandez, Natalie, Mowery, Danielle, Wightman, Rachel, Love, Jennifer, Spadaro, Anthony, Perrone, Jeanmarie, Sarker, Abeed
Retrieval-augmented generation (RAG) constrains generative model outputs and mitigates the possibility of hallucination by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, which limits the volume of knowledge from which an answer can be generated. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof of concept of this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource-constrained settings, enabling researchers to obtain near real-time data from users.
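One way to picture a two-layer arrangement that works around a finite context window is sketched below. The abstract does not specify the layers, so this is an assumed design: the first layer answers the query over small groups of retrieved posts, and the second layer combines those partial answers. The keyword-overlap retriever, the `stub_llm` placeholder, and all parameter names are hypothetical; a real system would call an actual LLM and a proper retriever.

```python
from typing import Callable, List


def retrieve(query: str, docs: List[str], k: int) -> List[str]:
    """Rank documents by word overlap with the query and keep the top k with any overlap."""
    q = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(q & set(doc.lower().split()))

    ranked = sorted(docs, key=overlap, reverse=True)
    return [d for d in ranked[:k] if overlap(d) > 0]


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def two_layer_answer(query: str, docs: List[str],
                     llm: Callable[[str, List[str]], str],
                     k: int = 6, chunk_size: int = 2) -> str:
    """Layer 1: answer per small group of posts. Layer 2: merge the partial answers."""
    hits = retrieve(query, docs, k)
    partials = [llm(query, group) for group in chunk(hits, chunk_size)]  # layer 1
    return llm(query, partials)                                          # layer 2


def stub_llm(query: str, contexts: List[str]) -> str:
    """Placeholder for an LLM call: simply joins the contexts it was given."""
    return " | ".join(contexts)


if __name__ == "__main__":
    posts = ["xylazine causes wounds", "weather is nice",
             "xylazine withdrawal is hard", "naloxone and xylazine"]
    print(two_layer_answer("xylazine effects", posts, stub_llm))
```

Because each call sees only `chunk_size` posts or a handful of partial answers, no single prompt has to hold the full set of retrieved text.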
Social Media as a Sensor: Analyzing Twitter Data for Breast Cancer Medication Effects Using Natural Language Processing
Kobara, Seibi, Rafiei, Alireza, Nateghi, Masoud, Bozkurt, Selen, Kamaleswaran, Rishikesan, Sarker, Abeed
Breast cancer is a significant public health concern and is the leading cause of cancer-related deaths among women. Despite advances in breast cancer treatments, medication non-adherence remains a major problem. As electronic health records do not typically capture patient-reported outcomes that may reveal information about medication-related experiences, social media presents an attractive resource for enhancing our understanding of patients' treatment experiences. In this paper, we developed natural language processing (NLP) based methodologies to study information posted by an automatically curated breast cancer cohort from social media. We employed a transformer-based classifier to identify breast cancer patients/survivors on X (Twitter) based on their self-reported information, and we collected longitudinal data from their profiles. We then designed a multi-layer rule-based model to develop a breast cancer therapy-associated side effect lexicon and detect patterns of medication usage and associated side effects among breast cancer patients. A total of 1,454,637 posts were available from 583,962 unique users, of which 62,042 were detected as breast cancer cohort members using our transformer-based model. Of these, 198 cohort members mentioned breast cancer medications, with tamoxifen being the most common. Our side effect lexicon identified well-known side effects of hormone therapy and chemotherapy. Furthermore, it revealed subjective feelings towards cancer and medications, which may suggest a pre-clinical phase of side effects or emotional distress. This analysis highlighted not only the utility of NLP techniques for identifying self-reported breast cancer posts, medication usage patterns, and treatment side effects in unstructured social media data, but also the richness of social media data for such clinical questions.