Chen, Helen
Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Hu, Bing, Saragadam, Ashish, Layton, Anita, Chen, Helen
There is a growing trend towards leveraging artificial intelligence (AI) in every stage of drug development [1]. Drug development is an expensive process: it costs $2-3 billion and 13-15 years to bring a single drug to market. Drug discovery AI, by enabling the high-throughput screening (HTS) of ligand candidates, is poised to reduce the developmental costs of drugs by revolutionizing how ligands are designed and tested [2]. Drug development AI has found great initial success in areas such as polypharmacy [3], drug repurposing [4, 5], drug-target interaction [6], drug response prediction [7], and the search for new antibiotics [8]. Equally important to advances in AI for drug discovery are corresponding improvements in the public data available for training and testing these models [9, 10, 11]. Only through parallel strides in the development and refinement of drug discovery data, and in the application of advanced AI models to that data, do breakthroughs happen for AI-based drug discovery methods. Huang et al. [9] noted three key challenges that drug discovery data must overcome to attract ML scientists to therapeutics: (1) a lack of AI-ready datasets and standardized knowledge representations; (2) datasets scattered across bio-repositories without curation; and (3) a lack of data focused on rare diseases and novel drugs in development. We posit another data challenge that slows the advancement of drug discovery AI: datasets are often collected independently, often with little overlap, creating data sparsity. Data sparsity poses difficulties for researchers seeking to answer research questions that require data values spread across multiple different datasets.
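To make the data-sparsity point concrete, the toy sketch below (hypothetical ligand IDs and assay columns, not taken from the paper) merges two independently collected datasets that share only one ligand and measures how many cells end up empty after the join.

```python
import pandas as pd

# Hypothetical assay tables collected by two independent groups;
# they overlap on only a single ligand ("L3").
assay_a = pd.DataFrame({
    "ligand": ["L1", "L2", "L3", "L4"],
    "ic50_kinase_a": [0.12, 3.4, 0.98, 12.0],   # potency measurements
})
assay_b = pd.DataFrame({
    "ligand": ["L3", "L5", "L6"],
    "logp": [2.1, -0.4, 3.3],                   # lipophilicity measurements
})

# Outer merge keeps every ligand from both tables, so non-overlapping
# ligands produce missing values in the other table's columns.
merged = assay_a.merge(assay_b, on="ligand", how="outer")
sparsity = merged.drop(columns="ligand").isna().mean().mean()

print(merged)
print(f"Fraction of missing values after merging: {sparsity:.2f}")
```

With so little overlap, most cells in the merged table are missing, which is exactly the situation that synthetic data generation aims to alleviate.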
Proposing a conceptual framework: social media listening for public health behavior
Tsao, Shu-Feng, Chen, Helen, Meyer, Samantha, Butt, Zahid A.
Existing communications and behavioral theories have been adopted to address health misinformation. Although various theories and models have been used to investigate the COVID-19 pandemic, there is no framework specifically designed for social listening or misinformation studies that use social media data and natural language processing techniques. This study aimed to propose a novel yet theory-based conceptual framework for misinformation research. We collected theories and models used in COVID-19-related studies published in peer-reviewed journals. The theories and models ranged from health behaviors and communications to misinformation. They were analyzed and critiqued for their components, and a conceptual framework was then proposed and demonstrated. We reviewed the Health Belief Model, Theory of Planned Behavior/Reasoned Action, Communication for Behavioral Impact, Transtheoretical Model, Uses and Gratifications Theory, Social Judgment Theory, Risk Information Seeking and Processing Model, Behavioral and Social Drivers, and Hype Loop. Accordingly, we proposed the Social Media Listening for Public Health Behavior Conceptual Framework, not only integrating important attributes of existing theories but also adding new attributes. The proposed conceptual framework was demonstrated in a Freedom Convoy social media listening case. The proposed conceptual framework can be used to better understand public discourse on social media, and it can be integrated with other data analyses to form a more comprehensive picture. The framework will continue to be revised and adapted as health misinformation evolves.
UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus
Michalopoulos, George, Wang, Yuanxin, Kaka, Hussam, Chen, Helen, Wong, Alex
Contextual word embedding models, such as BioBERT and Bio_ClinicalBERT, have achieved state-of-the-art results in biomedical natural language processing tasks by focusing their pre-training process on domain-specific corpora. However, such models do not take expert domain knowledge into consideration. In this work, we introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process via a novel knowledge augmentation strategy. More specifically, the augmentation of UmlsBERT with the Unified Medical Language System (UMLS) Metathesaurus was performed in two ways: (i) connecting words that share the same underlying 'concept' in UMLS, and (ii) leveraging semantic group knowledge in UMLS to create clinically meaningful input embeddings. By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models on common named-entity recognition (NER) and clinical natural language inference (NLI) tasks.
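A minimal sketch of the second augmentation idea (semantic-group-aware input embeddings) is shown below. This is not the official UmlsBERT implementation: the module name, embedding sizes, the number of semantic groups, and the zero-vector handling for tokens without a UMLS mapping are all illustrative assumptions, used only to show how an extra embedding lookup can be summed into standard BERT-style input embeddings.

```python
import torch
import torch.nn as nn

class SemanticGroupAugmentedEmbeddings(nn.Module):
    """Illustrative sketch (not the paper's code): add a UMLS semantic-group
    embedding to the usual token + position + segment embedding sum."""

    def __init__(self, vocab_size=30522, hidden_size=768, max_position=512,
                 type_vocab_size=2, num_semantic_groups=15):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        # Extra lookup: one vector per UMLS semantic group; index 0 is reserved
        # for tokens with no UMLS concept, so they contribute a zero vector.
        self.semantic_group_embeddings = nn.Embedding(
            num_semantic_groups + 1, hidden_size, padding_idx=0)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids, semantic_group_ids):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        embeddings = (self.word_embeddings(input_ids)
                      + self.position_embeddings(position_ids)
                      + self.token_type_embeddings(token_type_ids)
                      + self.semantic_group_embeddings(semantic_group_ids))
        return self.dropout(self.layer_norm(embeddings))

# Toy usage: batch of 1, sequence length 4; the middle tokens are assumed to
# map to semantic group 3, the special tokens to no group (index 0).
emb = SemanticGroupAugmentedEmbeddings()
ids = torch.tensor([[101, 2023, 4095, 102]])
out = emb(ids, torch.zeros_like(ids), torch.tensor([[0, 3, 3, 0]]))
print(out.shape)  # torch.Size([1, 4, 768])
```

The design point this sketch conveys is that the semantic-group signal is injected before the transformer layers, so every downstream attention layer sees clinically informed input representations rather than purely lexical ones.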