SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies
Yu, Zehao; Yang, Xi; Dang, Chong; Adekkanattu, Prakash; Patra, Braja Gopal; Peng, Yifan; Pathak, Jyotishman; Wilson, Debbie L.; Chang, Ching-Yuan; Lo-Ciganic, Wei-Hsuan; George, Thomas J.; Hogan, William R.; Guo, Yi; Bian, Jiang; Wu, Yonghui
arXiv.org Artificial Intelligence
Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH using cancer populations.
Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models to extract SDoH, examined the generalizability of the NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts.
Results and Conclusion: We developed a corpus of 629 cancer patients' notes with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores: 0.9216/0.9441 for SDoH concept extraction and 0.9617/0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
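The abstract reports strict and lenient F1 scores without defining them. The sketch below illustrates one common convention for concept extraction and is an assumption, not the SODA package's own evaluation code: a strict match requires identical span boundaries and category, while a lenient match only requires overlapping boundaries with the same category. The `Span` type, the function names, and the example category labels are hypothetical.

```python
from typing import List, Tuple

# Hypothetical span representation: (start offset, exclusive end offset, category).
Span = Tuple[int, int, str]


def spans_overlap(a: Span, b: Span) -> bool:
    """True when two spans share a category and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    """F1 over extracted spans under strict (exact) or lenient (overlap) matching."""
    if lenient:
        # A prediction counts if it overlaps any gold span, and vice versa.
        matched_pred = sum(any(spans_overlap(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(spans_overlap(g, p) for p in pred) for g in gold)
    else:
        # Strict matching: boundaries and category must be identical.
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) annotations for a single note.
gold_spans = [(0, 5, "Smoking"), (10, 20, "Employment"), (30, 35, "Education")]
pred_spans = [(0, 5, "Smoking"), (12, 20, "Employment"), (40, 45, "Alcohol")]

strict_f1 = span_f1(gold_spans, pred_spans, lenient=False)
lenient_f1 = span_f1(gold_spans, pred_spans, lenient=True)
```

Under this convention the lenient score can never be lower than the strict score, which is consistent with the ordering of each strict/lenient pair reported above.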
May-18-2023
- Country:
- North America > United States > Florida > Alachua County > Gainesville (0.28)
- Genre:
- Research Report > Observational Study (0.34)
- Industry:
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Addiction Disorder (1.00)
- Technology: