DisEmbed: Transforming Disease Understanding through Embeddings

Faroz, Salman

arXiv.org Artificial Intelligence 

Many existing models, such as ClinicalBERT and BioBERT, struggle to represent individual diseases because they generalize broadly across the medical domain. While these models perform well in general healthcare contexts, they often fail to capture the nuanced relationships between specific diseases and their symptoms, and so fall short in use cases such as clinical decision support, disease diagnosis systems, and symptom-based disease categorization. They can identify that a given text is medical, but they often cannot tell whether the entities in it are directly related. For instance, "brain surgery" and "Parkinson's disease" are both medical terms, so a general or medical-domain model may assign them a high cosine similarity simply because it treats both as medical concepts, even though they are unrelated. To address this gap, I have curated a synthetic dataset focused solely on diseases, in which the descriptions and symptoms are not explicitly labeled with symptom names. This forces the model to learn deeper, more precise associations rather than rely on superficial medical terminology. Although there is already an inherent understanding of the correlations between symptoms and diseases, this approach promotes a more focused and accurate understanding of each disease.
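
The cosine-similarity failure mode described above can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical toy values chosen by hand, not embeddings produced by ClinicalBERT, BioBERT, or DisEmbed; they only show how a shared "medical" direction can dominate the score for unrelated terms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption: axis 0 stands for a generic
# "medical text" signal, axis 1 for disease-specific content).
brain_surgery = [0.9, 0.1, 0.3]
parkinsons_general = [0.8, 0.3, 0.2]  # dominated by the generic medical axis
parkinsons_focused = [0.1, 0.9, 0.2]  # dominated by the disease-specific axis

sim_general = cosine_similarity(brain_surgery, parkinsons_general)
sim_focused = cosine_similarity(brain_surgery, parkinsons_focused)

print(f"general-style vectors: {sim_general:.2f}")
print(f"focused-style vectors: {sim_focused:.2f}")
```

In this toy setting, the pair that shares only the generic medical axis scores much higher than the pair whose second vector is concentrated on disease-specific content, which is the kind of spurious association a disease-focused embedding aims to avoid.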