Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit

Kraljevic, Zeljko, Searle, Thomas, Shek, Anthony, Roguski, Lukasz, Noor, Kawsar, Bean, Daniel, Mascio, Aurelie, Zhu, Leilei, Folarin, Amos A, Roberts, Angus, Bendayan, Rebecca, Richardson, Mark P, Stewart, Robert, Shah, Anoop D, Wong, Wai Keong, Ibrahim, Zina, Teo, James T, Dobson, Richard JB

Oct-2-2020–arXiv.org Artificial Intelligence

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customizing and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1 0.467-0.791 vs 0.384-0.691). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 >0.94) between hospitals, datasets and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
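The F1 figures quoted above are harmonic means of precision and recall. As a minimal sketch of how such a score could be computed for concept extraction, the snippet below compares predicted annotations against gold annotations. It assumes exact matching on span offsets and concept ID, which may differ from the matching criteria used in the paper; the `f1_score` helper, the `Annotation` type, and the concept IDs are hypothetical and are not part of MedCAT.

```python
from typing import Set, Tuple

# An annotation is (start_offset, end_offset, concept_id). The IDs below are
# made-up placeholders, not real UMLS or SNOMED-CT codes.
Annotation = Tuple[int, int, str]


def f1_score(predicted: Set[Annotation], gold: Set[Annotation]) -> float:
    """Exact-match F1: the harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data: two predictions match the gold set, one has the wrong concept ID,
# and one gold annotation is missed entirely.
gold = {(0, 8, "C001"), (15, 27, "C002"), (40, 52, "C003"), (60, 66, "C004")}
predicted = {(0, 8, "C001"), (15, 27, "C002"), (40, 52, "C999")}

score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

In this toy case precision is 2/3 and recall is 2/4, so a prediction with the right span but the wrong concept counts against both.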

disorder, machine learning, natural language

Country:
- Europe
  - Denmark (0.04)
  - United Kingdom
    - Northern Ireland (0.04)
    - England
      - Greater London > London (0.14)
      - Oxfordshire (0.04)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.94)

Industry:
- Health & Medicine
  - Health Care Technology > Medical Record (1.00)
  - Health Care Providers & Services (1.00)
  - Therapeutic Area
    - Neurology (1.00)
    - Cardiology/Vascular Diseases (1.00)
    - Infections and Infectious Diseases (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
