Estimating Redundancy in Clinical Text

Searle, Thomas, Ibrahim, Zina, Teo, James, Dobson, Richard JB

Oct-26-2021–arXiv.org Artificial Intelligence

The current mode of use of Electronic Health Record (EHR) elicits text redundancy. Clinicians often populate new documents by duplicating existing notes, then updating accordingly. Data duplication can lead to a propagation of errors, inconsistencies and misreporting of care. Therefore, quantifying information redundancy can play an essential role in evaluating innovations that operate on clinical narratives. This work is a quantitative examination of information redundancy in EHR notes. We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model. We evaluate the measures by training large Transformer-based language models using clinical text from a large openly available US-based ICU dataset and a large multi-site UK based Trust. By comparing the information-theoretic content of the trained models with open-domain language models, the language models trained using clinical text have shown ~1.5x to ~3x less efficient than open-domain corpora. Manual evaluation shows a high correlation with lexicosyntactic and semantic redundancy, with averages ~43 to ~65%.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

Oct-26-2021

arXiv.org PDF

Add feedback

Country:
- Europe > United Kingdom (0.88)
- North America > United States
  - Minnesota (0.28)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Health & Medicine
  - Health Care Technology > Medical Record (1.00)
  - Therapeutic Area (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.89)
  - Natural Language > Text Processing (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found