Medical Vision Language Pretraining: A survey

Shrestha, Prashant, Amgain, Sanskar, Khanal, Bidur, Linte, Cristian A., Bhattarai, Binod

arXiv.org Artificial Intelligence 

Abstract--Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired or unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations. Such pretrained models have the potential to enhance multiple downstream medical tasks simultaneously, reducing the dependency on labeled data. However, despite recent progress and its potential, no comprehensive survey has yet explored the various aspects and advancements in medical VLP. In this paper, we review existing works through the lens of different pretraining objectives, architectures, downstream evaluation tasks, and the datasets utilized for pretraining and downstream tasks. We then delve into current challenges in medical VLP, discussing existing and potential solutions, and conclude by highlighting future directions. To the best of our knowledge, this is the first survey focused on medical VLP.

Figure 1: Various aspects of Medical Vision Language Pretraining (VLP) discussed in this paper.

Data-driven artificial intelligence (AI) has undergone rapid advancement in recent years, bringing transformative changes to various domains, including computer vision and natural language processing [1]-[5]. The availability of large-scale data has played a pivotal role in driving this progress. With the growing availability of multimodal data, AI is no longer confined to single-modality systems; instead, there has been a notable shift towards multimodal learning [6]-. These multimodal datasets can play a crucial role in training large-scale, generalized AI models. Similar trends are quickly emerging, even within the medical domain [10]-[13].

Often, medical experts rely on information from multiple modalities for diagnostic decision-making. For instance, physicians consider various factors, including medical images, blood test results, and sensor data, to recommend treatments.

In recent years, self-supervised learning has become a [...]. There is a particular emphasis on vision-language models in both the general domain [9], [20], [21] and the medical domain [22]-[27], given that vision and language are two key data modalities. By employing [...]
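In many of the vision-language pretraining works the survey covers, the self-supervised objective over paired image-text data is a CLIP-style contrastive loss: embeddings of matched image-report pairs are pulled together while mismatched pairs in the batch are pushed apart. The sketch below is a minimal, framework-free illustration of that idea; the function name, toy dimensions, and temperature value are illustrative assumptions, not details from this paper.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (N, N); matched pairs on the diagonal
    labels = np.arange(len(img))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With identical (perfectly matched) embeddings the loss is near zero, while shuffling the text side so the pairing is broken drives it up, which is what makes the objective usable as a label-free pretraining signal.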