Shrestha, Prashant
Surgical Vision World Model
Koju, Saurabh, Bastola, Saurav, Shrestha, Prashant, Amgain, Sanskar, Shrestha, Yash Raj, Poudel, Rudra P. K., Bhattarai, Binod
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations and lack realism. Furthermore, the existing world-model literature has predominantly dealt with action-labeled data, limiting its applicability to real-world surgical data, where obtaining action annotations is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data, and its architecture design is validated through extensive experiments on the unlabeled SurgToolLoc-2022 dataset.
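A minimal sketch of the latent-action idea referenced above, assuming placeholder names, embedding dimensions, and a Genie-style design rather than the paper's actual architecture: an inverse-dynamics head maps each pair of consecutive frame embeddings to the nearest entry of a small discrete codebook (the inferred latent action), and a dynamics head predicts the next frame embedding conditioned on that action.

    import torch
    import torch.nn as nn

    class LatentActionModel(nn.Module):
        # Illustrative module; frame_dim and num_actions are assumptions.
        def __init__(self, frame_dim=256, num_actions=8):
            super().__init__()
            # Inverse dynamics: embed the (frame_t, frame_t+1) transition.
            self.inverse = nn.Sequential(
                nn.Linear(2 * frame_dim, frame_dim), nn.ReLU(),
                nn.Linear(frame_dim, frame_dim),
            )
            # Small discrete codebook of latent actions (vector-quantization style).
            self.codebook = nn.Embedding(num_actions, frame_dim)
            # Forward dynamics: predict the next frame embedding from the
            # current embedding and the quantized latent action.
            self.dynamics = nn.Sequential(
                nn.Linear(2 * frame_dim, frame_dim), nn.ReLU(),
                nn.Linear(frame_dim, frame_dim),
            )

        def forward(self, z_t, z_next):
            a = self.inverse(torch.cat([z_t, z_next], dim=-1))   # (B, D) transition embedding
            dists = torch.cdist(a, self.codebook.weight)         # (B, num_actions) distances to codes
            action_idx = dists.argmin(dim=-1)                    # inferred discrete latent action
            a_q = self.codebook(action_idx)
            a_q = a + (a_q - a).detach()                         # straight-through estimator
            z_pred = self.dynamics(torch.cat([z_t, a_q], dim=-1))
            return z_pred, action_idx

Training would minimize a reconstruction loss between z_pred and z_next (plus a codebook commitment term); at generation time, the discrete action index can instead be supplied by a user to control the rollout.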
Investigating the Robustness of Vision Transformers against Label Noise in Medical Image Classification
Khanal, Bidur, Shrestha, Prashant, Amgain, Sanskar, Khanal, Bishesh, Bhattarai, Binod, Linte, Cristian A.
Label noise in medical image classification datasets significantly hampers the training of supervised deep learning methods, undermining their generalizability: the test performance of a model tends to decrease as the label noise rate increases. Over recent years, several methods have been proposed to mitigate the impact of label noise in medical image classification and enhance model robustness. Predominantly, these works have employed CNN-based architectures as the backbone of their classifiers for feature extraction. In recent years, however, Vision Transformer (ViT)-based backbones have replaced CNNs, demonstrating improved performance and a greater ability to learn generalizable features, especially when the dataset is large. Nevertheless, no prior work has rigorously investigated how transformer-based backbones handle the impact of label noise in medical image classification. In this paper, we investigate the architectural robustness of ViTs against label noise and compare it to that of CNNs. We use two medical image classification datasets -- COVID-DU-Ex and NCT-CRC-HE-100K -- both corrupted by injecting label noise at various rates. Additionally, we show that pretraining is crucial for ensuring ViT's improved robustness against label noise in supervised training.
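A minimal sketch of the label-corruption protocol described above, assuming symmetric (uniform) label noise and illustrative dataset sizes; this is not the paper's code:

    import numpy as np

    def inject_symmetric_label_noise(labels, noise_rate, num_classes, seed=0):
        # Flip a `noise_rate` fraction of labels to a different class chosen
        # uniformly at random; return corrupted labels and the flipped indices.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels).copy()
        flip_idx = rng.choice(len(labels), size=int(noise_rate * len(labels)), replace=False)
        for i in flip_idx:
            candidates = [c for c in range(num_classes) if c != labels[i]]
            labels[i] = rng.choice(candidates)
        return labels, flip_idx

    # Example: corrupt 40% of labels in a 9-class dataset (NCT-CRC-HE-100K has 9 tissue classes).
    clean = np.random.randint(0, 9, size=1000)
    noisy, flipped = inject_symmetric_label_noise(clean, noise_rate=0.4, num_classes=9)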
Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval
Panta, Love, Shrestha, Prashant, Sapkota, Brabeem, Bhattarai, Amrita, Manandhar, Suresh, Sah, Anand Kumar
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities. Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences. We question whether the same problem also exists in the video-text domain, with the additional need to preserve both spatial and temporal information. We therefore evaluate a recently proposed solution that adds an asymmetric co-attention network for video grounding tasks. Additionally, we incorporate a momentum contrastive loss for robust, discriminative representation learning in both modalities. We find that integrating these supplementary modules yields better performance than state-of-the-art models on the TACoS dataset and comparable results on ActivityNet Captions, all while using significantly fewer parameters than the baseline.
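A minimal sketch of a momentum contrastive (MoCo-style) cross-modal loss, under the assumption of generic video/text encoders, a queue of negative keys, and placeholder dimensions; the loss formulation used in the paper may differ:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def momentum_update(key_encoder, query_encoder, m=0.999):
        # Key-encoder parameters follow the query encoder via an exponential moving average.
        for k_p, q_p in zip(key_encoder.parameters(), query_encoder.parameters()):
            k_p.data.mul_(m).add_(q_p.data, alpha=1.0 - m)

    def cross_modal_infonce(video_q, text_k, queue, temperature=0.07):
        # video_q: (B, D) video queries; text_k: (B, D) momentum-encoded text keys;
        # queue: (K, D) keys from previous batches serving as negatives.
        video_q = F.normalize(video_q, dim=-1)
        text_k = F.normalize(text_k, dim=-1)
        queue = F.normalize(queue, dim=-1)
        pos = torch.einsum("bd,bd->b", video_q, text_k).unsqueeze(-1)  # (B, 1) positive logits
        neg = torch.einsum("bd,kd->bk", video_q, queue)                # (B, K) negative logits
        logits = torch.cat([pos, neg], dim=1) / temperature
        targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, targets)  # the positive key sits at index 0

After each step, the queue is refreshed with the current batch's momentum-encoded keys, and momentum_update keeps the key encoder a slowly moving copy of the query encoder.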
Medical Vision Language Pretraining: A survey
Shrestha, Prashant, Amgain, Sanskar, Khanal, Bidur, Linte, Cristian A., Bhattarai, Binod
Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations. Such pretrained models have the potential to enhance multiple downstream medical tasks simultaneously, reducing the dependency on labeled data. However, despite recent progress and its potential, there is no comprehensive survey paper that has explored the various aspects and advancements in medical VLP. In this paper, we specifically review existing works through the lens of different pretraining objectives, architectures, downstream evaluation tasks, and datasets utilized for pretraining and downstream tasks. Subsequently, we delve into current challenges in medical VLP, discussing existing and potential solutions, and conclude by highlighting future directions. To the best of our knowledge, this is the first survey focused on medical VLP.