Shrestha, Prashant
Surgical Vision World Model
Koju, Saurabh, Bastola, Saurav, Shrestha, Prashant, Amgain, Sanskar, Shrestha, Yash Raj, Poudel, Rudra P. K., Bhattarai, Binod
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations and lack realism. Furthermore, the existing world-model literature has predominantly dealt with action-labeled data, limiting its applicability to real-world surgical data, where obtaining action annotations is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data, and its architecture design is validated through extensive experiments on the unlabeled SurgToolLoc-2022 dataset.
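A minimal sketch of the latent-action idea referenced above, assuming placeholder names, embedding dimensions, and a Genie-style design rather than the paper's actual architecture: an inverse-dynamics head maps each pair of consecutive frame embeddings to the nearest entry of a small discrete codebook (the inferred latent action), and a dynamics head predicts the next frame embedding conditioned on that action.

    import torch
    import torch.nn as nn

    class LatentActionModel(nn.Module):
        # Illustrative module; frame_dim and num_actions are assumptions.
        def __init__(self, frame_dim=256, num_actions=8):
            super().__init__()
            # Inverse dynamics: embed the (frame_t, frame_t+1) transition.
            self.inverse = nn.Sequential(
                nn.Linear(2 * frame_dim, frame_dim), nn.ReLU(),
                nn.Linear(frame_dim, frame_dim),
            )
            # Small discrete codebook of latent actions (vector-quantization style).
            self.codebook = nn.Embedding(num_actions, frame_dim)
            # Forward dynamics: predict the next frame embedding from the
            # current embedding and the quantized latent action.
            self.dynamics = nn.Sequential(
                nn.Linear(2 * frame_dim, frame_dim), nn.ReLU(),
                nn.Linear(frame_dim, frame_dim),
            )

        def forward(self, z_t, z_next):
            a = self.inverse(torch.cat([z_t, z_next], dim=-1))   # (B, D) transition embedding
            dists = torch.cdist(a, self.codebook.weight)         # (B, num_actions) distances to codes
            action_idx = dists.argmin(dim=-1)                    # inferred discrete latent action
            a_q = self.codebook(action_idx)
            a_q = a + (a_q - a).detach()                         # straight-through estimator
            z_pred = self.dynamics(torch.cat([z_t, a_q], dim=-1))
            return z_pred, action_idx

Training would minimize a reconstruction loss between z_pred and z_next (plus a codebook commitment term); at generation time, the discrete action index can instead be supplied by a user to control the rollout.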
Investigating the Robustness of Vision Transformers against Label Noise in Medical Image Classification
Khanal, Bidur, Shrestha, Prashant, Amgain, Sanskar, Khanal, Bishesh, Bhattarai, Binod, Linte, Cristian A.
Label noise in medical image classification datasets significantly hampers the training of supervised deep learning methods, undermining their generalizability: the test performance of a model tends to decrease as the label noise rate increases. Over recent years, several methods have been proposed to mitigate the impact of label noise in medical image classification and enhance model robustness. Predominantly, these works have employed CNN-based architectures as the backbone of their classifiers for feature extraction. In recent years, however, Vision Transformer (ViT)-based backbones have replaced CNNs, demonstrating improved performance and a greater ability to learn generalizable features, especially when the dataset is large. Nevertheless, no prior work has rigorously investigated how transformer-based backbones handle the impact of label noise in medical image classification. In this paper, we investigate the architectural robustness of ViTs against label noise and compare it to that of CNNs. We use two medical image classification datasets -- COVID-DU-Ex and NCT-CRC-HE-100K -- both corrupted by injecting label noise at various rates. Additionally, we show that pretraining is crucial for ensuring ViT's improved robustness against label noise in supervised training.
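A minimal sketch of the label-corruption protocol described above, assuming symmetric (uniform) label noise and illustrative dataset sizes; this is not the paper's code:

    import numpy as np

    def inject_symmetric_label_noise(labels, noise_rate, num_classes, seed=0):
        # Flip a `noise_rate` fraction of labels to a different class chosen
        # uniformly at random; return corrupted labels and the flipped indices.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels).copy()
        flip_idx = rng.choice(len(labels), size=int(noise_rate * len(labels)), replace=False)
        for i in flip_idx:
            candidates = [c for c in range(num_classes) if c != labels[i]]
            labels[i] = rng.choice(candidates)
        return labels, flip_idx

    # Example: corrupt 40% of labels in a 9-class dataset (NCT-CRC-HE-100K has 9 tissue classes).
    clean = np.random.randint(0, 9, size=1000)
    noisy, flipped = inject_symmetric_label_noise(clean, noise_rate=0.4, num_classes=9)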
Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval
Panta, Love, Shrestha, Prashant, Sapkota, Brabeem, Bhattarai, Amrita, Manandhar, Suresh, Sah, Anand Kumar
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities. Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences. We question whether the same problem also exists in the video-text domain, with the additional need to preserve both spatial and temporal information. We therefore evaluate a recently proposed solution that adds an asymmetric co-attention network for video grounding tasks. Additionally, we incorporate a momentum contrastive loss for robust, discriminative representation learning in both modalities. We find that integrating these supplementary modules yields better performance than state-of-the-art models on the TACoS dataset and comparable results on ActivityNet Captions, all while using significantly fewer parameters than the baseline.
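A minimal sketch of a momentum contrastive (MoCo-style) cross-modal loss, under the assumption of generic video/text encoders, a queue of negative keys, and placeholder dimensions; the loss formulation used in the paper may differ:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def momentum_update(key_encoder, query_encoder, m=0.999):
        # Key-encoder parameters follow the query encoder via an exponential moving average.
        for k_p, q_p in zip(key_encoder.parameters(), query_encoder.parameters()):
            k_p.data.mul_(m).add_(q_p.data, alpha=1.0 - m)

    def cross_modal_infonce(video_q, text_k, queue, temperature=0.07):
        # video_q: (B, D) video queries; text_k: (B, D) momentum-encoded text keys;
        # queue: (K, D) keys from previous batches serving as negatives.
        video_q = F.normalize(video_q, dim=-1)
        text_k = F.normalize(text_k, dim=-1)
        queue = F.normalize(queue, dim=-1)
        pos = torch.einsum("bd,bd->b", video_q, text_k).unsqueeze(-1)  # (B, 1) positive logits
        neg = torch.einsum("bd,kd->bk", video_q, queue)                # (B, K) negative logits
        logits = torch.cat([pos, neg], dim=1) / temperature
        targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, targets)  # the positive key sits at index 0

After each step, the queue is refreshed with the current batch's momentum-encoded keys, and momentum_update keeps the key encoder a slowly moving copy of the query encoder.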
Medical Vision Language Pretraining: A survey
Shrestha, Prashant, Amgain, Sanskar, Khanal, Bidur, Linte, Cristian A., Bhattarai, Binod
Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations. Such pretrained models have the potential to enhance multiple downstream medical tasks simultaneously, reducing the dependency on labeled data. However, despite recent progress and its potential, there is no comprehensive survey paper that has explored the various aspects and advancements in medical VLP. In this paper, we specifically review existing works through the lens of different pretraining objectives, architectures, downstream evaluation tasks, and datasets utilized for pretraining and downstream tasks. Subsequently, we delve into current challenges in medical VLP, discussing existing and potential solutions, and conclude by highlighting future directions. To the best of our knowledge, this is the first survey focused on medical VLP.