AITopics | slake

Collaborating Authors

slake

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding

Padhi, Trilok, Kaur, Ramneet, Cobb, Adam D., Acharya, Manoj, Roy, Anirban, Samplawski, Colin, Matejek, Brian, Berenbeim, Alexander M., Bastian, Nathaniel D., Jha, Susmit

arXiv.org Artificial IntelligenceMay-8-2025

We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model's confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.03788

Country: North America > United States (1.00)

Genre:

Research Report > Promising Solution (0.34)
Overview > Innovation (0.34)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations

Zhu, Kangyu, Qin, Ziyuan, Yi, Huahui, Jiang, Zekun, Lao, Qicheng, Zhang, Shaoting, Li, Kang

arXiv.org Artificial IntelligenceJan-4-2025

With the recent advancements in vision-language models (VLMs) driven by large language models (LLMs), many researchers have focused on models that comprised of an image encoder, an image-to-language projection layer, and a text decoder architectures, leading to the emergence of works like LLava-Med. However, these works primarily operate at the whole-image level, aligning general information from 2D medical images without attending to finer details. As a result, these models often provide irrelevant or non-clinically valuable information while missing critical details. Medical vision-language tasks differ significantly from general images, particularly in their focus on fine-grained details, while excluding irrelevant content. General domain VLMs tend to prioritize global information due to their design, which compresses the entire image into a multi-token representation that is passed into the LLM decoder. Therefore, current VLMs all lack the capability to restrict their attention to particular areas. To address this critical issue in the medical domain, we introduce MedVP, an visual prompt generation and fine-tuning framework, which involves extract medical entities, generate visual prompts, and adapt datasets for visual prompt guided fine-tuning. To the best of our knowledge, this is the first work to explicitly introduce visual prompt into medical VLMs, and we successfully outperform recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2501.02385

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Jiang, Songtao, Zheng, Tuo, Zhang, Yan, Jin, Yeying, Yuan, Li, Liu, Zuozhu

arXiv.org Artificial IntelligenceJun-26-2024

Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework Med-MoE (Mixture-of-Experts) that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta expert. Comprehensive experiments on both open- and close-end medical question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30\%-50\% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.

llava-med, med-moe, preprint arxiv, (16 more...)

arXiv.org Artificial Intelligence

2404.10237

Country:

North America > United States (0.04)
Europe > Sweden (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.64)
Instructional Material (0.55)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model

Kim, Taehee, Cho, Yeongjae, Shin, Heejun, Jo, Yohan, Shin, Dongmyung

arXiv.org Artificial IntelligenceJan-16-2024

Visual question answering (VQA) is a task where an image is given, and a series of questions are asked about the image. To build an efficient VQA algorithm, a large amount of QA data is required which is very expensive. Generating synthetic QA pairs based on templates is a practical way to obtain data. However, VQA models trained on those data do not perform well on complex, human-written questions. To address this issue, we propose a new method called {\it chain of QA for human-written questions} (CoQAH). CoQAH utilizes a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason and derive logical answers for human-written questions. We tested the effectiveness of CoQAH on two types of human-written VQA datasets for 3D-rendered and chest X-ray images and found that it achieved state-of-the-art accuracy in both types of data. Notably, CoQAH outperformed general vision-language models, VQA models, and medical foundation models with no finetuning.

coqah, dataset, vqa model, (16 more...)

arXiv.org Artificial Intelligence

2401.064

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering

Liu, Bo, Zhan, Li-Ming, Xu, Li, Ma, Lin, Yang, Yan, Wu, Xiao-Ming

arXiv.org Artificial IntelligenceFeb-18-2021

Medical visual question answering (Med-VQA) has tremendous potential in healthcare. However, the development of this technology is hindered by the lacking of publicly-available and high-quality labeled datasets for training and evaluation. In this paper, we present a large bilingual dataset, SLAKE, with comprehensive semantic labels annotated by experienced physicians and a new structural medical knowledge base for Med-VQA. Besides, SLAKE includes richer modalities and covers more human body parts than the currently available dataset. We show that SLAKE can be used to facilitate the development and evaluation of Med-VQA systems. The dataset can be downloaded from http://www.med-vqa.com/slake.

dataset, knowledge graph, slake, (13 more...)

arXiv.org Artificial Intelligence

2102.09542

Country:

Asia > China > Hong Kong (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area (0.70)
Health & Medicine > Diagnostic Medicine > Imaging (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.87)

Add feedback