Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Loris Giulivi, Giacomo Boracchi

arXiv.org Artificial Intelligence 

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and generating content across various modalities, such as images and text. However, their interpretability remains a challenge, hindering their adoption in critical applications. This research proposes a novel approach to enhance the interpretability of MLLMs by focusing on the image embedding component. We combine an open-world localization model with an MLLM, thus creating a new architecture that simultaneously produces text and object localization outputs from the same vision embedding. The proposed architecture greatly promotes interpretability, enabling us to design a novel saliency map to explain any output token, to identify model hallucinations, and to assess model biases through semantic adversarial perturbations.
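A minimal sketch of the shared-embedding idea described above: a single vision encoder produces an embedding that is read by both a text-generation branch and an open-world localization branch. All module names, dimensions, and heads below are hypothetical placeholders chosen for illustration, not the authors' implementation.

```python
# Hedged illustration: one vision embedding feeds both text and localization outputs,
# which is what lets the localization branch explain individual output tokens.
import torch
import torch.nn as nn


class SharedVisionMLLM(nn.Module):
    def __init__(self, embed_dim: int = 256, vocab_size: int = 1000, num_queries: int = 16):
        super().__init__()
        # Placeholder vision encoder: stands in for the MLLM's image embedding component.
        self.vision_encoder = nn.Linear(3 * 32 * 32, embed_dim)
        # Placeholder text branch: maps the shared embedding to token logits.
        self.text_decoder = nn.Linear(embed_dim, vocab_size)
        # Placeholder open-world localization head: predicts (x, y, w, h) boxes
        # for a fixed set of queries from the same embedding.
        self.loc_head = nn.Linear(embed_dim, num_queries * 4)
        self.num_queries = num_queries

    def forward(self, image: torch.Tensor):
        # Both branches read the *same* vision embedding.
        z = self.vision_encoder(image.flatten(1))                 # (B, embed_dim)
        token_logits = self.text_decoder(z)                       # (B, vocab_size)
        boxes = self.loc_head(z).view(-1, self.num_queries, 4)    # (B, Q, 4)
        return token_logits, boxes


if __name__ == "__main__":
    model = SharedVisionMLLM()
    dummy_image = torch.randn(2, 3, 32, 32)
    logits, boxes = model(dummy_image)
    print(logits.shape, boxes.shape)  # torch.Size([2, 1000]) torch.Size([2, 16, 4])
```

Because both outputs depend on the same embedding, perturbing or attributing that embedding (e.g., for a saliency map over an output token) yields explanations that are directly grounded in localizable image regions.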
