Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference

Matsui, Yuta; Yamaki, Ryosuke; Ueda, Ryo; Shinagawa, Seitaro; Taniguchi, Tadahiro

Apr-15-2025, arXiv.org Artificial Intelligence

We propose the Metropolis-Hastings Captioning Game (MHCG), a method that fuses the knowledge of multiple vision-language models (VLMs) by having them learn from each other. Existing methods that combine multiple models incur inference costs and impose architectural constraints; MHCG avoids both problems by performing decentralized Bayesian inference through a process resembling a language game. In this knowledge fusion process, two VLM agents communicate by alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing the VLMs' category-level vocabulary by observing the occurrence of that vocabulary in the generated captions.
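To make the alternating speaker/listener structure concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not state the acceptance rule, so the sketch assumes the generic Metropolis-Hastings form in which the listener accepts the speaker's proposed caption with probability min(1, p_listener(proposed) / p_listener(current)). Each "agent" is a hypothetical probability table over three captions for a single image, standing in for a VLM, and the names `mh_accept_prob` and `play_game` are illustrative only.

```python
import random


def mh_accept_prob(listener_probs, proposed, current):
    """Assumed acceptance rule: min(1, p_listener(proposed) / p_listener(current))."""
    p_current = listener_probs[current]
    if p_current == 0.0:
        # The listener gives no mass to its current caption, so any proposal is accepted.
        return 1.0
    return min(1.0, listener_probs[proposed] / p_current)


def play_game(probs_a, probs_b, rounds, seed=0):
    """Alternate speaker and listener roles; return how often each caption was held."""
    rng = random.Random(seed)
    captions = list(probs_a)
    current = rng.choice(captions)  # arbitrary starting caption
    counts = {c: 0 for c in captions}
    for t in range(rounds):
        # Agents swap roles every round.
        speaker, listener = (probs_a, probs_b) if t % 2 == 0 else (probs_b, probs_a)
        # The speaker proposes a caption sampled from its own distribution.
        weights = [speaker[c] for c in captions]
        proposed = rng.choices(captions, weights=weights)[0]
        # The listener accepts or rejects using only its own probabilities.
        if rng.random() < mh_accept_prob(listener, proposed, current):
            current = proposed
        counts[current] += 1
    return {c: n / rounds for c, n in counts.items()}


# Hypothetical toy agents: each prefers "a dog" but disagrees on the alternatives.
agent_a = {"a dog": 0.6, "a cat": 0.3, "a bird": 0.1}
agent_b = {"a dog": 0.5, "a cat": 0.1, "a bird": 0.4}
frequencies = play_game(agent_a, agent_b, rounds=20000)
```

In this toy setting, neither agent reads the other's probability table: the speaker only samples from its own distribution and the listener only evaluates its own. Under the assumed rule, the accepted captions nevertheless concentrate on those that both agents rate as likely, which is the sense in which such a game can be called decentralized.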

artificial intelligence, machine learning, natural language
