MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Jun-10-2026, 08:42:42 GMT–Neural Information Processing Systems

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which are particularly difficult to produce for fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process that compares these images (i.e., determines whether they are the same or different).
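The triplet construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the augmentations, the pairing scheme, or the reward, so the crop-and-flip `augment`, the `build_triplet_pairs` helper, and the exact-match `comparison_reward` are all hypothetical choices made for illustration. Images are represented as nested lists of pixel values to keep the example free of dependencies.

```python
import random


def augment(image, rng):
    """Return a cropped, possibly mirrored view of `image`.

    Assumption: crop plus horizontal flip stand in for whatever
    augmentations the paper actually uses.
    """
    height, width = len(image), len(image[0])
    crop_h, crop_w = max(1, height - 1), max(1, width - 1)
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    view = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:
        view = [row[::-1] for row in view]  # horizontal flip
    return view


def build_triplet_pairs(anchor, distractor, rng):
    """Build labeled comparison pairs from one image triplet.

    Two views come from `anchor`; the third comes from `distractor`,
    a similar but distinct image. Assumption: the distractor is also
    augmented so that image size alone does not reveal the answer.
    """
    view_a = augment(anchor, rng)
    view_b = augment(anchor, rng)
    other = augment(distractor, rng)
    return [
        (view_a, view_b, "same"),
        (view_a, other, "different"),
        (view_b, other, "different"),
    ]


def comparison_reward(prediction, label):
    """Rule-based reward (assumed form): 1.0 for a matching answer, else 0.0."""
    return 1.0 if prediction.strip().lower() == label else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    anchor = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    distractor = [[1, 2, 3], [4, 0, 6], [7, 8, 9]]  # differs in one pixel
    for first, second, label in build_triplet_pairs(anchor, distractor, rng):
        print(label, first, second)
```

The point of the sketch is that the labels come for free from how the triplet is built, so no manually written question-answer pairs are needed to score the model's final answer.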

artificial intelligence, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.62)
  - Vision (0.56)
  - Machine Learning > Reinforcement Learning (0.30)