Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Jun-9-2026, 14:34:42 GMT – Neural Information Processing Systems

Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, which leads to hallucinations (context omission, conflation, and misinterpretation). Existing methods based on Direct Preference Optimization (DPO) constrain optimization to a single image reference within the input sequence, neglecting holistic context modeling. To address this, we propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. Our approach features two sequentially dependent components: (i) Context-Level Optimization: By introducing low-cost sequence preference pairs, we optimize the model to distinguish between complete and disrupted multi-image contexts, thereby correcting cognitive biases in MLLMs' multi-image understanding.
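For readers unfamiliar with the objective the abstract builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the CcDPO objective itself. The function and argument names are hypothetical, and mapping "chosen" to a response grounded in the complete multi-image context and "rejected" to one reflecting a disrupted context is an assumption drawn from the abstract's wording.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. Assumption for illustration: "chosen" is the
    response for the complete image sequence, "rejected" the one for a
    disrupted sequence.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits)
    return softplus(-logits)
```

When the policy matches the reference on both responses, the loss equals log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference.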

artificial intelligence, natural language, optimization

Technology:
- Information Technology > Artificial Intelligence > Natural Language (0.59)