Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Matthieu Cord
Neural Information Processing Systems
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks without any multimodal finetuning. They are the de facto building block for Large Multimodal Models (LMMs), yet we still lack a proper understanding of their success. In this work, we expose frozen LLMs to image, video, audio, and text inputs and analyse their internal representations, aiming to understand their generalization beyond textual inputs. Our work provides the following findings. (1) Perceptual tokens are easily distinguishable from textual ones inside LLMs, with significantly different representations (e.g.
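The first finding, that perceptual tokens are easily distinguishable from textual ones, can be illustrated with a simple linear probe over hidden states. Below is a minimal sketch, assuming hidden states have already been extracted from a frozen LLM at text-token and perceptual-token positions; the random placeholder arrays, the hidden size, and the logistic-regression probe are illustrative stand-ins, not the paper's exact protocol.

```python
# Linear-probe sketch: if a simple linear classifier separates perceptual
# tokens from textual tokens with near-perfect accuracy, their hidden
# representations inside the frozen LLM are easily distinguishable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

HIDDEN_DIM = 512  # kept small for speed; e.g. LLaMA-7B uses 4096 (assumption)

# Placeholder features. In practice these would be hidden states collected
# from one layer of a frozen LLM:
#   text_hidden: states at text-token positions,       shape (n_text, HIDDEN_DIM)
#   perc_hidden: states at image/video/audio positions, shape (n_perc, HIDDEN_DIM)
rng = np.random.default_rng(0)
text_hidden = rng.normal(0.0, 1.0, size=(2000, HIDDEN_DIM))
perc_hidden = rng.normal(0.5, 1.0, size=(2000, HIDDEN_DIM))  # shifted statistics

X = np.concatenate([text_hidden, perc_hidden])
y = np.concatenate([np.zeros(len(text_hidden)), np.ones(len(perc_hidden))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy on held-out tokens: {probe.score(X_te, y_te):.3f}")
```

High probe accuracy on held-out tokens would indicate that modality information is linearly decodable from the representations, which is one common way such a separability claim is operationalized.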