The Telephone Game: Evaluating Semantic Drift in Unified Models
Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
–arXiv.org Artificial Intelligence
Figure caption: At every step we observe semantic drift. For example, by the 5th generation the model fails to generate a convincing suitcase, which also hints at cross-modal inconsistency. These phenomena are magnified under the multi-generation telephone-game evaluation, allowing it to capture subtler performance differences between models.

Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair, T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, nor whether semantic meaning is preserved when cycling between the image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. SDP reveals substantial variation in cross-modal stability: some models, such as BAGEL, maintain semantic meaning over many alternations, whereas others, such as VILA-U, drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations.
Multimodal Unified Models (UMs) combine visual understanding and generation within a single framework, enabling a wide range of unimodal tasks (e.g., text-to-text, image-to-image) as well as cross-modal tasks (e.g., image-to-text, text-to-image). Despite rapid model progress, UM evaluation remains fragmented: current single-pass metrics do not assess the retention of entities, attributes, relations, and counts under alternating I2T and T2I conversions. We defer unimodal tasks and center our analysis on I2T and T2I, since the potential for semantic divergence, and its impact on real use, is most pronounced in cross-modal tasks.
Oct-7-2025