GenEval: An object-focused framework for evaluating text-to-image alignment

Neural Information Processing Systems

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given that human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a distribution-level measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative reasoning capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models.
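
The detection-based check described above can be sketched in a few lines. This is a toy illustration, not the released GenEval code: the spec format, the 0.5 confidence threshold, and the mocked detector output are all assumptions.

```python
# Illustrative sketch: verify a generated image against a prompt's object
# specification using the output of an object detector. The detector is
# mocked here; GenEval uses a pretrained detection model in practice.

def evaluate_generation(spec, detections):
    """Return True if the detections satisfy the prompt spec.

    spec: {"counts": {"dog": 2, "cat": 1}} -- required object counts.
    detections: list of (label, confidence, box) tuples from a detector.
    """
    found = {}
    for label, conf, box in detections:
        if conf >= 0.5:                      # confidence threshold (assumed)
            found[label] = found.get(label, 0) + 1
    # Co-occurrence and counting: every required object must appear
    # exactly the required number of times.
    return all(found.get(obj, 0) == n for obj, n in spec["counts"].items())

# Mocked detector output for an image generated from "two dogs and a cat"
dets = [("dog", 0.9, (0, 0, 50, 50)),
        ("dog", 0.8, (60, 0, 110, 50)),
        ("cat", 0.7, (0, 60, 50, 110)),
        ("bird", 0.3, (60, 60, 80, 80))]     # below threshold, ignored
print(evaluate_generation({"counts": {"dog": 2, "cat": 1}}, dets))  # True
```

The same pattern extends to position (compare box centers) and color (crop the box and run a color classifier), as the abstract describes.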


Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent

Lin, Jianzhe, Pan, Zeyu, Zhu, Yun, Song, Ruiqi, Yang, Jining

arXiv.org Artificial Intelligence

We introduce SuperIntelliAgent, an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (the verifier), enabling continual intelligence growth through self-supervised interaction. Unlike conventional supervised fine-tuning with annotated data, SuperIntelliAgent learns autonomously in an annotation-free manner: the learner generates candidate outputs, the verifier evaluates them via step-by-step reasoning, and the learner-verifier interaction loop produces chosen/rejected pairs for Direct Preference Optimization (DPO), transforming every input into a pseudo-training signal for continual self-improvement. The framework integrates a dual-scale memory mechanism: short-term, in-context memory that preserves reasoning traces across iterative refinement cycles, and long-term memory that consolidates acquired knowledge into model parameters through on-the-fly fine-tuning. To enhance optimization, a replay buffer selectively retains samples showing verifiable progress from failed to satisfied conditions and replays them as auxiliary supervision, reinforcing recent learning while bootstrapping adaptive curricula that accelerate intelligence acquisition. Designed to be infrastructure-agnostic, SuperIntelliAgent can be seamlessly integrated into existing agentic frameworks (e.g., AutoGen, Semantic Kernel), while simultaneously transforming ordinary inference cycles into lifelong optimization. We posit that agentic pairing constitutes the minimal reliable unit of growing intelligence, as paired feedback, augmented with partial-history replay, yields richer learning curricula, tighter preference alignment, and stronger generalization. With extremely few DPO pairs generated automatically by SuperIntelliAgent and used for lightweight fine-tuning, the learner's performance improves across all benchmarks.
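
The learner-verifier loop that produces DPO pairs can be sketched as below. All names are illustrative assumptions; the paper's actual sampling, scoring, and buffer policies are more involved than this toy version.

```python
# Hypothetical sketch of the loop described above: the learner proposes
# candidates, a frozen verifier scores them, and the higher/lower scored
# candidates become chosen/rejected DPO pairs. A replay buffer retains
# samples that progressed from a failing to a satisfying output.

def collect_dpo_pairs(inputs, learner, verifier, replay_buffer):
    pairs = []
    for x in inputs:
        a, b = learner(x), learner(x)             # two candidate outputs
        sa, sb = verifier(x, a), verifier(x, b)   # verifier scores in [0, 1]
        if sa == sb:
            continue                              # no preference signal
        chosen, rejected = (a, b) if sa > sb else (b, a)
        pairs.append((x, chosen, rejected))
        # "Failed to satisfied" progress (threshold 0.5, assumed) is kept
        # for replay as auxiliary supervision.
        if min(sa, sb) < 0.5 <= max(sa, sb):
            replay_buffer.append((x, chosen))
    return pairs

# Toy learner/verifier: the learner alternates outputs, the verifier
# prefers the string "good".
outs = iter(["bad", "good", "good", "good"])
learner = lambda x: next(outs)
verifier = lambda x, y: 1.0 if y == "good" else 0.0
buf = []
pairs = collect_dpo_pairs(["q1", "q2"], learner, verifier, buf)
```

Here "q1" yields a usable pair (and a replay entry), while "q2" produces two equally scored candidates and is skipped, matching the idea that only informative interactions become pseudo-training signal.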


Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

Gu, Zeqi, Georgopoulos, Markos, Dai, Xiaoliang, Ghazvininejad, Marjan, Wang, Chu, Juefei-Xu, Felix, Li, Kunpeng, Shi, Yujun, He, Zecheng, He, Zijian, Zhou, Jiawei, Davis, Abe, Wang, Jialiang

arXiv.org Artificial Intelligence

Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
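
A difficulty-adaptive conciseness reward in the spirit described above might look like the following. The functional form, budget, and coefficient are assumptions for illustration, not the formula from the paper.

```python
# Illustrative sketch: reward concise chain-of-thought prompts, but scale
# the allowed length by an estimated task difficulty so harder prompts
# may reason longer before being penalized.

def cot_reward(quality, cot_length, difficulty, base_budget=64, alpha=0.01):
    """quality: image quality/alignment score in [0, 1];
    cot_length: tokens in the chain-of-thought prompt;
    difficulty: estimated task difficulty in [0, 1]."""
    budget = base_budget * (1.0 + difficulty)   # harder tasks get more tokens
    overshoot = max(0.0, cot_length - budget)
    return quality - alpha * overshoot          # penalize only the excess

# The same 200-token CoT is penalized more on an easy task than a hard one.
easy = cot_reward(quality=0.9, cot_length=200, difficulty=0.0)
hard = cot_reward(quality=0.9, cot_length=200, difficulty=1.0)
```

Plugging such a reward into a standard RL fine-tuning loop pushes the model toward shorter reasoning exactly where long reasoning is not needed, which is the intuition behind the reported 54% length reduction.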


Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

Bader, Jessica, Pach, Mateusz, Bravo, Maria A., Belongie, Serge, Akata, Zeynep

arXiv.org Artificial Intelligence

Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
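
The stitching idea can be illustrated with a pixel-space toy: generate each object independently, then paste it into its bounding box on a shared canvas so the spatial layout holds by construction. The real method operates mid-generation inside MMDiT attention, not on finished pixels; this sketch only shows the compositing step.

```python
# Toy compositing sketch (assumed interfaces): per-object patches are
# placed into automatically generated bounding boxes on one canvas.

def stitch(canvas, objects):
    """canvas: H x W grid (list of lists); objects: list of (patch, box)
    where box = (top, left) gives the patch's placement."""
    for patch, (top, left) in objects:
        for i, row in enumerate(patch):
            for j, val in enumerate(row):
                canvas[top + i][left + j] = val
    return canvas

canvas = [[0] * 6 for _ in range(4)]
dog = [[1, 1], [1, 1]]          # 2x2 stand-in for a generated "dog"
cat = [[2, 2], [2, 2]]          # 2x2 stand-in for a generated "cat"
# "a dog to the left of a cat" -> dog box at column 0, cat box at column 4
out = stitch(canvas, [(dog, (1, 0)), (cat, (1, 4))])
```

Because each object is confined to its box, relations like "to the left of" are satisfied regardless of how well the generator itself handles spatial language, which is why the approach helps most on Position-style tasks.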


Measuring Retrieval Complexity in Question Answering Systems

Gabburo, Matteo, Jedema, Nicolaas Paul, Garg, Siddhant, Ribeiro, Leonardo F. R., Moschitti, Alessandro

arXiv.org Artificial Intelligence

In this paper, we investigate which questions are challenging for retrieval-based Question Answering (QA). We (i) propose retrieval complexity (RC), a novel metric conditioned on the completeness of retrieved documents, which measures the difficulty of answering questions, and (ii) propose an unsupervised pipeline to measure RC given an arbitrary retrieval system. Our proposed pipeline measures RC more accurately than alternative estimators, including LLMs, on six challenging QA benchmarks. Further investigation reveals that RC scores strongly correlate with both QA performance and expert judgment across five of the six studied benchmarks, indicating that RC is an effective measure of question difficulty. Subsequent categorization of high-RC questions shows that they span a broad set of question shapes, including multi-hop, compositional, and temporal QA, indicating that RC scores can categorize a new subset of complex questions. Our system can also have a major impact on retrieval-based systems by helping to identify more challenging questions on existing datasets.
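
One way to picture a completeness-conditioned difficulty score is the sketch below: a question scores higher when the top-k retrieved documents cover less of the evidence needed to answer it. The actual RC pipeline is unsupervised and considerably more sophisticated; the snippet-matching here and all names are assumptions.

```python
# Illustrative sketch: difficulty as 1 minus the completeness of the
# retrieved evidence, in the spirit of retrieval complexity (RC).

def retrieval_complexity(required_evidence, retrieved_docs, k=5):
    """required_evidence: set of evidence snippets needed for the answer;
    retrieved_docs: ranked list of document strings."""
    top_k = " ".join(retrieved_docs[:k]).lower()
    covered = sum(1 for e in required_evidence if e.lower() in top_k)
    completeness = covered / max(1, len(required_evidence))
    return 1.0 - completeness          # high RC = incomplete retrieval

docs = ["Paris is the capital of France.", "France is in Europe."]
easy = retrieval_complexity({"capital of France"}, docs)
multi_hop = retrieval_complexity({"capital of France",
                                  "population of Paris"}, docs)
```

A single-fact question fully covered by the corpus gets RC 0, while a multi-hop question whose second hop is missing from the retrieved set scores higher, matching the abstract's observation that high-RC questions skew multi-hop, compositional, and temporal.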


GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Ghosh, Dhruba, Hajishirzi, Hanna, Schmidt, Ludwig

arXiv.org Artificial Intelligence

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given that human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.
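
The color-verification step this abstract mentions, where a discriminative model checks a property inside a detected box, can be sketched with a stand-in classifier. The nearest-mean-RGB palette below is purely illustrative; GenEval links an actual vision model for this.

```python
# Illustrative stand-in for color verification: crop the detected object's
# bounding box and match its mean color to a small named palette.

PALETTE = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def dominant_color(image, box):
    """image: H x W list of (r, g, b) tuples; box: (top, left, bottom, right)."""
    top, left, bottom, right = box
    pixels = [image[i][j] for i in range(top, bottom)
                          for j in range(left, right)]
    n = len(pixels)
    mean = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    # nearest palette entry by squared RGB distance
    return min(PALETTE,
               key=lambda name: sum((m - v) ** 2
                                    for m, v in zip(mean, PALETTE[name])))

img = [[(250, 10, 5)] * 4 for _ in range(4)]   # mostly red test image
print(dominant_color(img, (0, 0, 4, 4)))       # -> "red"
```

Chaining such property checks onto the detector's boxes is what lets the framework score attribute binding per instance rather than per distribution.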