Generation result
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. To address this, we introduce M$^{3}$T2IBench, a large-scale multi-category, multi-instance, multi-relation text-to-image benchmark. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose Revise-Then-Enforce, a training-free post-editing approach that improves image-text alignment across a broad range of diffusion models.

Text-to-Image (T2I) models have demonstrated impressive performance in generating high-quality, realistic images (Betker et al., 2023; Esser et al., 2024). Despite this success, T2I models continue to struggle to accurately interpret and follow user prompts: they may fail to generate objects with the correct number, attributes, or relationships (Li et al., 2024). Assessing the alignment between the text and the generated image has remained a longstanding challenge. There are generally three approaches to evaluating image-text alignment. The first uses pretrained image-text models to produce an overall alignment score; CLIP Score (Hessel et al., 2021) is a widely used metric of this kind, and VQAScore (Lin et al., 2024) is an improved version of it. However, these metrics have several limitations, including their inability to accurately reflect the true alignment between the image and the text (Li et al., 2024) and their failure to provide explainable evaluation results.

Figure 1: A failure case generated by Stable-Diffusion-3.
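The first evaluation approach above has a simple closed form: CLIP Score is the rescaled cosine similarity between CLIP's image and text embeddings, clipped at zero (Hessel et al., 2021). Below is a minimal sketch of that idea, assuming the Hugging Face `transformers` CLIP API and the `openai/clip-vit-base-patch32` checkpoint as the backbone; it is not the benchmark's own evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str, w: float = 2.5) -> float:
    """CLIPScore-style alignment: w * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)
```

Because the whole prompt collapses into a single embedding, a score like this is cheap to compute but coarse, which is exactly why it struggles with the multi-instance, multi-relation prompts the benchmark targets.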
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > California (0.04)
- Europe > Monaco (0.04)
- (2 more...)
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
Wang, Junyou; Chen, Zehua; Yuan, Binjie; Zheng, Kaiwen; Li, Chang; Jiang, Yuxuan; Zhu, Jun
Guidance methods have demonstrated significant improvements in cross-modal audio generation, including text-to-audio (T2A) and video-to-audio (V2A) generation. The popularly adopted method, classifier-free guidance (CFG), steers generation by emphasizing condition alignment, enhancing fidelity but often at the cost of diversity. Recently, autoguidance (AG) has been explored for audio generation, encouraging the sampling process to faithfully reconstruct the target distribution and showing increased diversity. Despite these advances, these methods usually rely on a single guiding principle, e.g., condition alignment in CFG or score accuracy in AG, leaving the full potential of guidance for audio generation untapped. In this work, we explore enriching the composition of the guidance method and present a mixture-of-guidance framework, AudioMoG. Within its design space, AudioMoG can exploit the complementary advantages of distinct guiding principles by realizing their cumulative benefits. In its reduced form, AudioMoG can consider parallel complements or recover a single guiding principle without sacrificing generality. We experimentally show that, at the same inference speed, AudioMoG consistently outperforms single guidance in T2A generation across sampling steps, while also showing advantages in V2A, text-to-music, and image generation. These results highlight a "free lunch" in current cross-modal audio generation systems: higher quality can be achieved through mixed guiding principles at the sampling stage without sacrificing inference efficiency. Demo samples are available at: https://audio-mog.github.io.
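The abstract does not spell out how the guiding principles are combined, so the sketch below only mixes the two standard update rules, classifier-free guidance and autoguidance, on a denoiser's noise prediction. The weights `w_cfg` and `w_ag` and the convex mixing factor `alpha` are illustrative assumptions, not AudioMoG's actual rule.

```python
def mixed_guidance(eps_cond, eps_uncond, eps_weak, w_cfg=3.0, w_ag=1.5, alpha=0.5):
    """Schematic mixture of two guidance signals on a diffusion denoiser's output.

    eps_cond   -- noise prediction of the main model given the condition (e.g., text)
    eps_uncond -- prediction of the main model with the condition dropped
    eps_weak   -- prediction of a weaker "guiding" model (used by autoguidance)
    """
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one to strengthen condition alignment.
    eps_cfg = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    # Autoguidance: extrapolate away from the weaker model's prediction to
    # correct its systematic errors and better cover the target distribution.
    eps_ag = eps_weak + w_ag * (eps_cond - eps_weak)
    # Illustrative combination: a convex blend of the two guided predictions.
    return alpha * eps_cfg + (1.0 - alpha) * eps_ag
```

Setting `alpha` to 1 or 0 recovers pure CFG or pure autoguidance, which mirrors the abstract's claim that the framework can reduce to a single guiding principle without extra inference cost.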
- Media > Music (0.94)
- Leisure & Entertainment (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Vision (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Steerable Scene Generation with Post Training and Inference-Time Search
Pfaff, Nicholas; Dai, Hongkai; Zakharov, Sergey; Iwase, Shun; Tedrake, Russ
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/
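The abstract names an MCTS-based inference-time search strategy but does not describe it, so the following is only a generic Monte Carlo tree search over denoising trajectories. It is written against two assumed callables, `denoise_step(x, t, rng)` for one reverse-diffusion step and `score(x0)` for the downstream objective on a fully denoised scene, and is meant to illustrate steering sampling toward a task objective rather than reproduce the authors' implementation.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    latent: object                 # partially denoised scene latent
    t: int                         # remaining denoising steps
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0             # sum of rollout scores backed up through this node

def mcts_denoise(x_T, T, denoise_step, score, n_iters=128, branch=4, c=1.4):
    """Generic MCTS over reverse-diffusion trajectories (assumed interfaces)."""
    root = Node(latent=x_T, t=T)
    best_x, best_s = None, float("-inf")
    for _ in range(n_iters):
        rng = random.Random()      # fresh randomness for this simulation
        # 1. Selection: descend by UCB while the current node is fully expanded.
        node, path = root, [root]
        while node.t > 0 and len(node.children) >= branch:
            node = max(
                node.children,
                key=lambda ch: ch.value / (ch.visits + 1e-9)
                + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
            )
            path.append(node)
        # 2. Expansion: take one more stochastic denoising step.
        if node.t > 0:
            child = Node(latent=denoise_step(node.latent, node.t, rng), t=node.t - 1)
            node.children.append(child)
            node = child
            path.append(node)
        # 3. Rollout: finish denoising, then score the completed scene.
        x, t = node.latent, node.t
        while t > 0:
            x = denoise_step(x, t, rng)
            t -= 1
        s = score(x)
        if s > best_s:
            best_x, best_s = x, s
        # 4. Backup: propagate the rollout score along the visited path.
        for n in path:
            n.visits += 1
            n.value += s
    return best_x, best_s
```

In this sketch, physical feasibility could be folded into `score`, for example by projecting the scene to a collision-free configuration or returning a large penalty when simulation rejects it, in the spirit of the projection-and-simulation step mentioned above.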
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
A Closed-form Token-level Decomposition
The typos do not affect the related conclusions. For the unsupervised LCG experiments, we use Yelp Reviews (Cho et al., 2018) and the WMT News section; please refer to the official website of the WMT dataset (Bojar et al., 2017) for more information. For the MT experiments, we load MarianMT from the es-en checkpoint provided on Hugging Face. All hyperparameters are tuned on the development set. We simply report the results after the maximum number of training epochs (usually 20). For more implementation details and tricks, please refer to our code.
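As a point of reference, here is a minimal loading sketch for the MT setup described above using the `transformers` library; the exact checkpoint name is not given in the text, so `Helsinki-NLP/opus-mt-es-en` (the standard Spanish-to-English MarianMT model on the Hugging Face Hub) is an assumption.

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed checkpoint: the standard es-en MarianMT model on the Hugging Face Hub.
checkpoint = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

# Translate a small Spanish batch into English.
batch = tokenizer(
    ["Todos los hiperparámetros se ajustan en el conjunto de desarrollo."],
    return_tensors="pt",
    padding=True,
)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```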