Huang, Ziwei
MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
Wang, Yi, Liu, Mushui, He, Wanggui, Zhang, Longxiang, Huang, Ziwei, Zhang, Guanghao, Shu, Fangxun, Tao, Zhong, She, Dong, Yu, Zhelun, Li, Haoyuan, Dai, Weilong, Song, Mingli, Song, Jie, Jiang, Hao
Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, a setting that is difficult to handle by relying solely on straightforward text-to-image generation. In response to this challenge, we introduce MINT, an innovative unified generative model that, for the first time, is empowered with native multimodal chain of thought (MCoT) for enhanced image generation. First, we design the Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities while avoiding the modality conflicts that could otherwise keep each modality from reaching its full potential. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT demonstrates superior performance across multiple benchmarks for both text-to-image (T2I) and image-to-text (I2T) tasks.
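A minimal, illustrative sketch of the expert-parallel idea described in the abstract: tokens attend jointly in a shared self-attention layer but are processed by separate text/image feed-forward experts to limit cross-modal interference. All names (e.g., ModalityExpertBlock) and hyperparameters below are hypothetical, not the paper's actual MTXpert implementation.

```python
import torch
import torch.nn as nn

class ModalityExpertBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per modality (0 = text, 1 = image).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)
        ])

    def forward(self, x, modality_ids):
        # Shared self-attention over the full multimodal token sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route each token to the feed-forward expert of its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out

# Toy usage: a batch with 6 text tokens followed by 10 image tokens.
tokens = torch.randn(2, 16, 512)
modality_ids = torch.tensor([[0] * 6 + [1] * 10] * 2)
block = ModalityExpertBlock()
print(block(tokens, modality_ids).shape)  # torch.Size([2, 16, 512])
```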
Rethinking Light Decoder-based Solvers for Vehicle Routing Problems
Huang, Ziwei, Zhou, Jianan, Cao, Zhiguang, Xu, Yixin
Light decoder-based solvers have gained popularity for solving vehicle routing problems (VRPs) due to their efficiency and ease of integration with reinforcement learning algorithms. This paper revisits light decoder-based approaches, analyzing the implications of their reliance on static embeddings and the inherent challenges that arise. Specifically, we demonstrate that in the light decoder paradigm, the encoder is implicitly tasked with capturing information for all potential decision scenarios during solution construction within a single set of embeddings, resulting in high information density. Furthermore, our empirical analysis reveals that the overly simplistic decoder struggles to effectively utilize this dense information, particularly as task complexity increases, which limits generalization to out-of-distribution (OOD) settings. Building on these insights, we show that enhancing the decoder capacity, with a simple addition of identity mapping and a feed-forward layer, can considerably alleviate the generalization issue. Experimentally, our method significantly enhances the OOD generalization of light decoder-based approaches on large-scale instances and complex VRP variants, narrowing the gap with the heavy decoder paradigm. Our code is available at: https://github.com/

Vehicle Routing Problems (VRPs) are a fundamental class of NP-hard combinatorial optimization problems (COPs) with wide-ranging applications in logistics (Konstantakopoulos et al., 2022), transportation (Garaix et al., 2010), and supply chain management (Dondo et al., 2011). Efficiently solving VRPs is critical for reducing operational costs and enhancing service quality in practice. Traditionally, VRPs have been tackled either using exact solvers (e.g., Gurobi) or heuristic solvers (e.g., LKH-3 (Helsgaun, 2017)). While these methods can yield high-quality (or even optimal) solutions for small to moderate-sized instances, they often face challenges in scaling to larger problem sizes or adapting to different problem variants without extensive domain expertise or manual tuning. Neural solvers have emerged as a promising alternative by leveraging advanced deep learning techniques to learn solution strategies directly from data (Bengio et al., 2021). Numerous neural solvers have been proposed for solving VRPs (Bogyrbayeva et al., 2024), with autoregressive construction solvers gaining particular popularity. These solvers sequentially build solutions by adding one feasible node at a time and are valued for their conceptual simplicity and flexibility across different VRP variants. Among them, (heavy encoder) light decoder-based solvers (Vinyals et al., 2015; Kool et al., 2019; Kwon et al., 2020; Kim et al., 2022; Gao et al., 2024; Liu et al., 2024) stand out for their computational efficiency and ease of integration with reinforcement learning (RL) algorithms.
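A minimal sketch of the decoder-capacity change described above, assuming a standard attention-based light decoder for VRPs: the context query passes through an added feed-forward layer with an identity (residual) connection before the usual compatibility step. The class name EnhancedLightDecoder, the tanh clipping, and all dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EnhancedLightDecoder(nn.Module):
    def __init__(self, d_model=128, d_ff=512, clip=10.0):
        super().__init__()
        self.clip = clip
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        # Added capacity: identity mapping + feed-forward refinement of the query.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, context, node_embeddings, mask):
        # context: (B, d), node_embeddings: (B, N, d), mask: (B, N) with True = infeasible node.
        q = self.q_proj(context)
        q = self.norm(q + self.ffn(q))          # identity mapping + feed-forward layer
        k = self.k_proj(node_embeddings)
        logits = torch.einsum("bd,bnd->bn", q, k) / k.size(-1) ** 0.5
        logits = self.clip * torch.tanh(logits)
        logits = logits.masked_fill(mask, float("-inf"))
        return torch.log_softmax(logits, dim=-1)  # log-probabilities over the next node

# Toy usage with 20 customer nodes and no masked nodes.
dec = EnhancedLightDecoder()
logp = dec(torch.randn(4, 128), torch.randn(4, 20, 128), torch.zeros(4, 20, dtype=torch.bool))
print(logp.shape)  # torch.Size([4, 20])
```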
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts
Huang, Ziwei, He, Wanggui, Long, Quanyu, Wang, Yandi, Li, Haoyuan, Yu, Zhelun, Shu, Fangxun, Chan, Long, Jiang, Hao, Gan, Leilei, Wu, Fei
Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively fewer studies addressing the evaluation of the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To mitigate this gap, we present T2I-FactualBench, the largest benchmark to date in terms of the number of concepts and prompts, specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA) based evaluation framework to assess the factuality of the three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.
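An illustrative sketch of a multi-round VQA-style factuality check, assuming a generic `vqa_model(image, question) -> str` callable; the question templates, rounds, and scoring below are hypothetical placeholders, not the benchmark's actual protocol.

```python
from dataclasses import dataclass

@dataclass
class ConceptCheck:
    concept: str      # knowledge-intensive concept named in the prompt
    questions: list   # yes/no questions probing the concept's factual attributes

def factuality_score(image, checks, vqa_model):
    """Run several rounds of VQA and return the fraction of factual checks passed."""
    passed, total = 0, 0
    for check in checks:
        # Round 1: is the concept depicted at all?
        if vqa_model(image, f"Does the image show {check.concept}?").strip().lower() != "yes":
            total += len(check.questions)   # all attribute checks count as failed
            continue
        # Later rounds: verify each factual attribute of the concept.
        for q in check.questions:
            total += 1
            if vqa_model(image, q).strip().lower() == "yes":
                passed += 1
    return passed / max(total, 1)

# Toy usage with a dummy VQA model that always answers "yes".
checks = [ConceptCheck("the Eiffel Tower", ["Is the tower made of iron lattice?"])]
print(factuality_score(object(), checks, lambda img, q: "yes"))  # 1.0
```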
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Xiao, Wenyi, Huang, Ziwei, Gan, Leilei, He, Wanggui, Li, Haoyuan, Yu, Zhelun, Jiang, Hao, Wu, Fei, Zhu, Linchao
The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but they still suffer from hallucination, where the generated text does not align with the given context, significantly restricting their use. Most previous work detects and mitigates hallucination at a coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is to use proprietary models to generate a small sentence-level hallucination annotation dataset, with which we train a hallucination detection model that performs sentence-level detection covering the primary hallucination types (i.e., object, attribute, and relationship). We then propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination-mitigating model. Furthermore, we propose differentiating the severity of hallucinations and introduce Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO), which mitigates hallucination in LVLMs by incorporating hallucination severity into preference learning. Extensive experiments demonstrate the effectiveness of our method.
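A hedged sketch of a severity-weighted DPO-style objective, assuming hallucination severity enters as a per-pair weight on the standard DPO margin; the exact HSA-DPO formulation is not given in the abstract, so this is only illustrative.

```python
import torch
import torch.nn.functional as F

def hsa_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 severity, beta=0.1):
    """Standard DPO margin, scaled per example by a hallucination-severity weight."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    # Heavier hallucinations (higher severity) contribute more to the loss.
    return (severity * -F.logsigmoid(margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
lp = lambda: torch.randn(4)
print(hsa_dpo_loss(lp(), lp(), lp(), lp(), severity=torch.tensor([1.0, 2.0, 1.0, 3.0])))
```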
Generating Realistic Counterfactuals for Retinal Fundus and OCT Images using Diffusion Models
Ilanchezian, Indu, Boreiko, Valentyn, Kühlewein, Laura, Huang, Ziwei, Ayhan, Murat Seçkin, Hein, Matthias, Koch, Lisa, Berens, Philipp
Counterfactual reasoning is often used in clinical settings to explain decisions or weigh alternatives. Therefore, for imaging-based specialties such as ophthalmology, it would be beneficial to be able to create counterfactual images, illustrating answers to questions like "If the subject had had diabetic retinopathy, how would the fundus image have looked?". Here, we demonstrate that using a diffusion model in combination with an adversarially robust classifier trained on retinal disease classification tasks enables the generation of highly realistic counterfactuals of retinal fundus images and optical coherence tomography (OCT) B-scans. The key to the realism of the counterfactuals is that these classifiers encode salient features indicative of each disease class and can steer the diffusion model to depict disease signs or remove disease-related lesions in a realistic way. In a user study, domain experts also found the counterfactuals generated with our method significantly more realistic than those generated by a previous method, and even indistinguishable from real images.
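An illustrative sketch of classifier-guided denoising toward a target class, assuming generic `denoiser(x_t, t) -> predicted noise` and `classifier(x_t) -> logits` callables; the noise schedule and guidance scale are placeholders, not the paper's settings, and the update follows the standard classifier-guidance recipe rather than the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def guided_step(x_t, t, target_class, denoiser, classifier,
                alpha_bar_t, guidance_scale=5.0):
    """One reverse-diffusion step whose noise estimate is steered by a robust classifier."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = F.log_softmax(classifier(x_in), dim=-1)[:, target_class].sum()
        grad = torch.autograd.grad(log_prob, x_in)[0]  # pushes x_t toward the target class
    eps = denoiser(x_t, t)
    # Shift the noise estimate with the scaled classifier gradient (Dhariwal & Nichol style).
    eps_guided = eps - guidance_scale * (1 - alpha_bar_t) ** 0.5 * grad
    # Predict the clean image under the guided noise estimate.
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_guided) / alpha_bar_t ** 0.5
    return x0_pred

# Toy usage with stand-in callables on a 1x3x8x8 "image".
x_t = torch.randn(1, 3, 8, 8)
denoiser = lambda x, t: torch.zeros_like(x)
classifier = lambda x: x.flatten(1)[:, :2]  # two fake class logits
print(guided_step(x_t, t=10, target_class=1, denoiser=denoiser,
                  classifier=classifier, alpha_bar_t=0.5).shape)  # torch.Size([1, 3, 8, 8])
```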