Jiang, Boyuan
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
Xu, Pengcheng, Jiang, Boyuan, Hu, Xiaobin, Luo, Donghao, He, Qingdong, Zhang, Jiangning, Wang, Chengjie, Wu, Yunsheng, Ling, Charles, Wang, Boyu
Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs poorly in flow-based models, and existing invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these issues, we systematically analyze inversion and invariance control in the flow transformer. Specifically, we show that Euler inversion shares a similar structure with DDIM yet is more susceptible to approximation error. We therefore propose a two-stage inversion that first refines the velocity estimation and then compensates for the leftover error, which stays close to the model prior and benefits editing. Meanwhile, we propose an invariance control mechanism that manipulates the text features within the adaptive layer normalization, connecting changes in the text prompt to image semantics. This mechanism preserves the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, and facial expression. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
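The two-stage idea can be illustrated with a short sketch: refine the Euler velocity estimate with a fixed-point iteration during inversion, then measure the leftover reconstruction error so it can be compensated at editing time. The velocity_model callable, the step counts, and the way the residual is reused are assumptions for illustration, not the paper's exact algorithm.

import torch

def two_stage_flow_inversion(x0, velocity_model, cond, num_steps=28, refine_iters=2):
    # Stage 1: Euler inversion with a fixed-point refinement of the velocity
    # at every step, reducing the approximation error of plain Euler inversion.
    ts = torch.linspace(0.0, 1.0, num_steps + 1)  # t = 0 is data, t = 1 is noise
    x = x0.clone()
    for i in range(num_steps):
        t_cur, t_next = ts[i], ts[i + 1]
        dt = t_next - t_cur
        v = velocity_model(x, t_cur, cond)
        for _ in range(refine_iters):
            v = 0.5 * (v + velocity_model(x + dt * v, t_next, cond))
        x = x + dt * v
    # Stage 2: reconstruct back to t = 0 and record the leftover error so it
    # can be added back (compensated) when editing from the inverted latent.
    x_rec = x.clone()
    for i in reversed(range(num_steps)):
        dt = ts[i] - ts[i + 1]
        x_rec = x_rec + dt * velocity_model(x_rec, ts[i + 1], cond)
    residual = x0 - x_rec
    return x, residual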
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Wang, Qiuheng, Shi, Yukai, Ou, Jiarong, Chen, Rui, Lin, Ke, Wang, Jiahao, Jiang, Boyuan, Yang, Haotian, Zheng, Mingwu, Tao, Xin, Yang, Fei, Wan, Pengfei, Zhang, Di
As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the split videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at https://koala36m.github.io/.
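As a rough illustration of the quality-filtering step, the sketch below combines several normalized per-video sub-metrics into a single suitability score and thresholds it; the metric names, weights, and threshold are hypothetical and not taken from the paper.

# Hedged sketch of a VTSS-style filter: a weighted combination of
# per-video sub-metrics (each normalized to [0, 1]) followed by a threshold.
def vtss(sub_metrics: dict, weights: dict) -> float:
    return sum(weights[k] * sub_metrics[k] for k in weights)

weights = {"aesthetic": 0.3, "motion": 0.2, "clarity": 0.3, "text_overlay": 0.2}
video = {"aesthetic": 0.8, "motion": 0.6, "clarity": 0.9, "text_overlay": 0.7}

score = vtss(video, weights)
keep = score >= 0.7  # hypothetical filtering threshold
print(f"VTSS = {score:.2f}, keep video = {keep}")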
NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models
Wu, Kai, Jiang, Boyuan, Jiang, Zhengkai, He, Qingdong, Luo, Donghao, Wang, Shengzhi, Liu, Qingwen, Wang, Chengjie
Multimodal large language models (MLLMs) provide a powerful mechanism for understanding visual information by building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions of images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, which leads to excessive dependence on linguistic tokens while neglecting visual information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations in MLLMs through the integration of noise feature perturbations. Noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights between visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Furthermore, NoiseBoost is the first to enable semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% under human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at https://kaiwu5.github.io/noiseboost.
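The core mechanism admits a very small sketch: perturb the projected visual tokens with Gaussian noise during training so the language model cannot lean entirely on linguistic context. The noise scale and tensor shapes below are illustrative assumptions, not the paper's settings.

import torch

def noise_boost(visual_features: torch.Tensor, std: float = 0.1, training: bool = True) -> torch.Tensor:
    # Add Gaussian noise to visual tokens as a regularizer that rebalances
    # attention between visual and linguistic tokens; identity at inference.
    if not training:
        return visual_features
    return visual_features + std * torch.randn_like(visual_features)

# Usage: perturb the vision-encoder output before it is concatenated with text embeddings.
vision_tokens = torch.randn(1, 256, 4096)   # e.g., image features projected to the LLM width
vision_tokens = noise_boost(vision_tokens, std=0.1)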
Parameter Transfer Extreme Learning Machine based on Projective Model
Chen, Chao, Jiang, Boyuan, Jin, Xinyu
In recent years, transfer learning has attracted much attention in the machine learning community. In this paper, we focus on parameter transfer under the framework of the extreme learning machine (ELM). Unlike existing parameter transfer approaches, which incorporate the source model information into the target by regularizing the difference between the source and target domain parameters, we propose an intuitively appealing projective model to bridge the source and target model parameters. Specifically, we formulate parameter transfer in ELM networks by means of parameter projection, and train the model by optimizing the projection matrix and classifier parameters jointly. Furthermore, an L2,1-norm structured sparsity penalty is imposed on the source domain parameters, which encourages joint feature selection and parameter transfer. To evaluate the effectiveness of the proposed method, we present comprehensive experiments on several commonly used domain adaptation datasets. The results show that the proposed method significantly outperforms non-transfer ELM networks and other classical transfer learning methods.
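A minimal sketch of the projective transfer idea, assuming a simple gradient-based solver (the paper's actual optimization may differ): the target classifier weights are fitted to target data while being tied to the source weights through a learned projection matrix, and a row-wise L2,1 penalty (applied here to the target weights purely for illustration) encourages structured sparsity.

import torch

hidden, classes = 100, 10
H_t = torch.randn(200, hidden)            # target hidden-layer outputs (ELM random features)
Y_t = torch.randn(200, classes)           # target labels (one-hot in practice)
beta_s = torch.randn(hidden, classes)     # pretrained source ELM output weights (fixed)

P = torch.eye(hidden, requires_grad=True)                  # projection bridging source and target
beta_t = torch.zeros(hidden, classes, requires_grad=True)  # target output weights
opt = torch.optim.Adam([P, beta_t], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    fit = ((H_t @ beta_t - Y_t) ** 2).mean()          # fit the target data
    transfer = ((beta_t - P @ beta_s) ** 2).mean()    # projective parameter transfer term
    l21 = beta_t.norm(dim=1).sum()                    # row-wise L2,1 structured sparsity
    loss = fit + 0.1 * transfer + 1e-3 * l21
    loss.backward()
    opt.step()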
Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation
Chen, Chao, Chen, Zhihong, Jiang, Boyuan, Jin, Xinyu
Recently, considerable effort has been devoted to deep domain adaptation in the computer vision and machine learning communities. However, most existing work concentrates only on learning a shared feature representation by minimizing the distribution discrepancy across different domains. Because domain alignment approaches can only reduce, but not remove, the domain shift, target domain samples that lie near the edges of the clusters or far from their corresponding class centers are easily misclassified by the hyperplane learned from the source domain. To alleviate this issue, we propose to jointly perform domain alignment and discriminative feature learning, which benefits both domain alignment and the final classification. Specifically, we propose an instance-based discriminative feature learning method and a center-based discriminative feature learning method, both of which guarantee domain-invariant features with better intra-class compactness and inter-class separability. Extensive experiments show that learning discriminative features in the shared feature space can significantly boost the performance of deep domain adaptation methods.
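The center-based discriminative term can be sketched as a loss that pulls each feature toward its class center while pushing different class centers apart; the margin, weighting, and exact formulation here are illustrative assumptions rather than the paper's definition.

import torch

def center_based_loss(features, labels, centers, margin=1.0):
    # Intra-class compactness: distance of each feature to its own class center.
    intra = ((features - centers[labels]) ** 2).sum(dim=1).mean()
    # Inter-class separability: hinge penalty on centers closer than the margin.
    dists = torch.cdist(centers, centers)
    off_diag = dists[~torch.eye(len(centers), dtype=torch.bool)]
    inter = torch.clamp(margin - off_diag, min=0).mean()
    return intra + inter

# Usage with toy shapes: 8 samples, 128-d features, 5 classes.
features = torch.randn(8, 128)
labels = torch.randint(0, 5, (8,))
centers = torch.randn(5, 128, requires_grad=True)
loss = center_based_loss(features, labels, centers)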