Tian, Yu
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
Wang, Weizhi, Tian, Yu, Yang, Linjie, Wang, Heng, Yan, Xifeng
The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers
Chen, Yifang, Li, Xiaoyu, Liang, Yingyu, Shi, Zhenmei, Song, Zhao, Tian, Yu
The key-value (KV) cache in autoregressive transformers presents a significant bottleneck during inference, which restricts the context length capabilities of large language models (LLMs). While previous work analyzes the fundamental space complexity barriers in standard attention mechanism [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a novel reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. In the low dimensional regime where $d = o(\log n)$, we analyze the theoretical bounds of the space complexity as well. Overall, our work provides a theoretical foundation for us to understand the compression-expressivity tradeoff in tensor attention mechanisms and offers more perspectives in developing more memory-efficient transformer architectures.
Theoretical Guarantees for High Order Trajectory Refinement in Generative Flows
Gong, Chengyue, Li, Xiaoyu, Liang, Yingyu, Long, Jiangxuan, Shi, Zhenmei, Song, Zhao, Tian, Yu
Flow matching has emerged as a powerful framework for generative modeling, offering computational advantages over diffusion models by leveraging deterministic Ordinary Differential Equations (ODEs) instead of stochastic dynamics. While prior work established the worst case optimality of standard flow matching under Wasserstein distances, the theoretical guarantees for higher-order flow matching - which incorporates acceleration terms to refine sample trajectories - remain unexplored. In this paper, we bridge this gap by proving that higher-order flow matching preserves worst case optimality as a distribution estimator. We derive upper bounds on the estimation error for second-order flow matching, demonstrating that the convergence rates depend polynomially on the smoothness of the target distribution (quantified via Besov spaces) and key parameters of the ODE dynamics. Our analysis employs neural network approximations with carefully controlled depth, width, and sparsity to bound acceleration errors across both small and large time intervals, ultimately unifying these results into a general worst case optimal bound for all time steps.
RoboGrasp: A Universal Grasping Policy for Robust Robotic Control
Huang, Yiqi, Davies, Travis, Yan, Jiahuan, Chen, Xiang, Tian, Yu, Hu, Luhui
Imitation learning and world models have shown significant promise in advancing generalizable robotic learning, with robotic grasping remaining a critical challenge for achieving precise manipulation. Existing methods often rely heavily on robot arm state data and RGB images, leading to overfitting to specific object shapes or positions. To address these limitations, we propose RoboGrasp, a universal grasping policy framework that integrates pretrained grasp detection models with robotic learning. By leveraging robust visual guidance from object detection and segmentation tasks, RoboGrasp significantly enhances grasp precision, stability, and generalizability, achieving up to 34% higher success rates in few-shot learning and grasping box prompt tasks. Built on diffusion-based methods, RoboGrasp is adaptable to various robotic learning paradigms, enabling precise and reliable manipulation across diverse and complex scenarios. This framework represents a scalable and versatile solution for tackling real-world challenges in robotic grasping.
FairDiffusion: Enhancing Equity in Latent Diffusion Models via Fair Bayesian Perturbation
Luo, Yan, Khan, Muhammad Osama, Wen, Congcong, Afzal, Muhammad Muneeb, Wuermeling, Titus Fidelis, Shi, Min, Tian, Yu, Fang, Yi, Wang, Mengyu
Recent progress in generative AI, especially diffusion models, has demonstrated significant utility in text-to-image synthesis. Particularly in healthcare, these models offer immense potential in generating synthetic datasets and training medical students. However, despite these strong performances, it remains uncertain if the image generation quality is consistent across different demographic subgroups. To address this critical concern, we present the first comprehensive study on the fairness of medical text-to-image diffusion models. Our extensive evaluations of the popular Stable Diffusion model reveal significant disparities across gender, race, and ethnicity. To mitigate these biases, we introduce FairDiffusion, an equity-aware latent diffusion model that enhances fairness in both image generation quality as well as the semantic correlation of clinical features. In addition, we also design and curate FairGenMed, the first dataset for studying the fairness of medical generative models. Complementing this effort, we further evaluate FairDiffusion on two widely-used external medical datasets: HAM10000 (dermatoscopic images) and CheXpert (chest X-rays) to demonstrate FairDiffusion's effectiveness in addressing fairness concerns across diverse medical imaging modalities. Together, FairDiffusion and FairGenMed significantly advance research in fair generative learning, promoting equitable benefits of generative AI in healthcare.
AI-generated Image Quality Assessment in Visual Communication
Tian, Yu, Li, Yixuan, Chen, Baoliang, Zhu, Hanwei, Wang, Shiqi, Kwong, Sam
Assessing the quality of artificial intelligence-generated images (AIGIs) plays a crucial role in their application in real-world scenarios. However, traditional image quality assessment (IQA) algorithms primarily focus on low-level visual perception, while existing IQA works on AIGIs overemphasize the generated content itself, neglecting its effectiveness in real-world applications. To bridge this gap, we propose AIGI-VC, a quality assessment database for AI-Generated Images in Visual Communication, which studies the communicability of AIGIs in the advertising field from the perspectives of information clarity and emotional interaction. The dataset consists of 2,500 images spanning 14 advertisement topics and 8 emotion types. It provides coarse-grained human preference annotations and fine-grained preference descriptions, benchmarking the abilities of IQA methods in preference prediction, interpretation, and reasoning. We conduct an empirical study of existing representative IQA methods and large multi-modal models on the AIGI-VC dataset, uncovering their strengths and weaknesses.
Dirac-Equation Signal Processing: Physics Boosts Topological Machine Learning
Wang, Runyue, Tian, Yu, Liรฒ, Pietro, Bianconi, Ginestra
Topological signals are variables or features associated with both nodes and edges of a network. Recently, in the context of Topological Machine Learning, great attention has been devoted to signal processing of such topological signals. Most of the previous topological signal processing algorithms treat node and edge signals separately and work under the hypothesis that the true signal is smooth and/or well approximated by a harmonic eigenvector of the Hodge-Laplacian, which may be violated in practice. Here we propose Dirac-equation signal processing, a framework for efficiently reconstructing true signals on nodes and edges, also if they are not smooth or harmonic, by processing them jointly. The proposed physics-inspired algorithm is based on the spectral properties of the topological Dirac operator. It leverages the mathematical structure of the topological Dirac equation to boost the performance of the signal processing algorithm. We discuss how the relativistic dispersion relation obeyed by the topological Dirac equation can be used to assess the quality of the signal reconstruction. Finally, we demonstrate the improved performance of the algorithm with respect to previous algorithms. Specifically, we show that Dirac-equation signal processing can also be used efficiently if the true signal is a non-trivial linear combination of more than one eigenstate of the Dirac equation, as it generally occurs for real signals.
Spatially Visual Perception for End-to-End Robotic Learning
Davies, Travis, Yan, Jiahuan, Chen, Xiang, Tian, Yu, Zhuang, Yueting, Huang, Yiqi, Hu, Luhui
Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on handling lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models experience performance collapse. Our findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Zeng, Xinyi, Shang, Yuying, Zhu, Yutao, Chen, Jiawei, Tian, Yu
Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful responses from the prefill-level lacks utilization of the model's decoding outputs, leading to relatively lower effectiveness and robustness. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Our novel decoder-oriented, step-bystep defense architecture corrects harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost secure decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model's ability to discern hazardous information, maintaining its helpfulness compared to existing methods. In recent years, significant progress has been made in developing large language models (LLMs). Meanwhile, the safety of LLMs has attracted significant attention from the research community and industry (Weidinger et al., 2021; Achiam et al., 2023; Wu et al., 2023b). One of the primary safety concerns is jailbreaking, where malicious actors or errant inputs prompt LLMs to produce harmful or inappropriate content, effectively bypassing ethical guidelines.
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Shang, Yuying, Zeng, Xinyi, Zhu, Yutao, Yang, Xiao, Fang, Zhengwei, Zhang, Jingyuan, Chen, Jiawei, Liu, Zinan, Tian, Yu
Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not presented in the visual input, which impairs their reliability. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to effectively extract or decouple visual features. In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling). Motivated by our findings on the preliminary investigation, we propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, utilizing adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations caused by insufficient decoupling of visual features. PATCH achieves state-of-the-art performance on multiple multi-modal hallucination datasets. We hope this approach provides researchers with deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancements and innovation in this field. Large vision-language models (LVLMs) have demonstrated remarkable performance across a broad range of tasks, even surpassing human capabilities in specific scenarios (Xu et al., 2023; Li et al., 2023a; Zhang et al., 2024a). However, their practical applications are hindered by multi-modal hallucinations, where models generate factually incorrect, inconsistent, or entirely fictitious outputs when interpreting visual features.