Xie, Hongtao
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Liu, Zhihang, Xie, Chen-Wei, Li, Pandeng, Zhao, Liming, Tang, Longxiang, Zheng, Yun, Liu, Chuanbin, Xie, Hongtao
Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead of massive video frames, which is often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, so existing strategies (e.g., average pooling) inevitably lose potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), which uses the instruction as a condition to guide compression at both the local and global levels. This encourages the compression to retain as much user-focused information as possible while reducing visual tokens to minimize the computational burden. Specifically, the instruction condition is injected into grouped visual tokens at the local level and into learnable tokens at the global level, and an attention mechanism completes the conditional compression. Through this hybrid-level compression, instruction-relevant visual parts are highlighted while the temporal-spatial structure is preserved, making the result easier for LLMs to understand. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that HICom achieves outstanding video understanding with fewer tokens, improving performance by 2.43% on average across three multiple-choice QA benchmarks while using 78.8% fewer tokens than the SOTA method. The code is available at https://github.com/lntzm/HICom.
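For readers who want a concrete picture of instruction-conditioned token compression, the following is a minimal PyTorch sketch, assuming a pooled instruction embedding, grouped visual tokens at the local level, and learnable query tokens at the global level; module sizes and the exact injection scheme are illustrative assumptions, not HICom's implementation.

```python
# Hedged sketch (not the authors' code): conditional token compression via
# cross-attention, where a pooled instruction embedding conditions both a
# local (per-group) and a global (learnable-token) compression path.
import torch
import torch.nn as nn

class ConditionalCompressor(nn.Module):
    def __init__(self, dim=768, group_size=4, num_global=16, heads=8):
        super().__init__()
        self.group_size = group_size
        self.global_queries = nn.Parameter(torch.randn(num_global, dim))
        self.inst_proj = nn.Linear(dim, dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual, instruction):
        # visual: (B, N, D) video tokens; instruction: (B, L, D) text tokens
        B, N, D = visual.shape
        cond = self.inst_proj(instruction.mean(dim=1, keepdim=True))  # (B, 1, D)

        # Local level: split tokens into groups, inject the condition, and
        # compress each group into a single token with attention pooling.
        groups = visual.view(B * N // self.group_size, self.group_size, D)
        g_cond = cond.repeat_interleave(N // self.group_size, dim=0)
        local, _ = self.local_attn(g_cond, groups + g_cond, groups)
        local = local.view(B, N // self.group_size, D)

        # Global level: instruction-conditioned learnable tokens attend over
        # all visual tokens to gather instruction-relevant global context.
        q = self.global_queries.unsqueeze(0).expand(B, -1, -1) + cond
        global_, _ = self.global_attn(q, visual, visual)

        return torch.cat([local, global_], dim=1)  # far fewer tokens than N

if __name__ == "__main__":
    comp = ConditionalCompressor()
    out = comp(torch.randn(2, 256, 768), torch.randn(2, 32, 768))
    print(out.shape)  # (2, 256/4 + 16, 768) = (2, 80, 768)
```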
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs
Liu, Zhihang, Xie, Chen-Wei, Wen, Bin, Yu, Feiwu, Chen, Jixuan, Zhang, Boqiang, Yang, Nianzu, Li, Pandeng, Zheng, Yun, Xie, Hongtao
Recent advancements in Multimodal Large Language Models (MLLMs) have rendered traditional visual captioning benchmarks obsolete, as they primarily evaluate short descriptions with outdated metrics. While recent benchmarks address these limitations by decomposing captions into visual elements and adopting model-based evaluation, they remain incomplete, overlooking critical aspects while providing vague, non-explanatory scores. To bridge this gap, we propose CV-CapBench, a Comprehensive Visual Caption Benchmark that systematically evaluates caption quality across 6 views and 13 dimensions. CV-CapBench introduces precision, recall, and hit rate metrics for each dimension, uniquely assessing both correctness and coverage. Experiments on leading MLLMs reveal significant capability gaps, particularly in dynamic and knowledge-intensive dimensions. These findings provide actionable insights for future research. The code and data will be released.
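As a rough illustration of dimension-wise correctness and coverage scoring, the sketch below assumes captions are already decomposed into (dimension, element) pairs and that a pluggable match function (e.g., an LLM judge or string similarity) decides element equivalence; the actual CV-CapBench taxonomy and judging protocol may differ.

```python
# Hedged sketch: per-dimension precision / recall / hit rate over decomposed
# caption elements. Metric definitions here are illustrative assumptions.
from collections import defaultdict

def score_dimensions(pred_elems, ref_elems, match):
    """pred_elems / ref_elems: lists of (dimension, element) pairs.
    match(p, r) -> bool decides whether a predicted element covers a
    reference element."""
    by_dim_pred, by_dim_ref = defaultdict(list), defaultdict(list)
    for d, e in pred_elems:
        by_dim_pred[d].append(e)
    for d, e in ref_elems:
        by_dim_ref[d].append(e)

    results = {}
    for d in set(by_dim_pred) | set(by_dim_ref):
        preds, refs = by_dim_pred[d], by_dim_ref[d]
        correct = sum(any(match(p, r) for r in refs) for p in preds)
        covered = sum(any(match(p, r) for p in preds) for r in refs)
        results[d] = {
            "precision": correct / len(preds) if preds else 0.0,  # correctness
            "recall": covered / len(refs) if refs else 0.0,       # coverage
            "hit": 1.0 if covered > 0 else 0.0,  # dimension covered at all
        }
    return results

if __name__ == "__main__":
    pred = [("object", "a red car"), ("action", "driving fast")]
    ref = [("object", "red car"), ("object", "two pedestrians"), ("action", "driving")]
    print(score_dimensions(pred, ref, lambda p, r: r.split()[-1] in p))
```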
A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions
Wang, Jiankang, Xu, Jianjun, Wang, Xiaorui, Wang, Yuxin, Xing, Mengting, Fang, Shancheng, Chen, Zhineng, Xie, Hongtao, Zhang, Yongdong
Synthesizing high-quality reasoning data for continual training has proven effective in enhancing the performance of Large Language Models (LLMs). However, previous synthetic approaches struggle to scale up data easily and incur high costs in the pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extract knowledge points from seed data and construct a knowledge point relationship graph to explore their interconnections. By exploring the implicit relationships among knowledge points, our method achieves 255× data expansion. Furthermore, GSDP, led by open-source models, achieves synthesis quality comparable to GPT-4-0613 while maintaining 100× lower costs. To tackle the most challenging mathematical reasoning task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B, based on Mistral-7B, achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models trained in this paper will be available.

Despite the remarkable capabilities large language models (LLMs) have demonstrated in various linguistic tasks, significant gaps remain in their ability to comprehend and solve intricate reasoning tasks (e.g., mathematics, coding, physics, and chemistry). One effective approach to bridging these gaps is using large-scale, high-quality synthetic data. However, developing a low-cost and effective synthesis pipeline remains a challenge. Take mathematics as an example: the two main approaches for building high-quality mathematical reasoning datasets are data filtering and data synthesis. Data filtering (Yue et al., 2024b; Shao et al., 2024; Ying et al., 2024) involves extracting data from pre-training corpora such as Common Crawl and rewriting it using advanced commercial models or human annotation.
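The following sketch illustrates the graph-based expansion idea under simple assumptions: knowledge points that co-occur in seed problems are linked, and both direct and two-hop (implicit) pairs become prompts for a generator; the function names and the generation step are placeholders, not the GSDP pipeline itself.

```python
# Hedged sketch of the graph-based idea: mine knowledge points from seed
# problems, link points that co-occur, then sample connected combinations
# (including implicit two-hop relations) as prompts for a generator model.
import itertools
from collections import defaultdict

def build_kp_graph(seed_problems):
    """seed_problems: list of sets of knowledge points, one set per problem."""
    graph = defaultdict(set)
    for kps in seed_problems:
        for a, b in itertools.combinations(sorted(kps), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def candidate_combinations(graph, max_pairs=None):
    """Yield knowledge-point pairs that are directly linked or share a
    neighbour (an implicit relationship), which drives the data expansion."""
    seen = set()
    for a in graph:
        reachable = set(graph[a])
        for n in graph[a]:
            reachable |= graph[n]          # two-hop (implicit) relations
        reachable.discard(a)
        for b in reachable:
            pair = tuple(sorted((a, b)))
            if pair not in seen:
                seen.add(pair)
                yield pair
                if max_pairs and len(seen) >= max_pairs:
                    return

if __name__ == "__main__":
    seeds = [{"quadratic equations", "factoring"},
             {"factoring", "polynomial roots"},
             {"polynomial roots", "Vieta's formulas"}]
    g = build_kp_graph(seeds)
    for pair in candidate_combinations(g, max_pairs=5):
        # In the full pipeline, each pair would be sent to an open-source LLM
        # to synthesise a new (problem, answer) pair.
        print("synthesise a problem combining:", pair)
```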
Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation
Bai, Yu, Miao, Yukai, Chen, Li, Li, Dan, Ren, Yanyu, Xie, Hongtao, Yang, Ce, Cai, Xuhui
In Greek mythology, Pistis symbolized good faith, trust, and reliability. Drawing inspiration from these principles, Pistis-RAG is a scalable multi-stage framework designed to address the challenges of large-scale retrieval-augmented generation (RAG) systems. This framework consists of distinct stages: matching, pre-ranking, ranking, reasoning, and aggregating. Each stage contributes to narrowing the search space, prioritizing semantically relevant documents, aligning with the large language model's (LLM) preferences, supporting complex chain-of-thought (CoT) methods, and combining information from multiple sources. Our ranking stage introduces a significant innovation by recognizing that semantic relevance alone may not lead to improved generation quality, due to LLMs' sensitivity to few-shot prompt order, as noted in previous research. This critical aspect is often overlooked in current RAG frameworks. We argue that the alignment issue between LLMs and external knowledge ranking methods is tied to the model-centric paradigm dominant in RAG systems. We propose a content-centric approach, emphasizing seamless integration between LLMs and external information sources to optimize content transformation for specific tasks. Our novel ranking stage is designed specifically for RAG systems, incorporating principles of information retrieval while considering the unique business scenarios reflected in LLM preferences and user feedback. We simulated feedback signals on the MMLU benchmark, resulting in a 9.3% performance improvement. Our model and code will be open-sourced on GitHub. Additionally, experiments on real-world, large-scale data validate the scalability of our framework.
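A minimal sketch of the cascading structure is given below, assuming each stage is a function that narrows a candidate list; the placeholder stages only illustrate the control flow, not Pistis-RAG's matching, ranking, or feedback-alignment components.

```python
# Hedged sketch of a cascading RAG pipeline with the five stages named in the
# abstract (matching, pre-ranking, ranking, reasoning, aggregating). The stage
# bodies are trivial placeholders, not Pistis-RAG's actual components.
from typing import Callable, List

class CascadeRAG:
    def __init__(self, stages: List[Callable]):
        self.stages = stages  # each stage: (query, candidates) -> candidates

    def run(self, query: str, corpus: List[str]) -> List[str]:
        candidates = corpus
        for stage in self.stages:
            candidates = stage(query, candidates)  # progressively narrow
        return candidates

def matching(q, docs):      # cheap recall over the whole corpus
    return [d for d in docs if any(w in d.lower() for w in q.lower().split())]

def pre_ranking(q, docs):   # lightweight scoring, keep top-k
    return sorted(docs, key=lambda d: -sum(d.lower().count(w)
                                           for w in q.lower().split()))[:10]

def ranking(q, docs):       # content-centric re-ordering; in Pistis-RAG this
    return docs[:5]         # stage would align order with LLM/user feedback

def reasoning(q, docs):     # e.g., chain-of-thought filtering
    return docs[:3]

def aggregating(q, docs):   # combine evidence into the final context
    return ["\n".join(docs)]

if __name__ == "__main__":
    rag = CascadeRAG([matching, pre_ranking, ranking, reasoning, aggregating])
    print(rag.run("faith and trust",
                  ["Pistis symbolized good faith and trust.",
                   "Unrelated document."]))
```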
DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
Sun, Yuhao, Yu, Lingyun, Xie, Hongtao, Li, Jiaming, Zhang, Yongdong
With the rapid development of face recognition (FR) systems, the privacy of face images on social media faces severe challenges due to the abuse of unauthorized FR systems. Some studies utilize adversarial attack techniques to defend against malicious FR systems by generating adversarial examples. However, the generated adversarial examples, i.e., the protected face images, tend to suffer from subpar visual quality and low transferability. In this paper, we propose a novel face protection approach, dubbed DiffAM, which leverages the powerful generative ability of diffusion models to generate high-quality protected face images with adversarial makeup transferred from reference images. To be specific, we first introduce a makeup removal module to generate non-makeup images using a fine-tuned diffusion model with the guidance of textual prompts in CLIP space. As the inverse process of makeup transfer, makeup removal makes it easier to establish a deterministic relationship between the makeup and non-makeup domains without relying on elaborate text prompts. Then, with this relationship, a CLIP-based makeup loss along with an ensemble attack strategy is introduced to jointly guide the direction of the adversarial makeup domain, achieving the generation of protected face images with natural-looking makeup and high black-box transferability. Extensive experiments demonstrate that DiffAM achieves higher visual quality and attack success rates, with a gain of 12.98% under the black-box setting compared with the state of the art. The code will be available at https://github.com/HansSunY/DiffAM.
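To illustrate the kind of CLIP-space guidance described above, here is a minimal sketch of a directional makeup loss, assuming access to a frozen image encoder; DiffAM's actual losses, fine-tuning procedure, and ensemble attack objective are more involved.

```python
# Hedged sketch of a CLIP-style directional makeup loss: the change from the
# non-makeup image to the generated image should align, in embedding space,
# with the change from the non-makeup reference to the makeup reference.
# The encoder below is a placeholder, not the real CLIP image encoder.
import torch
import torch.nn.functional as F

def directional_makeup_loss(enc, src_img, gen_img, ref_plain, ref_makeup):
    """enc: any image encoder mapping (B, C, H, W) -> (B, D), e.g. CLIP's."""
    d_img = F.normalize(enc(gen_img) - enc(src_img), dim=-1)
    d_ref = F.normalize(enc(ref_makeup) - enc(ref_plain), dim=-1)
    return (1 - (d_img * d_ref).sum(dim=-1)).mean()  # cosine misalignment

if __name__ == "__main__":
    # Toy encoder standing in for a frozen CLIP image encoder.
    enc = lambda x: x.flatten(1).mean(dim=1, keepdim=True).repeat(1, 8)
    imgs = [torch.rand(2, 3, 64, 64) for _ in range(4)]
    print(directional_makeup_loss(enc, *imgs).item())
```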
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
Lu, Zhiying, Xie, Hongtao, Liu, Chuanbin, Zhang, Yongdong
There remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors; however, the lack of data hinders ViTs from attending to this spatial relevance. Second, on the channel aspect, representations exhibit diversity across different channels, but scarce data prevents ViTs from learning representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that enhances these two inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and multi-layer perceptron modules, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand new "head token" design in the multi-head self-attention module to help re-calibrate channel representations and make different channel group representations interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves state-of-the-art performance with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
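A minimal sketch of the spatial-side design is given below, assuming a convolutional patch-embedding stem and a depthwise convolution inside the MLP so that each token also mixes neighboring features; the channel-side modules (dynamic feature aggregation and head tokens) are omitted, and the dimensions are illustrative rather than the DHVT configuration.

```python
# Hedged sketch of the two spatial-side ideas: convolution in the patch
# embedding and a depthwise convolution inside the MLP, so each token also
# mixes information from its spatial neighbours.
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.stem = nn.Sequential(               # overlapping conv stem instead
            nn.Conv2d(3, dim // 2, 3, 2, 1),     # of a single large-stride
            nn.BatchNorm2d(dim // 2), nn.GELU(), # linear projection
            nn.Conv2d(dim // 2, dim, 3, 2, 1),
        )

    def forward(self, x):                        # (B, 3, H, W) -> (B, N, D)
        x = self.stem(x)
        return x.flatten(2).transpose(1, 2)

class ConvMLP(nn.Module):
    def __init__(self, dim=192, hidden=384, hw=56):
        super().__init__()
        self.hw = hw                             # assumes N == hw * hw tokens
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, 1, 1, groups=hidden)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (B, N, dim) tokens
        B, N, _ = x.shape
        h = self.fc1(x)
        h2 = h.transpose(1, 2).reshape(B, -1, self.hw, self.hw)
        h = self.act(self.dwconv(h2)).flatten(2).transpose(1, 2) + h
        return self.fc2(self.act(h))

if __name__ == "__main__":
    tokens = ConvPatchEmbed()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)                          # (2, 3136, 192)
    print(ConvMLP()(tokens).shape)               # (2, 3136, 192)
```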
CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition
Zheng, Tianlun, Chen, Zhineng, Fang, Shancheng, Xie, Hongtao, Jiang, Yu-Gang
The attention-based encoder-decoder framework is becoming popular in scene text recognition, largely due to its superiority in integrating recognition clues from both the visual and semantic domains. However, recent studies show that the two clues can be misaligned for difficult text (e.g., with rare text shapes), and they introduce constraints such as character position to alleviate the problem. Despite certain success, a content-free positional embedding can hardly associate stably with meaningful local image regions. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position encoding. MDCDP uses the positional embedding to query both visual and semantic features following the attention mechanism, naturally encoding a positional clue that describes both visual and semantic distances among characters. We develop a novel architecture named CDistNet that stacks MDCDP several times to guide precise distance modeling. Thus, the visual-semantic alignment is well established even when various difficulties are present. We apply CDistNet to two augmented datasets and six public benchmarks. The experiments demonstrate that CDistNet achieves state-of-the-art recognition accuracy, and the visualizations show that it attains proper attention localization in both the visual and semantic domains. We will release our code upon acceptance.
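The core MDCDP idea, a positional embedding that queries both visual and semantic features through attention, can be sketched as below; the fusion, masking, and iterative stacking details of the actual CDistNet are simplified away, and the dimensions are assumptions.

```python
# Hedged sketch of the MDCDP idea as described: positional queries attend to
# visual features and to semantic (character) features separately, and the
# two results are fused into a position-aware representation.
import torch
import torch.nn as nn

class MDCDPBlock(nn.Module):
    def __init__(self, dim=512, heads=8, max_len=26):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(max_len, dim))  # positional queries
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual, semantic):
        # visual: (B, HW, D) image features; semantic: (B, T, D) char embeddings
        B = visual.size(0)
        q = self.pos.unsqueeze(0).expand(B, -1, -1)
        v_feat, _ = self.vis_attn(q, visual, visual)     # where each character is
        s_feat, _ = self.sem_attn(q, semantic, semantic) # what surrounds it
        return self.fuse(torch.cat([v_feat, s_feat], dim=-1))

if __name__ == "__main__":
    block = MDCDPBlock()
    out = block(torch.randn(2, 8 * 32, 512), torch.randn(2, 26, 512))
    print(out.shape)  # (2, 26, 512)
```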