Yuan, Tao
Megrez-Omni Technical Report
Li, Boxun, Li, Yadong, Li, Zhiyuan, Liu, Congyi, Liu, Weilin, Niu, Guowei, Tan, Zheyue, Xu, Haiyang, Yao, Zhuyu, Yuan, Tao, Zhou, Dong, Zhuang, Yueqing, Yan, Shengen, Dai, Guohao, Wang, Yu
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Gao, Zhi, Zhang, Bofei, Li, Pengxiang, Ma, Xiaojian, Yuan, Tao, Fan, Yue, Wu, Yuwei, Jia, Yunde, Zhu, Song-Chun, Li, Qing
[Figure 1: Two example queries (counting the children in a photo to buy a PS5 for each, and totaling the January price of an NVIDIA GeForce RTX 4070 SUPER for each man in a picture) with the agents' thought-action-observation trajectories. Our agent chooses more precise tools based on the given files and intermediate observations.]

The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as controllers to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, including MiniCPM-V-8.5B. Integrating external tools to solve diverse multi-modal tasks is a promising research direction towards multi-modal agents (Surís et al., 2023; Gupta & Kembhavi, 2023; Gao et al., 2024; Yuan et al., 2024; Zhong et al., 2023). Existing agents usually use a large language model (LLM) as the controller that generates plans via prompt engineering to call tools, achieving impressive performance in multiple domains, such as image editing (Wu et al., 2023), robotic manipulation (ichter et al., 2023), question answering (Shen et al., 2024), video understanding (Fan et al., 2024), and desktop apps (Trivedi et al., 2024). Despite their success, prompt engineering offers limited reasoning ability for tool usage in tackling practical tasks, as shown in Figure 1.
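Although the paper's exact controller interface is not reproduced here, the abstract's core idea, a VLM acting as a controller that interleaves thoughts, tool calls, and observations, can be sketched as a simple loop. The tool registry, the Thought/Action step format, and `call_vlm_controller` below are hypothetical placeholders, not the T3-Agent's actual implementation.

```python
# Minimal sketch of a VLM-as-controller tool-usage loop in the spirit of the
# abstract. All names (call_vlm_controller, TOOLS, the step format) are
# illustrative assumptions rather than the authors' interface.
from typing import Callable, Dict

# Hypothetical tool registry: each tool maps an argument dict to a textual
# observation that is fed back to the controller.
TOOLS: Dict[str, Callable[[dict], str]] = {
    "facedetection": lambda args: "4 bounding boxes",      # stubbed output
    "web_search":    lambda args: "RTX 4070 SUPER: $599",  # stubbed output
}

def call_vlm_controller(query: str, history: list) -> dict:
    """Placeholder for the tuned VLM. A real controller would condition on the
    query, attached files, and the thought/action/observation history and emit
    the next step. Here we simply return a canned final answer."""
    return {"thought": "Answer directly.", "action": "final_answer",
            "arguments": {"text": "(final answer would go here)"}}

def run_agent(query: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        step = call_vlm_controller(query, history)
        history.append(step)
        if step["action"] == "final_answer":
            return step["arguments"]["text"]
        # Otherwise call the selected tool and record its observation.
        observation = TOOLS[step["action"]](step["arguments"])
        history.append({"observation": observation})
    return "max steps reached"

print(run_agent("I want to buy a PS5 for each child in the photo."))
```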
Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs
Zhao, Kang, Yuan, Tao, Bao, Han, Su, Zhenfeng, Gao, Chang, Sun, Zhaofeng, Liang, Zichen, Jing, Liping, Chen, Jianfei
To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often yields low actual speedups ($\leq 1.3$) and requires a fixed sparse ratio, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not bring any speedup on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pretraining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers: heuristic V and M selection, V:N:M-specific channel permutation, and three-stage LoRA training. Experimental results show that, with our methods, DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned LLaMA2-7B at 64:2:5 sparsity performs comparably to or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs than 2:4 sparsity. Overall, our exploration makes V:N:M sparsity a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
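For context, here is a minimal sketch of the fixed 2:4 pattern the abstract contrasts against: in every contiguous group of four weights, only the two largest-magnitude entries are kept, giving exactly 50% sparsity. This is standard magnitude-based N:M pruning, not the paper's V:N:M method, whose block-level pattern and choice of V and M are the subject of the study.

```python
# Illustrative sketch of 2:4 magnitude pruning: keep the 2 largest-magnitude
# entries in each group of 4 along the last axis (length must be a multiple of 4).
import numpy as np

def prune_2_to_4(weights: np.ndarray) -> np.ndarray:
    groups = weights.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.random.randn(2, 8)
print(prune_2_to_4(w))  # each group of 4 now has exactly 2 nonzeros
```

V:N:M sparsity relaxes this fixed 50% ratio to block-level patterns, which is what makes the V and M selection discussed in the abstract necessary.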
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
Yuan, Tao, Ning, Xuefei, Zhou, Dong, Yang, Zhijie, Li, Shiyao, Zhuang, Minghui, Tan, Zheyue, Yao, Zhuyu, Lin, Dahua, Li, Boxun, Dai, Guohao, Yan, Shengen, Wang, Yu
State-of-the-art large language models (LLMs) now claim remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval incorporates three key techniques: confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluation. We evaluate 10 LLMs on LV-Eval and conduct ablation studies on the techniques used in its construction. The results reveal that: (i) Commercial LLMs generally outperform open-source LLMs when evaluated within length levels shorter than their claimed context length; however, their overall performance is surpassed by open-source LLMs with longer context lengths. (ii) Extremely long-context LLMs, such as Yi-6B-200k, exhibit relatively gentle performance degradation, but their absolute performance is not necessarily higher than that of LLMs with shorter context lengths. (iii) LLMs' performance can degrade significantly in the presence of confusing information, especially in the "needle in a haystack" pressure test. (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation code are released at: https://github.com/infinigence/LVEval.
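As an illustration of the keyword-recall-based metric design mentioned above, the sketch below scores a prediction by the fraction of annotated answer keywords it contains. The keyword list and scoring rule are simplified assumptions; the benchmark's released evaluation code defines the actual metric.

```python
# Minimal sketch of a keyword-recall-based metric: the score for a prediction
# is the fraction of annotated answer keywords it contains. The keywords below
# are a made-up example; LV-Eval's released code defines the real metric.
def keyword_recall(prediction: str, keywords: list[str]) -> float:
    pred = prediction.lower()
    hits = sum(1 for kw in keywords if kw.lower() in pred)
    return hits / len(keywords) if keywords else 0.0

gold_keywords = ["1969", "Apollo 11"]  # hypothetical annotation
print(keyword_recall("Apollo 11 landed in 1969.", gold_keywords))  # 1.0
print(keyword_recall("It landed in the 1960s.", gold_keywords))    # 0.0
```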
Structured Attention for Unsupervised Dialogue Structure Induction
Qiu, Liang, Zhao, Yizhou, Shi, Weiyan, Liang, Yuan, Shi, Feng, Yuan, Tao, Yu, Zhou, Zhu, Song-Chun
Inducing a meaningful structural representation from one or a set of dialogues is a crucial but challenging task in computational linguistics. Advances in this area are critical for dialogue system design and discourse analysis, and can also be extended to solve grammatical inference. In this work, we propose to incorporate structured attention layers into a Variational Recurrent Neural Network (VRNN) model with discrete latent states to learn dialogue structure in an unsupervised fashion. Compared to a vanilla VRNN, structured attention enables the model to focus on different parts of the source sentence embeddings while enforcing a structural inductive bias. Experiments show that on two-party dialogue datasets, the VRNN with structured attention learns semantic structures similar to the templates used to generate the dialogue corpus, while on multi-party dialogue datasets, our model learns an interactive structure, demonstrating its capability to distinguish speakers or addressees and automatically disentangle dialogues without explicit human annotation.
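To make the induced "dialogue structure" concrete, the sketch below shows one common way such a structure is summarized once each utterance has been assigned a discrete latent state: as a transition-probability matrix over states. The state-inference step itself (the structured-attention VRNN) is outside this sketch, and the state sequences below are invented for illustration.

```python
# Illustrative sketch: summarize a dialogue structure as a transition matrix
# over discrete latent states. The state sequences are assumed to come from a
# model such as the paper's structured-attention VRNN; here they are made up.
import numpy as np

def transition_structure(state_sequences: list[list[int]], num_states: int) -> np.ndarray:
    counts = np.zeros((num_states, num_states))
    for states in state_sequences:
        for prev, curr in zip(states, states[1:]):
            counts[prev, curr] += 1
    # Row-normalize into transition probabilities, guarding against empty rows.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Two toy dialogues whose utterances were mapped to latent states 0..3.
print(transition_structure([[0, 1, 2, 3], [0, 1, 1, 3]], num_states=4))
```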
Scene-Centric Joint Parsing of Cross-View Videos
Qi, Hang (University of California, Los Angeles) | Xu, Yuanlu (University of California, Los Angeles) | Yuan, Tao (University of California, Los Angeles) | Wu, Tianfu (NC State University) | Zhu, Song-Chun (University of California, Los Angeles)
Cross-view video understanding is an important yet under-explored area in computer vision. In this paper, we introduce a joint parsing framework that integrates view-centric proposals into scene-centric parse graphs, which represent a coherent scene-centric understanding of cross-view scenes. Our key observations are that overlapping fields of view embed rich appearance and geometry correlations, and that knowledge fragments corresponding to individual vision tasks are governed by consistency constraints available in commonsense knowledge. The proposed joint parsing framework represents such correlations and constraints explicitly and generates semantic scene-centric parse graphs. Quantitative experiments show that scene-centric predictions in the parse graph outperform view-centric predictions.
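As a rough illustration of the scene-centric versus view-centric distinction, the sketch below models a parse-graph entity that aggregates per-view proposals as evidence. The class and field names are assumptions made for illustration, not the paper's actual representation.

```python
# Illustrative data-structure sketch: a scene-centric entity groups the
# view-centric proposals (from overlapping cameras) that refer to the same
# physical object, so predictions can be cross-checked across views.
from dataclasses import dataclass, field

@dataclass
class ViewProposal:
    view_id: str    # which camera the proposal comes from
    label: str      # e.g. "person"
    bbox: tuple     # (x, y, w, h) in that view's image plane
    score: float    # detector confidence

@dataclass
class SceneEntity:
    entity_id: int
    attributes: dict = field(default_factory=dict)  # merged, scene-level labels
    supports: list = field(default_factory=list)    # ViewProposal evidence

p1 = ViewProposal("cam_a", "person", (10, 20, 40, 90), 0.9)
p2 = ViewProposal("cam_b", "person", (200, 35, 38, 85), 0.8)
entity = SceneEntity(entity_id=0, attributes={"category": "person"}, supports=[p1, p2])
print(entity)
```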