Liu, Yudong
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Liu, Yudong, Sun, Jingwei, Lin, Yueqian, Zhang, Jingyang, Yin, Ming, Wang, Qinsi, Zhang, Jianyi, Li, Hai, Chen, Yiran
Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatiotemporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of both token pruning and keyframe selection. By adaptively assigning pruning rates based on each frame's relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capabilities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining performance. These results demonstrate our approach's effectiveness for efficient long-video processing, facilitating more scalable VLM deployment.
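The central mechanism, using each frame's query-relevance score to set that frame's pruning rate under a global token budget, can be sketched as follows. This is a minimal toy in plain Python, not the authors' implementation; `adaptive_prune_budget`, the relevance scores, and the 20% keep ratio are all hypothetical.

```python
def adaptive_prune_budget(relevance, tokens_per_frame, keep_ratio=0.2):
    """Distribute a global token budget across frames in proportion to
    each frame's query relevance (toy sketch, not the KVTP code)."""
    total_budget = keep_ratio * tokens_per_frame * len(relevance)
    norm = sum(relevance)
    kept = []
    for r in relevance:
        share = round(total_budget * r / norm)
        # keep at least 1 token per frame so context is never fully dropped
        kept.append(max(1, min(tokens_per_frame, share)))
    return kept

# frame 0 is highly query-relevant, frame 3 barely at all
kept = adaptive_prune_budget([0.9, 0.1, 0.5, 0.05], tokens_per_frame=100)
```

Relevant frames retain most of their tokens while near-irrelevant frames keep a token or two of context, rather than being discarded outright as in hard keyframe selection.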
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Lin, Yueqian, Fu, Yuzhe, Zhang, Jingyang, Liu, Yudong, Zhang, Jianyi, Sun, Jingwei, Li, Hai "Helen", Chen, Yiran
Speech Large Language Models (Speech LLMs) represent a significant advancement in speech language understanding and processing, as they leverage the contextual reasoning capabilities of large language models to process audio inputs [1]. Unlike traditional cascaded pipelines, where automatic speech recognition (ASR) and language modeling are handled by separate modules, Speech LLMs unify audio processing, cross-modal fusion, and language modeling in a single architecture [2]. These unified models can perform multiple tasks like speech recognition, speech translation, speaker identification, and emotion recognition. […] These benchmarks contain only short audio clips and thus do not reflect the complexity of achieving long-context understanding and extracting precise information from lengthy audio sequences. To systematically assess the unique challenges posed by SIR, we present SPIRAL (Speech Informational Retrieval and Lookup), a 1,012-sample benchmark specifically crafted to evaluate Speech LLM performance on long-form audio sequences (around 90 seconds in duration). At a high level, SPIRAL constructs SIR questions by embedding a critical piece of information within lengthy and potentially distracting dialogues, thereby assessing the model's ability to pinpoint and retrieve essential content from long-form inputs.
Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors
Chen, Siyuan, Guan, Zelong, Liu, Yudong, Gibbons, Phillip B.
Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU. In this paper, we present an offloading framework, LSP_Offload, that enables near-native speed LLM fine-tuning on commodity hardware through learned subspace projectors. Our data-driven approach involves learning an efficient sparse compressor that minimizes communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3 billion parameter model on a 4GB laptop GPU and a 7 billion parameter model on an NVIDIA RTX 4090 GPU with 24GB memory, achieving only a 31% slowdown compared to fine-tuning with unlimited memory. Compared to state-of-the-art offloading frameworks, our approach increases fine-tuning throughput by up to 3.33 times and reduces end-to-end fine-tuning time by 33.1%~62.5% when converging to the same accuracy.
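The projector idea can be illustrated with a toy example: tensors are compressed through a small projection before crossing the slow CPU-GPU link and decompressed on the other side, so only k numbers travel instead of d. This is a sketch with a hand-picked orthonormal projector `P`; the paper learns its projector from data, and none of these names or numbers come from it.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# hypothetical learned projector P (k=2 x d=4): here a fixed orthonormal
# basis spanning the first two coordinates, for illustration only
P = [[1, 0, 0, 0],
     [0, 1, 0, 0]]

grad = [3.0, -1.0, 0.0, 0.0]                 # gradient lying in the subspace
compressed = matvec(P, grad)                 # 4 floats -> 2 floats over PCIe
restored = matvec(transpose(P), compressed)  # reconstructed on the other side
```

When the gradient lies (approximately) in the learned subspace, reconstruction is (near-)lossless while communication volume drops by d/k, which is the trade-off the learned compressor optimizes.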
Yi: Open Foundation Models by 01.AI
01.AI: Young, Alex, Chen, Bei, Li, Chao, Huang, Chengen, Zhang, Ge, Zhang, Guanwei, Li, Heng, Zhu, Jiangcheng, Chen, Jianqun, Chang, Jing, Yu, Kaidong, Liu, Peng, Liu, Qiang, Yue, Shawn, Yang, Senbin, Yang, Shiming, Yu, Tao, Xie, Wen, Huang, Wenhao, Hu, Xiaohui, Ren, Xiaoyi, Niu, Xinyao, Nie, Pengcheng, Xu, Yuchi, Liu, Yudong, Wang, Yue, Cai, Yuxuan, Gu, Zhenyu, Liu, Zhiyuan, Dai, Zonghong
We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rates on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to their data quality, the result of our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small-scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that, given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Zhang, Ge, Du, Xinrun, Chen, Bei, Liang, Yiming, Luo, Tongxu, Zheng, Tianyu, Zhu, Kang, Cheng, Yuyang, Xu, Chunpu, Guo, Shuyue, Zhang, Haoran, Qu, Xingwei, Wang, Junjie, Yuan, Ruibin, Li, Yizhi, Wang, Zekun, Liu, Yudong, Tsai, Yu-Hsuan, Zhang, Fengji, Lin, Chenghua, Huang, Wenhao, Chen, Wenhu, Fu, Jie
As the capabilities of large multimodal models (LMMs) continue to advance, evaluating their performance becomes an increasingly pressing need. The gap is even larger for evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. Like its companion, CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LMMs and the proprietary GPT-4V(ision). Even GPT-4V achieves only 42% accuracy, indicating large room for improvement. CMMMU will push the community to build next-generation LMMs toward expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.
Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction
Li, Haozhe, Ma, Minghua, Liu, Yudong, Zhao, Pu, Zheng, Lingling, Li, Ze, Dang, Yingnong, Chintalapati, Murali, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
With the rapid growth of cloud computing, a variety of software services have been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on failure instance (disk, node, switch, etc.) prediction. Once the prediction output is positive, mitigation actions are taken to rapidly resolve the underlying failure. In our real-world practice in Microsoft Azure, we find that prediction accuracy may decrease by about 9% after retraining the models. The cause is that mitigation actions produce uncertain positive instances: because the underlying failure cannot be verified after mitigation, these instances introduce noise when the prediction model is updated. To the best of our knowledge, we are the first to identify this Uncertain Positive Learning (UPLearning) issue in the real-world cloud failure prediction scenario. To tackle this problem, we design an Uncertain Positive Learning Risk Estimator (Uptake) approach. Using two real-world datasets of disk failure prediction and conducting node prediction experiments in Microsoft Azure, a top-tier cloud provider that serves millions of users, we demonstrate that Uptake can significantly improve failure prediction accuracy, by 5% on average.
ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection
Chen, Yuhang, Zhang, Chaoyun, Ma, Minghua, Liu, Yudong, Ding, Ruomeng, Li, Bowen, He, Shilin, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
Anomaly detection in multivariate time series data is of paramount importance for ensuring the efficient operation of large-scale systems across diverse domains. However, accurately detecting anomalies in such data poses significant challenges, which existing approaches, including forecasting- and reconstruction-based methods, struggle to address effectively. To overcome these limitations, we propose a novel anomaly detection framework named ImDiffusion, which combines time series imputation and diffusion models to achieve accurate and robust anomaly detection. The imputation-based approach employed by ImDiffusion leverages the information from neighboring values in the time series, enabling precise modeling of temporal and inter-correlated dependencies and reducing uncertainty in the data, thereby enhancing the robustness of the anomaly detection process. ImDiffusion further leverages diffusion models as time series imputers to accurately capture complex dependencies. We use the step-by-step denoised outputs generated during the inference process as valuable signals for anomaly prediction, resulting in improved accuracy and robustness of the detection process. We evaluate the performance of ImDiffusion via extensive experiments on benchmark datasets. The results demonstrate that our proposed framework significantly outperforms state-of-the-art approaches in terms of detection accuracy and timeliness. ImDiffusion has further been integrated into a real production system at Microsoft, where we observe a remarkable 11.4% increase in detection F1 score compared to the legacy approach. To the best of our knowledge, ImDiffusion represents a pioneering approach that combines imputation-based techniques with time series anomaly detection, while introducing the novel use of diffusion models to the field.
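The imputation-based scoring idea can be sketched with a trivial neighbor-mean imputer standing in for the diffusion model: impute each point from its neighbors and score it by the gap between the observed and imputed value. This is illustrative only; `impute_neighbors` and the toy series are hypothetical stand-ins, not the ImDiffusion model.

```python
def impute_neighbors(series, i):
    # stand-in imputer: average of adjacent observed values
    # (ImDiffusion uses a diffusion model here; this is a toy substitute)
    left = series[i - 1] if i > 0 else series[i + 1]
    right = series[i + 1] if i < len(series) - 1 else series[i - 1]
    return (left + right) / 2

def anomaly_scores(series):
    # a point that the imputer cannot reproduce from its context is suspect
    return [abs(x - impute_neighbors(series, i)) for i, x in enumerate(series)]

series = [1.0, 1.1, 0.9, 9.0, 1.0, 1.05]   # index 3 is an injected anomaly
scores = anomaly_scores(series)
```

The injected spike is the hardest point to impute from its context, so it receives the largest score; thresholding these scores yields the anomaly decision.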
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Deng, Zihao, Ma, Yinghao, Liu, Yudong, Guo, Rongchen, Zhang, Ge, Chen, Wenhu, Huang, Wenhao, Benetos, Emmanouil
Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen Vicuna-7B language model (an adaptation of LLaMA), bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the Music Instruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
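A single projection layer of this kind amounts to one learned affine map from the music encoder's embedding space into the language model's embedding space. A 3-to-2 dimensional toy version, with made-up weights `W` and bias `b` (the real layer's dimensions and parameters come from MERT and Vicuna-7B, not from here), looks like:

```python
def project(audio_emb, W, b):
    """One affine map: language_space_vector = W @ audio_emb + b
    (toy stand-in for the single projection layer; W, b are hypothetical)."""
    return [sum(w_j * a for w_j, a in zip(row, audio_emb)) + bi
            for row, bi in zip(W, b)]

W = [[1.0, 0.0, 0.0],      # hypothetical learned weights (2 x 3)
     [0.0, 1.0, 1.0]]
b = [0.5, -0.5]            # hypothetical learned bias
projected = project([1.0, 2.0, 3.0], W, b)   # music embedding -> LM space
```

Only this small map is trained; both the music encoder producing `audio_emb` and the language model consuming `projected` stay frozen, which keeps training cheap.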
Diffusion-based Time Series Data Imputation for Microsoft 365
Yang, Fangkai, Yin, Wenjie, Wang, Lu, Li, Tianci, Zhao, Pu, Liu, Bo, Wang, Paul, Qiao, Bo, Liu, Yudong, Björkman, Mårten, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
Reliability is extremely important for large-scale cloud systems like Microsoft 365. Cloud failures such as disk failure and node failure threaten service reliability, resulting in online service interruptions and economic loss. Existing works focus on predicting cloud failures and proactively taking action before failures happen. However, they suffer from poor data quality, such as missing data during model training and prediction, which limits their performance. In this paper, we focus on enhancing data quality through data imputation with our proposed Diffusion+, a sample-efficient diffusion model that imputes missing data efficiently based on the observed data. Our experiments and application practice show that our model improves the performance of the downstream failure prediction task.
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Liang, Paul Pu, Lyu, Yiwei, Fan, Xiang, Tsaw, Jeffrey, Liu, Yudong, Mo, Shentong, Yogatama, Dani, Morency, Louis-Philippe, Salakhutdinov, Ruslan
Many real-world problems are inherently multimodal, from spoken language, gestures, and paralinguistics humans use to communicate, to force, proprioception, and visual sensors on robots. While there has been an explosion of interest in multimodal learning, these methods are focused on a small set of modalities primarily in language, vision, and audio. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. Since adding new models for every new modality becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? This paper proposes two new information theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar 2 modalities {X1,X2} are by measuring how much information can be transferred from X1 to X2, while (2) interaction heterogeneity studies how similarly pairs of modalities {X1,X2}, {X3,X4} interact by measuring how much information can be transferred from fusing {X1,X2} to {X3,X4}. We show the importance of these 2 proposed metrics as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning.
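The transfer-based heterogeneity measure can be illustrated with a 1-D least-squares stand-in: fit a linear map from one modality's features to another's and treat the residual error as a proxy for how poorly information transfers (low error = low heterogeneity = good candidates for parameter sharing). The paper's actual metrics are information-theoretic; `transfer_error` and the toy vectors below are hypothetical.

```python
def transfer_error(src, tgt):
    # fit the 1-D linear map tgt ~= a * src by least squares and return
    # the residual MSE; a low error means src transfers well to tgt
    a = sum(x * y for x, y in zip(src, tgt)) / sum(x * x for x in src)
    return sum((y - a * x) ** 2 for x, y in zip(src, tgt)) / len(src)

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]    # perfectly predictable from x1 (homogeneous)
x3 = [5.0, -1.0, 2.0, 0.5]   # unrelated modality (heterogeneous)
```

Ranking modality pairs by such a score is what lets a single model like HighMMT decide where sharing parameters across modalities is safe and where separate capacity is needed.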