Wang, Jiahao
LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models
Yang, Runming, Wu, Taiqiang, Wang, Jiahao, Hu, Pengfei, Wong, Ngai, Yang, Yujiu
In this paper, we propose LLM-Neo, a novel framework that efficiently transfers knowledge from a large language model (LLM) teacher to a compact student. We first revisit knowledge distillation (KD) and low-rank adaptation (LoRA) and argue that they share the same paradigm. Inspired by this observation, we explore a strategy that combines LoRA and KD to enhance the efficiency of knowledge transfer. We summarize guidelines for this design and develop LLM-Neo accordingly. Experimental results on compressing Llama 2 and Llama 3 show that LLM-Neo outperforms various baselines. Further analysis demonstrates the robustness of LLM-Neo across LoRA variants. The trained models are available at https://huggingface.co/collections/yang31210999/llm-neo-66e3c882f5579b829ff57eba.
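A minimal sketch of the combination the abstract describes: a training step in which a KD loss supervises a student whose only trainable parameters are LoRA adapters. The function name `llm_neo_step` and the `temperature`/`alpha` hyperparameters are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def llm_neo_step(student, teacher, batch, temperature=2.0, alpha=0.5):
    # Only the student's LoRA parameters are trainable; the teacher and the
    # student backbone stay frozen, so the KD gradient flows into the adapters.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    out = student(**batch)                      # batch includes labels for the CE term
    kd = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * kd + (1 - alpha) * out.loss  # blended distillation objective
```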
Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model
Wang, Jiahao, Shalaby, Amer
Users of transit systems flood social networks daily with messages containing valuable insights for improving service quality. These posts help transit agencies quickly identify emerging issues. Parsing topics and sentiments is key to gaining comprehensive insights and fostering service excellence. However, the volume of messages makes manual analysis impractical, and standard NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) fall short of nuanced interpretation. Traditional sentiment analysis separates topics and sentiments before integrating them, often missing the interaction between the two; this incremental approach complicates classification and reduces analytical productivity. To address these challenges, we propose a novel approach to extracting and analyzing transit-related information from social media, including sentiment and sarcasm detection, identification of unusual system problems, and location data. Our method employs Large Language Models (LLMs), specifically Llama 3, for a streamlined analysis free from pre-established topic labels. To enhance the model's domain-specific knowledge, we utilize Retrieval-Augmented Generation (RAG), integrating external knowledge sources into the information extraction pipeline. We validated our method through extensive experiments comparing its performance with traditional NLP approaches on user tweets from a real-world transit system. Our results demonstrate the potential of LLMs to transform social media analysis in the public transit domain, extracting a broader range of information to provide actionable insights and enhance transit agencies' responsiveness.
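A hedged sketch of the pipeline outlined above: retrieval over a small domain knowledge base followed by a structured-extraction prompt to an LLM. The `embed` and `generate` callables stand in for any sentence-embedding model and a Llama-3-class chat model; the JSON schema is an assumption, not the paper's.

```python
import json
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine-similarity retrieval over a small transit knowledge base.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def extract_transit_info(tweet, docs, doc_vecs, embed, generate):
    context = "\n".join(retrieve(embed(tweet), doc_vecs, docs))
    prompt = (
        "Using the transit context below, return JSON with keys "
        "sentiment, sarcasm, issue, location.\n"
        f"Context:\n{context}\n\nTweet: {tweet}\nJSON:"
    )
    return json.loads(generate(prompt))  # model is prompted to emit parseable JSON
```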
DST-TransitNet: A Dynamic Spatio-Temporal Deep Learning Model for Scalable and Efficient Network-Wide Prediction of Station-Level Transit Ridership
Wang, Jiahao, Shalaby, Amer
Accurate prediction of public transit ridership is vital for efficient planning and management of transit in rapidly growing urban areas in Canada. Unexpected increases in passengers can cause overcrowded vehicles, longer boarding times, and service disruptions. Traditional time series models like ARIMA and SARIMA face limitations, particularly in short-term prediction and in integrating spatial and temporal features. These models struggle with the dynamic nature of ridership patterns and often ignore spatial correlations between nearby stops. Deep Learning (DL) models present a promising alternative, demonstrating superior performance in short-term prediction tasks by effectively capturing both spatial and temporal features. However, challenges such as dynamic spatial feature extraction, balancing accuracy with computational efficiency, and ensuring scalability remain. This paper introduces DST-TransitNet, a hybrid DL model for system-wide station-level ridership prediction. The proposed model uses graph neural networks (GNN) and recurrent neural networks (RNN) to dynamically integrate the changing temporal and spatial correlations among stations. The model also employs a precise time series decomposition framework to enhance accuracy and interpretability. Tested on Bogotá's BRT system data across three distinct social scenarios, DST-TransitNet outperformed state-of-the-art models in precision, efficiency, and robustness, while maintaining stability over long prediction horizons, demonstrating practical applicability.
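As a rough illustration of the GNN+RNN hybrid design (not the paper's actual architecture), the sketch below mixes station features over a row-normalized adjacency matrix and runs a GRU per station; the class name `GraphGRURidership` and all dimensions are invented.

```python
import torch
import torch.nn as nn

class GraphGRURidership(nn.Module):
    """Illustrative GNN+RNN hybrid: one-hop graph mixing for spatial
    correlation, a GRU for temporal dynamics, one forecast per station."""
    def __init__(self, in_dim, hid_dim, horizon):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, horizon)

    def forward(self, x, adj):
        # x: (batch, time, stations, features); adj: row-normalized (stations, stations)
        b, t, n, _ = x.shape
        h = torch.relu(self.proj(torch.einsum("ij,btjf->btif", adj, x)))
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, -1)   # one sequence per station
        _, last = self.gru(h)
        return self.head(last.squeeze(0)).view(b, n, -1)  # horizon-step forecast
```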
Leveraging Large Language Models for Enhancing Public Transit Services
Wang, Jiahao, Shalaby, Amer
Public transit systems play a crucial role in providing efficient and sustainable transportation options in urban areas, yet they face various challenges in meeting commuters' needs. Meanwhile, despite the rapid development of Large Language Models (LLMs), their integration into transit systems remains relatively unexplored. The objective of this paper is to explore the utilization of LLMs in public transit systems, with a specific focus on improving the customer experience and transit staff performance. We present a general framework for developing LLM applications in transit systems, wherein the LLM serves as the intermediary for information communication between natural language content and the resources within the database. In this context, the LLM plays a multifaceted role: understanding users' requirements, retrieving data in response to user queries, and tailoring the information to the users' specific needs. Three transit LLM applications are presented: Tweet Writer, Trip Advisor, and Policy Navigator. Tweet Writer automates transit system alert updates on social media, Trip Advisor offers customized transit trip suggestions, and Policy Navigator provides clear and personalized answers to policy queries. Leveraging LLMs in these applications enables seamless communication through their capability to understand and generate human-like language. With these three applications, transit media personnel can provide system updates more efficiently, and customers can access travel information and policy answers in a more user-friendly manner.
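The intermediary pattern described above can be sketched as a two-call loop: the LLM first maps a rider's question to a structured database query, then phrases the retrieved rows as an answer. The `generate` and `run_sql` callables and the schema hint are hypothetical stand-ins, not the paper's implementation.

```python
import json

# Invented schema for illustration only.
SCHEMA_HINT = "Tables: alerts(route, status), fares(rider_type, price)"

def answer_rider(question, run_sql, generate):
    # Step 1: LLM translates natural language into a structured lookup.
    sql = generate(
        f"{SCHEMA_HINT}\nWrite one SQL query answering: {question}\nSQL:"
    )
    rows = run_sql(sql)                 # Step 2: fetch facts from the transit database
    # Step 3: LLM tailors the retrieved data to the rider's question.
    return generate(
        f"Question: {question}\nData: {json.dumps(rows)}\n"
        "Answer in one friendly sentence:"
    )
```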
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Wang, Qiuheng, Shi, Yukai, Ou, Jiarong, Chen, Rui, Lin, Ke, Wang, Jiahao, Jiang, Boyuan, Yang, Haotian, Zheng, Mingwu, Tao, Xin, Yang, Fei, Wan, Pengfei, Zhang, Di
As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the split videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at https://koala36m.github.io/.
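As a toy illustration of the filtering idea (not the paper's actual VTSS), a suitability score can be a weighted fusion of normalized sub-metrics; the sub-metric names, weights, and threshold below are all assumptions.

```python
def vtss(metrics, weights=None):
    # metrics: dict of sub-metric name -> value normalized to [0, 1];
    # with no weights given, fall back to a uniform average.
    weights = weights or {k: 1.0 / len(metrics) for k in metrics}
    return sum(weights[k] * v for k, v in metrics.items())

# Hypothetical per-clip sub-metrics and filter threshold.
clip = {"aesthetics": 0.8, "motion": 0.6, "clarity": 0.9, "text_overlay": 0.7}
keep = vtss(clip) >= 0.7
```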
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
Chen, Mengzhao, Liu, Yi, Wang, Jiahao, Bin, Yi, Shao, Wenqi, Luo, Ping
Quantization is essential for deploying Large Language Models (LLMs), enhancing memory efficiency and inference speed. Existing methods for activation quantization mainly address channel-wise outliers, often neglecting token-wise outliers, which leads to reliance on costly per-token dynamic quantization. To address this, we introduce PrefixQuant, a novel technique that isolates outlier tokens offline without re-training. Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization. To our knowledge, PrefixQuant is the first method to enable efficient per-tensor static quantization that outperforms expensive per-token dynamic quantization. For instance, on W4A4KV4 (4-bit weight, 4-bit activation, and 4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on five common-sense reasoning tasks, outperforming previous per-token dynamic quantization methods such as QuaRot by 0.98 in perplexity and 5.98 points in accuracy. Additionally, the inference speed of W4A4 quantized models using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot models by 1.2x to 1.3x. Our code is available at https://github.com/ChenMnZ/PrefixQuant.
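A minimal sketch of the contrast at stake: per-tensor static quantization applies one offline-calibrated scale and zero-point to a whole activation tensor, which is only viable once outlier tokens are kept out of the quantized path. `fake_quant_per_tensor` and the commented prefix usage are illustrative, not the released implementation.

```python
import torch

def fake_quant_per_tensor(x, n_bits=4):
    # Per-tensor STATIC quantization: a single scale/zero-point for the
    # whole tensor, calibrated offline rather than recomputed per token.
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / qmax
    zp = (-x.min() / scale).round()
    return ((x / scale + zp).round().clamp(0, qmax) - zp) * scale

# Schematically, the prefix trick: run the identified outlier tokens once,
# keep their KV cache, and quantize only the now well-behaved activations.
# past_kv = model(outlier_prefix_ids, use_cache=True).past_key_values
# out = model(input_ids, past_key_values=past_kv)
```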
OneActor: Consistent Character Generation via Cluster-Conditioned Guidance
Wang, Jiahao, Yan, Caixia, Lin, Haonan, Zhang, Weizhan, Wang, Mengmeng, Gong, Tieliang, Dai, Guang, Sun, Hao
Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature hinders artists from creating consistent images of the same subject. Existing methods try to tackle this challenge and generate consistent content in various ways, but they either depend on external restricted data or require expensive tuning of the diffusion model. To address this issue, we propose a novel one-shot tuning paradigm, termed OneActor. It efficiently performs consistent subject generation driven solely by prompts, via learned semantic guidance that bypasses laborious backbone tuning. We are the first to formalize the objective of consistent subject generation from a clustering perspective, and we accordingly design a cluster-conditioned model. To mitigate the overfitting challenge shared by one-shot tuning pipelines, we augment the tuning with auxiliary samples and devise two inference strategies: semantic interpolation and cluster guidance. These techniques are verified to significantly enhance generation quality. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory subject consistency, superior prompt conformity, and high image quality. Our method is capable of multi-subject generation and compatible with popular diffusion extensions. Besides, we achieve 4x faster tuning than tuning-based baselines and, if desired, avoid increasing inference time. Furthermore, to the best of our knowledge, we are the first to prove that the semantic space of the diffusion model has the same interpolation property as the latent space. This property can serve as another promising tool for fine-grained generation control.
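The semantic-interpolation strategy can be pictured as a linear blend in the prompt-embedding (semantic) space, exploiting the interpolation property the abstract claims; the function below is a hypothetical sketch, and `alpha` is an assumed knob trading prompt conformity for subject consistency.

```python
import torch

def semantic_interpolation(base_emb, target_emb, alpha=0.6):
    # Linear interpolation between the plain prompt embedding and the
    # learned cluster-conditioned guidance embedding.
    return (1 - alpha) * base_emb + alpha * target_emb

# During sampling, the interpolated embedding conditions the denoiser, e.g.:
# eps = unet(latents, t, encoder_hidden_states=semantic_interpolation(e0, e1))
```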
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Chen, Mengzhao, Shao, Wenqi, Xu, Peng, Wang, Jiahao, Gao, Peng, Zhang, Kaipeng, Qiao, Yu, Luo, Ping
Large language models (LLMs) are integral to modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it demands substantial training resources to optimize model weights and quantization parameters. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Block-AP sequentially conducts quantization-aware training of all parameters in each transformer block with block-wise reconstruction, maintaining efficiency by avoiding training the entire LLM. Initialized with the quantized model, E2E-QP then trains only the quantization parameters (step sizes) end-to-end, enhancing efficiency with a fixed quantized backbone and a reduced trainable parameter count. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, at scales from 7B to 70B parameters and various quantization bit-widths. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation compared to full precision (69.48 vs. 72.41). Notably, this INT2-quantized 70B model obtains a 1.67-point accuracy gain over the Llama-2-13B model (69.48 vs. 67.81) while requiring less memory (19.2GB vs. 24.2GB). Code is available at https://github.com/OpenGVLab/EfficientQAT.
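A hedged sketch of the E2E-QP idea: once Block-AP has fixed the integer weights, only the step sizes stay trainable, so the end-to-end pass updates very few parameters. The class name and buffer layout below are assumptions, not the repository's code.

```python
import torch
import torch.nn as nn

class E2EQPWeight(nn.Module):
    """Fixed quantized backbone with a trainable step size (scale):
    gradients flow only into `scale` during end-to-end training."""
    def __init__(self, w_int, scale, zero_point):
        super().__init__()
        self.register_buffer("w_int", w_int)   # frozen integer weights from Block-AP
        self.register_buffer("zp", zero_point)
        self.scale = nn.Parameter(scale)       # the only trainable parameter

    def dequant(self):
        # Dequantize on the fly; differentiable w.r.t. the step size.
        return (self.w_int - self.zp) * self.scale
```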
Fast and Continual Knowledge Graph Embedding via Incremental LoRA
Liu, Jiajun, Ke, Wenjun, Wang, Peng, Wang, Jiahao, Gao, Jinhua, Shang, Ziyu, Li, Guozheng, Xu, Zijie, Ji, Ke, Li, Yining
Continual Knowledge Graph Embedding (CKGE) aims to efficiently learn new knowledge while preserving old knowledge. Dominant approaches primarily focus on alleviating catastrophic forgetting of old knowledge but neglect efficient learning of newly emerging knowledge. In real-world scenarios, however, knowledge graphs (KGs) grow continuously, which makes efficiently fine-tuning KGE models a significant challenge. To address this issue, we propose a fast CKGE framework incorporating an incremental low-rank adapter mechanism that efficiently acquires new knowledge while preserving old knowledge. Specifically, to mitigate catastrophic forgetting, the framework isolates and allocates new knowledge to specific layers based on the fine-grained influence between old and new KGs. Subsequently, to accelerate fine-tuning, it embeds these specific layers into incremental low-rank adapters with fewer training parameters. Moreover, the mechanism introduces adaptive rank allocation, making the LoRA aware of the importance of entities and adjusting its rank scale adaptively. We conduct experiments on four public datasets and two new datasets with a larger initial scale. Experimental results demonstrate that the framework reduces training time by 34%-49% while achieving link prediction performance competitive with state-of-the-art models on the four public datasets (average MRR score of 21.0% vs. 21.1%). On the two newly constructed datasets, it saves 51%-68% of training time and improves link prediction performance by 1.5%.
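One way to picture the incremental low-rank adapter, as a sketch under assumptions rather than the paper's implementation: the old embedding matrix is frozen, and each growth step contributes a low-rank delta whose rank can be set per step (adaptive rank allocation).

```python
import torch
import torch.nn as nn

class IncrementalLoRA(nn.Module):
    """Illustrative incremental adapter for a KGE weight matrix: old
    knowledge is frozen, new knowledge lives in a low-rank A @ B update."""
    def __init__(self, weight, rank):
        super().__init__()
        self.register_buffer("frozen", weight)          # old knowledge, frozen
        d_out, d_in = weight.shape
        # `rank` can differ per growth step to reflect entity importance.
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_in))  # zero-init: no drift at start

    def effective_weight(self):
        return self.frozen + self.A @ self.B            # new knowledge as a delta
```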
Mixture-of-Subspaces in Low-Rank Adaptation
Wu, Taiqiang, Wang, Jiahao, Zhao, Zhe, Wong, Ngai
In this paper, we introduce a subspace-inspired Low-Rank Adaptation (LoRA) method that is computationally efficient, easy to implement, and readily applicable to large language, multimodal, and diffusion models. We first equivalently decompose the weights of LoRA into two subspaces and find that simply mixing them can enhance performance. To study this phenomenon, we revisit it through a fine-grained subspace lens and show that such modification is equivalent to employing a fixed mixer to fuse the subspaces. To be more flexible, we jointly learn the mixer with the original LoRA weights, and term the method Mixture-of-Subspaces LoRA (MoSLoRA). MoSLoRA consistently outperforms LoRA on tasks in different modalities, including commonsense reasoning, visual instruction tuning, and subject-driven text-to-image generation, demonstrating its effectiveness and robustness. Code is available at https://github.com/wutaiqiang/MoSLoRA.
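Since the abstract specifies the construction, a compact sketch is possible: vanilla LoRA computes a delta `B @ A`, and MoSLoRA inserts a small learnable r x r mixer between the two factors. The wrapping and initialization details below are assumptions.

```python
import torch
import torch.nn as nn

class MoSLoRALinear(nn.Module):
    """Sketch of Mixture-of-Subspaces LoRA: a standard LoRA pair (A, B)
    with a learnable r x r mixer fusing the rank-1 subspaces."""
    def __init__(self, base: nn.Linear, r=8):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # Identity init shown for clarity: with mixer = I this reduces to
        # vanilla LoRA; the paper learns the mixer jointly with A and B.
        self.mixer = nn.Parameter(torch.eye(r))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.mixer.T @ self.B.T
```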