
Collaborating Authors

 Zhao, Jishen


SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters

arXiv.org Artificial Intelligence

While large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices like mobile phones increasingly challenging. In this paper, we propose SHARP (SHaring Adjacent Layers with Recovery Parameters), a novel approach to accelerate LLM inference by sharing parameters across adjacent layers, thus reducing memory load overhead, while introducing low-rank recovery parameters to maintain performance. Inspired by the observation that consecutive layers produce similar outputs, SHARP employs a two-stage recovery process: Single Layer Warmup (SLW) and Supervised Fine-Tuning (SFT). Extensive experiments demonstrate that SHARP can recover the model's perplexity on various in-distribution tasks using no more than 50k fine-tuning examples while reducing the number of stored MLP parameters by 38% to 65%. We also conduct several ablation studies of SHARP and show that replacing layers towards the later parts of the model yields better performance retention, and that different recovery parameterizations perform similarly when parameter counts are matched. Furthermore, SHARP saves 42.8% in model storage and reduces total inference time by 42.2% compared to the original Llama2-7b model on mobile devices. Our results highlight SHARP as an efficient solution for reducing inference costs when deploying LLMs without the need for pretraining-scale resources. Deploying a pre-trained large language model requires significant computational and memory resources (Aminabadi et al., 2022; Pope et al., 2023; Kim et al., 2023b; Zhang et al., 2024b), which can further restrict inference speed. For instance, a 70-billion-parameter language model stored in FP16 precision requires approximately 148GB of memory to hold the model weights, necessitating two A100 GPUs with 80GB of memory each to load the entire model. During inference, the input sequence and the KV cache are also stored on the GPU, incurring additional memory usage. These concerns are especially significant for deployment on mobile devices, which typically have smaller DRAM (e.g., around 6GB in the iPhone 15) and higher communication overhead (Liu et al., 2024). Prior layer-repetition approaches, by contrast, repeat each layer and train the model from scratch; SHARP instead leverages fine-tuning-scale data to train additional recovery parameters, far fewer in number than the original parameters, in order to recover the model's performance. In this paper, we explore several candidate transformations, including a LoRA-style function, for parameterizing these recovery parameters.
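As a rough illustration of the layer-sharing idea, the sketch below reuses one layer's MLP weights in the adjacent layer and adds a LoRA-style low-rank correction; the class name, rank, and initialization are illustrative assumptions rather than SHARP's actual implementation.

```python
# Minimal PyTorch sketch, assuming a LoRA-style recovery parameterization.
import torch
import torch.nn as nn

class SharedMLPWithRecovery(nn.Module):
    """Layer i+1 reuses layer i's MLP weights plus a small low-rank correction."""
    def __init__(self, shared_mlp: nn.Linear, rank: int = 8):
        super().__init__()
        self.shared_mlp = shared_mlp  # borrowed from the adjacent layer, not stored again
        d_in, d_out = shared_mlp.in_features, shared_mlp.out_features
        # Low-rank recovery parameters: the only new weights stored for this layer.
        self.lora_A = nn.Parameter(torch.zeros(rank, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
        nn.init.normal_(self.lora_A, std=0.02)

    def forward(self, x):
        # Shared projection plus low-rank correction approximates the replaced layer.
        return self.shared_mlp(x) + x @ self.lora_A.T @ self.lora_B.T
```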


OrcaLoca: An LLM Agent Framework for Software Issue Localization

arXiv.org Artificial Intelligence

Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
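A hedged sketch of the priority-based scheduling and action decomposition described above: actions proposed by the LLM are ranked by a relevance score and executed highest-score first, and executing an action may emit finer-grained sub-actions. The action representation and scoring callbacks are assumptions, not OrcaLoca's actual interfaces.

```python
# Illustrative priority-queue scheduler; score_fn and execute_fn are assumed callbacks.
import heapq

def schedule_actions(initial_actions, score_fn, execute_fn, max_steps=50):
    """Pop the highest-relevance action first; executed actions may yield sub-actions."""
    heap = [(-score_fn(a), i, a) for i, a in enumerate(initial_actions)]
    heapq.heapify(heap)
    counter = len(heap)
    located_snippets = []
    while heap and len(located_snippets) < max_steps:
        _, _, action = heapq.heappop(heap)
        snippet, sub_actions = execute_fn(action)     # e.g. "search class", "read function"
        if snippet is not None:
            located_snippets.append(snippet)
        for sub in sub_actions:                       # decomposition into finer follow-ups
            heapq.heappush(heap, (-score_fn(sub), counter, sub))
            counter += 1
    return located_snippets
```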


MAGE: A Multi-Agent Engine for Automated RTL Code Generation

arXiv.org Artificial Intelligence

The automatic generation of RTL code (e.g., Verilog) through natural language instructions has emerged as a promising direction with the advancement of large language models (LLMs). However, producing RTL code that is both syntactically and functionally correct remains a significant challenge. Existing single-LLM-agent approaches face substantial limitations because they must navigate between various programming languages and handle intricate generation, verification, and modification tasks. To address these challenges, this paper introduces MAGE, the first open-source multi-agent AI system designed for robust and accurate Verilog RTL code generation. We propose a novel high-temperature RTL candidate sampling and debugging system that effectively explores the space of code candidates and significantly improves their quality. Furthermore, we design a novel Verilog-state checkpoint checking mechanism that enables early detection of functional errors and delivers precise feedback for targeted fixes, significantly enhancing the functional correctness of the generated RTL code. MAGE achieves a 95.7% rate of syntactically and functionally correct code generation on the VerilogEval-Human 2 benchmark, surpassing the state-of-the-art Claude-3.5-sonnet by 23.3%, demonstrating a robust and reliable approach for AI-driven RTL design workflows.
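To make the candidate-sampling-plus-checking loop concrete, here is a minimal sketch under stated assumptions: generate_rtl, run_syntax_check, and run_checkpoint_tests are hypothetical helpers standing in for the LLM agents and Verilog tooling, not MAGE's real components.

```python
# Sketch of high-temperature candidate sampling with early checkpoint checking.
def sample_rtl_candidates(spec, generate_rtl, run_syntax_check, run_checkpoint_tests,
                          n_candidates=8, temperature=1.2):
    """Returns (code, feedback); feedback is None when all checkpoint tests pass."""
    fallback = None
    for _ in range(n_candidates):
        code = generate_rtl(spec, temperature=temperature)  # high temperature -> diverse candidates
        if not run_syntax_check(code):
            continue                                        # drop syntactically broken candidates
        # Checkpoint-style checking: compare intermediate signal states against the spec,
        # catching functional errors early and returning feedback for targeted fixes.
        passed, feedback = run_checkpoint_tests(code)
        if passed:
            return code, None
        fallback = fallback or (code, feedback)             # keep a failing candidate plus its feedback
    return fallback
```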


Grounding Large Language Models In Embodied Environment With Imperfect World Models

arXiv.org Artificial Intelligence

Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose Grounding Large language models with Imperfect world MOdels (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of 2.04, 1.54, and 1.82 across three different benchmarks, respectively. The resulting models can compete with or surpass larger counterparts such as GPT-4. Recent advances in Large Language Models (LLMs) are driving great transformation in various robotics applications, such as self-driving cars (Mao et al. (2023)), autonomous drones (Vemprala et al. (2023)), and robotic manipulation (Liang et al. (2022)). LLMs can enhance robots with rich common-sense knowledge and complex planning capabilities. However, an LLM needs to be physically grounded in reality, which includes an understanding of the environment dynamics, task-related constraints, and the consequences of its actions (Gao et al. (2023); Rana et al. (2023)). Many previous works in robot learning rely heavily on prompting, such as (1) decomposing problem structures using human priors (Rana et al. (2023); Liang et al. (2022)), (2) self-refinement (Zhang et al. (2023); Wang et al. (2023a)), and (3) external tools (Mao et al. (2023)). This approach does not alter the weights of the model, instead relying on the pretrained knowledge of the LLMs. However, LLMs are trained on text corpora and lack an understanding of the fine-grained semantics of physical environments. They also suffer from hallucination problems (Rawte et al. (2023)) and difficulties with understanding time-aware actions (Dhingra et al. (2021)). Moreover, the "heavy prompting" approach often proves effective only in small-scale environments, like a predefined room with fixed sets of objects and available actions (Rana et al. (2023)).
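The data-generation loop above can be pictured roughly as follows; the simulator, llm, and retriever objects are assumed stand-ins for GLIMO's components, so treat this as a sketch of the idea rather than the paper's code.

```python
# Illustrative simulator-driven instruction-data generation with self-refinement and retrieval.
def generate_instruction_data(seeds, simulator, llm, retriever, refine_rounds=2):
    dataset = []
    for seed in seeds:                                      # diverse QA-style instruction seeds
        plan = llm.propose_plan(seed)
        for _ in range(refine_rounds):
            trajectory = simulator.rollout(plan)            # experience from the proxy world model
            past = retriever.retrieve(seed, trajectory)     # retrieval-augmented reflection on prior runs
            plan = llm.refine_plan(seed, trajectory, past)  # iterative, temporally consistent refinement
        dataset.append({"instruction": seed, "response": plan})
    return dataset
```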


A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are widely adopted across various tasks, including Question-Answering (QA). More and more models, including fine-tuned models and the datasets used for fine-tuning, are being released daily. This explosion in the number of models and datasets emphasizes the importance of accurate automatic evaluation for out-of-the-model language model training as well as gauging their QA capabilities. However, varying question types (short-form, long-form, open-ended, etc.) and ambiguity in the grading rubric make it difficult to objectively gauge each model's capability on QA tasks. No single existing evaluation metric can capture a language model's QA capability across multiple quality types. For example, Exact Match (EM) is a widely adopted all-or-nothing evaluation metric that shows a high correlation with human-evaluated scores for short-form QA tasks, but it is too strict to give credit to semantically equivalent answers. The lack of an objective grading rubric for varying QA types creates a bias in summary statistics; for example, half credit for an open-ended question is regarded equally as half credit for a simple factual question. In this paper, we (1) deploy statistical approaches to characterize various existing evaluation metrics, (2) examine the effectiveness of the recent ChatGPT-o1-preview model [6] as a QA grader, and (3) propose a potential solution, a Mixture Of Grader (MOG), which first classifies each (question, gold answer) pair into one of the predefined QA type classes and then selects the appropriate evaluation metric accordingly for an advanced automatic evaluation that better "correlates" with human evaluators.
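A minimal sketch of the routing idea behind MOG, assuming a hypothetical QA-type classifier and a hand-picked metric per type; the labels and metric choices below are illustrative, not the paper's final design.

```python
# Route each (question, gold answer) pair to a metric chosen by its predicted QA type.
def mog_score(question, gold, prediction, classify_qa_type, metrics):
    """metrics maps a QA-type label (e.g. 'short_form', 'long_form', 'open_ended')
    to a scoring function (gold, prediction) -> float in [0, 1]."""
    qa_type = classify_qa_type(question, gold)    # e.g. a small classifier or an LLM prompt
    grader = metrics.get(qa_type, metrics["short_form"])
    return grader(gold, prediction)

# Example routing: exact match for short-form answers, softer graders elsewhere, e.g.
# metrics = {"short_form": exact_match, "long_form": rouge_l, "open_ended": llm_grader}
```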


You Only Use Reactive Attention Slice For Long Context Retrieval

arXiv.org Artificial Intelligence

Supporting longer contexts for Large Language Models (LLMs) is a promising direction for advancing LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called the reaction score to rank the relevance of each sentence in the input context to the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieve the most reactive sentences. Internally, YOURA generates a token-indexed vector (called the reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique with three open-source pre-trained LLMs across six LongBench QA datasets. Our technique achieves up to a 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
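One plausible reading of the reaction score is sketched below: compare per-token attention over the context with and without the query appended, average the change over each sentence's token span, and retrieve the most reactive sentences first. The attention extraction and sentence-to-token mapping here are simplified assumptions, not YOURA's exact algorithm.

```python
# Rank sentences by how strongly their tokens' attention "reacts" to the query.
import numpy as np

def rank_sentences_by_reaction(attn_without_query, attn_with_query, sentence_spans):
    """attn_*: 1-D arrays of per-token attention mass over the context tokens.
    sentence_spans: list of (start, end) token indices per sentence (cf. EASY)."""
    reaction_vector = np.abs(attn_with_query - attn_without_query)   # token-indexed reaction vector
    scores = [reaction_vector[start:end].mean() for start, end in sentence_spans]
    return np.argsort(scores)[::-1]    # sentence indices, most reactive first
```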


Learning to Maximize Mutual Information for Chain-of-Thought Distillation

arXiv.org Artificial Intelligence

Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, marks a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a novel method utilizing chain-of-thought (CoT) distillation, has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model acquires the ability to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To this end, we investigate the mutual relationship between the two tasks from the Information Bottleneck perspective and formulate it as maximizing the mutual information of the representation features of the two tasks. We propose a variational approach to solve this optimization problem using a learning-based method. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on language model distillation as well as applications involving CoT. Code is available at https://github.com/xinchen9/cot_distillation_ACL2024.
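For intuition, maximizing mutual information between the two task representations is often implemented with a variational lower bound such as InfoNCE; the sketch below shows that style of objective as an assumed instantiation, not the paper's exact formulation.

```python
# PyTorch sketch: InfoNCE-style lower bound between rationale and label-prediction features.
import torch
import torch.nn.functional as F

def info_nce_loss(rationale_feats, label_feats, temperature=0.1):
    """rationale_feats, label_feats: (batch, dim) features from the two training tasks.
    Minimizing this loss maximizes a lower bound on their mutual information."""
    z1 = F.normalize(rationale_feats, dim=-1)
    z2 = F.normalize(label_feats, dim=-1)
    logits = z1 @ z2.T / temperature                  # matched vs. mismatched pair similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```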


GeoT: Tensor Centric Library for Graph Neural Network via Efficient Segment Reduction on GPU

arXiv.org Artificial Intelligence

In recent years, Graph Neural Networks (GNNs) have ignited a surge of innovation, significantly enhancing the processing of geometric data structures such as graphs, point clouds, and meshes. As the domain continues to evolve, a series of frameworks and libraries are being developed to push GNN efficiency to new heights. While graph-centric libraries have achieved success in the past, the advent of efficient tensor compilers has highlighted the urgent need for tensor-centric libraries. Yet, efficient tensor-centric frameworks for GNNs remain scarce due to unique challenges and limitations encountered when implementing segment reduction in GNN contexts. We introduce GeoT, a cutting-edge tensor-centric library designed specifically for GNNs via efficient segment reduction. GeoT debuts innovative parallel algorithms that not only introduce new design principles but also expand the available design space. Importantly, GeoT is engineered for straightforward fusion within a computation graph, ensuring compatibility with contemporary tensor-centric machine learning frameworks and compilers. Setting a new performance benchmark, GeoT marks a considerable advancement by showcasing an average operator speedup of 1.80x and an end-to-end speedup of 1.68x.
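For readers unfamiliar with the primitive, segment reduction collapses the rows of a feature matrix into per-node buckets given an index vector, which is the core of GNN message aggregation. The reference version below is an unfused PyTorch baseline for clarity only; GeoT's contribution is the efficient, fusable GPU implementation of this operation.

```python
# Unoptimized reference segment-sum: sum all edge messages that share a destination node.
import torch

def segment_sum(src: torch.Tensor, index: torch.Tensor, num_segments: int) -> torch.Tensor:
    """src: (num_edges, feat_dim) edge messages; index: (num_edges,) destination node ids."""
    out = torch.zeros(num_segments, src.size(1), dtype=src.dtype, device=src.device)
    out.index_add_(0, index, src)
    return out
```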


Multi-modal Learning for WebAssembly Reverse Engineering

arXiv.org Artificial Intelligence

The increasing adoption of WebAssembly (Wasm) for performance-critical and security-sensitive tasks drives the demand for WebAssembly program comprehension and reverse engineering. Recent studies have introduced machine learning (ML)-based WebAssembly reverse engineering tools. Yet, the generalization of task-specific ML solutions remains challenging, because their effectiveness hinges on the availability of an ample supply of high-quality task-specific labeled data. Moreover, previous works overlook the high-level semantics present in source code and its documentation. Acknowledging the abundance of available source code with documentation, which can be compiled into WebAssembly, we propose to learn their representations concurrently and harness their mutual relationships for effective WebAssembly reverse engineering. In this paper, we present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering. WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus encompassing source code, code documentation and the compiled WebAssembly, without requiring labeled data. WasmRev incorporates three tailored multi-modal pre-training tasks to capture various characteristics of WebAssembly and cross-modal relationships. WasmRev is only trained once to produce general-purpose representations that can broadly support WebAssembly reverse engineering tasks through few-shot fine-tuning with much less labeled data, improving data efficiency. We fine-tune WasmRev on three important reverse engineering tasks: type recovery, function purpose identification and WebAssembly summarization. Our results show that WasmRev pre-trained on the corpus of multi-modal samples establishes a robust foundation for these tasks, achieving high task accuracy and outperforming the state-of-the-art ML methods for WebAssembly reverse engineering.
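As a rough picture of how a multi-modal pre-training sample might be assembled from the three modalities mentioned above, the sketch below concatenates documentation, source code, and Wasm tokens and applies masked-token prediction; the separator tokens and masking scheme are assumptions, not WasmRev's actual recipe.

```python
# Assemble one masked multi-modal pre-training sample (docs + source + Wasm tokens).
import random

SPECIAL = ("[DOC]", "[SRC]", "[WASM]")   # hypothetical modality separators

def build_pretraining_sample(doc_tokens, source_tokens, wasm_tokens, mask_prob=0.15):
    sequence = ["[DOC]"] + doc_tokens + ["[SRC]"] + source_tokens + ["[WASM]"] + wasm_tokens
    labels = list(sequence)                                  # targets are the original tokens
    masked = [tok if tok in SPECIAL or random.random() >= mask_prob else "[MASK]"
              for tok in sequence]                           # cross-modal masked-token prediction
    return masked, labels
```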


Sibyl: Forecasting Time-Evolving Query Workloads

arXiv.org Artificial Intelligence

Database systems often rely on historical query traces to perform workload-based performance tuning. However, real production workloads are time-evolving, making historical queries ineffective for optimizing future workloads. To address this challenge, we propose Sibyl, an end-to-end machine learning-based framework that accurately forecasts a sequence of future queries, with the entire query statements, in various prediction windows. Drawing insights from real workloads, we propose template-based featurization techniques and develop a stacked-LSTM with an encoder-decoder architecture for accurate forecasting of query workloads. We also develop techniques to improve forecasting accuracy over large prediction windows. For workload-based optimization, the input workload plays a crucial role and needs to be a good representation of the expected workload. Traditionally, historical query traces have been used as input workloads with the assumption that workloads are mostly static. However, as we discuss in Section 2, many real workloads exhibit highly recurring query structures with changing patterns in both their arrival intervals and data accesses. For instance, query templates are often shared across users, teams, and applications, but may be customized with different parameter values to access varying data at different points in time. Consider a log analysis query that reports errors for different devices and error types: "SELECT * FROM T WHERE deviceType = ? AND errorType = ? AND eventDate BETWEEN ? ..."
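A sketch of the stacked-LSTM encoder-decoder named above, operating on featurized query templates; the feature dimensions and layout are illustrative assumptions, not Sibyl's implementation.

```python
# PyTorch sketch: encode a window of past query features, decode a forecast horizon.
import torch
import torch.nn as nn

class QueryForecaster(nn.Module):
    def __init__(self, feat_dim=64, hidden=256, layers=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)   # predicts the next template features/parameters

    def forward(self, history, future_inputs):
        # history: (batch, past_window, feat_dim) featurized past queries
        # future_inputs: (batch, horizon, feat_dim) teacher-forced decoder inputs
        _, state = self.encoder(history)
        out, _ = self.decoder(future_inputs, state)
        return self.head(out)                     # (batch, horizon, feat_dim) forecast
```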