 bitvector



SBVR: Summation of BitVector Representation for Efficient LLM Quantization

arXiv.org Artificial Intelligence

With the advent of large language models (LLMs), numerous Post-Training Quantization (PTQ) strategies have been proposed to alleviate deployment barriers created by their enormous parameter counts. Quantization achieves compression by limiting the number of representable points in the data. Therefore, the key to achieving efficient quantization is selecting the optimal combination of representation points, or codes, for the given data. Existing PTQ solutions adopt two major approaches to this problem: Round-To-Nearest (RTN)-based methods and codebook-based methods. RTN-based methods map LLM weights onto uniformly distributed integer grids, failing to account for the Gaussian-like distribution of LLM weights. Codebook-based methods mitigate this issue by constructing distribution-aware codebooks; however, they suffer from random and strided memory access patterns, resulting in degraded inference speed that is exacerbated by the limited size of GPU L1 cache. To overcome these limitations, we propose a novel LLM quantization method, SBVR (Summation of BitVector Representation), that enables Gaussian-like code representation in a hardware-friendly manner for fast inference. SBVR maps weight values to non-uniform representation points whose distribution follows the actual distribution of LLM weights, enabling more accurate compression. Additionally, we design a custom CUDA kernel that allows matrix-vector multiplication directly in the SBVR format without decompression, thereby enabling high-performance execution of SBVR-compressed models. Our evaluations of SBVR on various models demonstrate state-of-the-art perplexity and accuracy benchmark performance while delivering a 2.21x to 3.04x end-to-end token-generation speedup over naive FP16 models in the 4-bit quantization regime.
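The abstract above does not spell out the SBVR encoding, so the following is only a generic sketch of the idea its name suggests: each weight is approximated as a sum of coefficients selected by bitvectors, and a dot product is computed on the bitvectors directly. The coefficient values, the bit layout, and the function names `decode_weights` and `dot_compressed` are all assumptions made for illustration; this is not the paper's format or its CUDA kernel.

```python
# Illustrative only: a generic "sum of scaled bitvectors" weight encoding.
# Coefficients, bit layout, and function names are assumptions for this
# sketch, not the SBVR format described in the paper.

def decode_weights(coeffs, bitvectors, n):
    """Reconstruct n weights as w[i] = sum over k of coeffs[k] * bit_k[i]."""
    return [
        sum(c for c, bits in zip(coeffs, bitvectors) if (bits >> i) & 1)
        for i in range(n)
    ]

def dot_compressed(coeffs, bitvectors, x):
    """Compute dot(w, x) without materializing the weights w."""
    total = 0.0
    for c, bits in zip(coeffs, bitvectors):
        # Sum the inputs selected by this bitvector, then scale once.
        partial = sum(xi for i, xi in enumerate(x) if (bits >> i) & 1)
        total += c * partial
    return total

coeffs = [-0.5, 0.25, 0.125]            # hypothetical coefficients
bitvectors = [0b0110, 0b1010, 0b1101]   # bit i of each int belongs to weight i
x = [1.0, 2.0, 3.0, 4.0]

w = decode_weights(coeffs, bitvectors, len(x))
direct = sum(wi * xi for wi, xi in zip(w, x))
fused = dot_compressed(coeffs, bitvectors, x)
```

With K coefficients, each weight can take any of the 2^K subset sums, which are non-uniformly spaced in general. The fused product needs only one multiplication per bitvector, which is one reason this kind of layout can be hardware-friendly.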


Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code-related tasks ranging from test case generation to self-repair. Unsurprisingly, however, models struggle to compose syntactically valid programs in programming languages unrepresented in pre-training, referred to as very low-resource Programming Languages (VLPLs). VLPLs appear in crucial settings, including domain-specific languages for internal tools and tool-chains for legacy languages. Inspired by an HCI technique called natural program elicitation, we propose designing an intermediate language that LLMs "naturally" know how to use and which can be automatically compiled to a target VLPL. When LLMs generate code that lies outside of this intermediate language, we use compiler techniques to repair the code into programs in the intermediate language. Overall, we introduce synthetic programming elicitation and compilation (SPEAC), an approach that enables LLMs to generate syntactically valid code even for VLPLs. We empirically evaluate the performance of SPEAC in a case study and find that, compared to existing retrieval and fine-tuning baselines, SPEAC produces syntactically correct programs significantly more frequently without sacrificing semantic correctness.



LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems in various applications such as multi-robot control and self-driving cars. Unlike supervised model or single-agent reinforcement learning, which actively exploits network pruning, it is unclear how pruning will work in multi-agent reinforcement learning with its cooperative and interactive characteristics. [...] which are 7.13× higher and 12.43× more energy efficient [...] Most importantly, the accelerator shows speedup up to 12.52× for [...]

[...] learning, known for solving long-term decision-making problems effectively. It aims to train the action policy, which is about how an agent should take actions based on the feedback from the given environment to maximize cumulative rewards. Recently, deep reinforcement learning (DRL) that utilizes a deep neural network (DNN) as an action policy has been proposed [1]-[4]. Although DRL stands out in various domains such as industrial control and robotics [5]-[7], all of them are limited to a single agent. Other significant applications have started to employ interaction between multiple agents, for instance, analysis of language communication and the network [...] Hence, extending DRL to have many agents is critical for developing intelligent systems where agents can interact with each other or even with people.

MARL requires up to 942.9 GFLOPS for effective real-time [...] In addition, as the MARL system is [...] Current CPU and GPU-based systems cannot meet the above requirements due to the lack of computing units, high power consumption or low utilization for small batch sizes. Instead, FPGA is emerging as a new solution for [...] For example, the Xilinx U280 acceleration card provides robust computing potential through 9,024 DSPs over 41MB of on-chip BRAM while showing less power consumption than GPU. In addition, the reconfigurability of FPGA allows the optimization of irregular data access and parallelism with customized compact data format, where these hardware overhead occurs in network pruning to handle computation-bound applications. In this paper, we propose an FPGA-based acceleration system named LearningGroup, to yield high performance for [...]


Rethink Decision Tree Traversal

arXiv.org Artificial Intelligence

QuickScorer [12] and RapidScorer [21] were proposed to speed up additive ensembles of regression trees in learning to rank, based on bit-vectors of the false nodes. Inspired by [12], further works such as [2; 11; 13; 15] focus on the application and acceleration of additive tree models, whereas we pay attention to the theory of algorithms, especially the representation of binary decision trees in the language of matrix computation. Based on the so-called Tree Supervision Loss, a hierarchical classifier is built from the weights of the softmax layer in convolutional neural networks in [18]. In [20; 19], tree regularization is used to enhance the interpretability of deep neural networks. A generalized tree representation termed TART, based on a transition matrix, is shown in [22].
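As a rough illustration of what "bit-vectors of the false nodes" means, the sketch below scores a single hand-built tree in the QuickScorer style: every internal node carries a mask that clears the leaves of its left subtree, the masks of all nodes whose test is false are ANDed together, and the exit leaf is the leftmost surviving leaf. The tree, thresholds, leaf values, and the name `score` are hypothetical; the published algorithms additionally organize nodes by feature and threshold across a whole ensemble, which is omitted here.

```python
# A single-tree sketch of bitvector traversal. Tree shape, thresholds, and
# leaf values are made up for illustration.
#
# Each internal node: (feature index, threshold, mask applied when the
# test x[feature] <= threshold is FALSE). Bit i of a mask corresponds to
# leaf i, counted left to right (bit 0 is the leftmost leaf).
NODES = [
    (0, 0.5, 0b1100),  # root: if false, leaves 0 and 1 become unreachable
    (1, 2.0, 0b1110),  # left child: if false, leaf 0 becomes unreachable
    (0, 1.5, 0b1011),  # right child: if false, leaf 2 becomes unreachable
]
LEAF_VALUES = [10.0, 20.0, 30.0, 40.0]

def score(x):
    """Return the value of the exit leaf for feature vector x."""
    alive = (1 << len(LEAF_VALUES)) - 1  # start with every leaf reachable
    for feature, threshold, mask in NODES:
        if not (x[feature] <= threshold):
            alive &= mask  # false node: drop the leaves of its left subtree
    # The exit leaf is the leftmost surviving leaf, i.e. the lowest set bit.
    leaf = (alive & -alive).bit_length() - 1
    return LEAF_VALUES[leaf]
```

Note that the nodes can be visited in any order, since the AND operations commute. This is what lets the traversal avoid following the root-to-leaf path.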


Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Neural Information Processing Systems

Deep distributed decision trees and tree ensembles have grown in importance due to the need to model increasingly large datasets. However, PLANET, the standard distributed tree learning algorithm implemented in systems such as XGBoost and Spark MLlib, scales poorly as data dimensionality and tree depths grow. We present Yggdrasil, a new distributed tree learning method that outperforms existing methods by up to 24x. Unlike PLANET, Yggdrasil is based on vertical partitioning of the data (i.e., partitioning by feature), along with a set of optimized data structures to reduce the CPU and communication costs of training. Yggdrasil (1) trains directly on compressed data for compressible features and labels; (2) introduces efficient data structures for training on uncompressed data; and (3) minimizes communication between nodes by using sparse bitvectors. Moreover, while PLANET approximates split points through feature binning, Yggdrasil does not require binning, and we analytically characterize the impact of this approximation. We evaluate Yggdrasil against the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, Yggdrasil is faster by up to an order of magnitude.


GPU Exploration of Two-Player Games with Perfect Hash Functions

AAAI Conferences

In this paper we improve solving two-player games by computing the game-theoretical value of every reachable state. A graphics processing unit located on the graphics card is used as a co-processor to accelerate the solution process. We exploit perfect hash functions to store the game states efficiently in memory and to transfer their ordinal representation between the host and the graphics card. As an application we validate Gasser's results that Nine-Men-Morris is a draw on a personal computer. Moreover, our solution is strong, while for the opening phase Gasser only provided a weak solution.


Perfect Hashing for State Space Exploration on the GPU

AAAI Conferences

This paper exploits the parallel computing power of graphics cards to accelerate state space search. We illustrate that modern graphics processing units (GPUs) have the potential to speed up breadth-first search significantly. For a bitvector representation of the search frontier, GPU algorithms with one and two bits per state are presented. Efficient perfect hash functions and their inverse are explored in order to achieve enhanced compression. We report maximal speed-ups of up to a factor of 27 with respect to single-core CPU computation.
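To make the combination of perfect hashing and bitvector frontiers concrete, here is a small CPU-only sketch. It assumes a toy state space (permutations of n items, with adjacent swaps as moves) and uses the lexicographic rank as the perfect hash and its unranking as the inverse. It keeps separate visited and frontier bitvectors for simplicity, so it is not the one-bit or two-bit GPU algorithm from the paper; the names `rank`, `unrank`, and `bfs_layer_sizes` are illustrative.

```python
from math import factorial

def rank(perm):
    """Perfect hash: lexicographic rank of a permutation of 0..n-1."""
    n = len(perm)
    r = 0
    for i in range(n):
        smaller = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        r += smaller * factorial(n - 1 - i)
    return r

def unrank(r, n):
    """Inverse of rank: rebuild the permutation from its index."""
    items = list(range(n))
    perm = []
    for i in range(n):
        idx, r = divmod(r, factorial(n - 1 - i))
        perm.append(items.pop(idx))
    return perm

def get_bit(bits, i):
    return (bits[i >> 3] >> (i & 7)) & 1

def set_bit(bits, i):
    bits[i >> 3] |= 1 << (i & 7)

def bfs_layer_sizes(n):
    """BFS from the identity permutation; returns the size of each layer."""
    total = factorial(n)
    nbytes = (total + 7) // 8
    visited = bytearray(nbytes)    # one bit per state: seen so far
    frontier = bytearray(nbytes)   # one bit per state: current layer
    start = rank(list(range(n)))
    set_bit(visited, start)
    set_bit(frontier, start)
    sizes = [1]
    while True:
        next_frontier = bytearray(nbytes)
        found = 0
        for idx in range(total):           # scan the frontier bitvector
            if not get_bit(frontier, idx):
                continue
            state = unrank(idx, n)         # index -> explicit state
            for k in range(n - 1):         # moves: swap neighbours k, k+1
                state[k], state[k + 1] = state[k + 1], state[k]
                child = rank(state)        # explicit state -> index
                state[k], state[k + 1] = state[k + 1], state[k]
                if not get_bit(visited, child):
                    set_bit(visited, child)
                    set_bit(next_frontier, child)
                    found += 1
        if found == 0:
            return sizes
        sizes.append(found)
        frontier = next_frontier
```

Because states are never stored explicitly, memory use is a fixed number of bits per state regardless of how large a single state description is; the cost is one unrank and several rank calls per expanded state.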