Blott, Michaela
Improving Quantization with Post-Training Model Expansion
Franco, Giuseppe, Monteagudo-Lago, Pablo, Colbert, Ian, Fraser, Nicholas, Blott, Michaela
The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on shrinking the overall volume of pre-trained models to lower inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to recover quality lost when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher-precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate that post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations of Llama3 1B to 4 bits, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which still amounts to a 3.8% reduction in volume relative to a BF16 reference model.
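To make the incoherence-processing idea above concrete, here is a minimal NumPy sketch of an orthonormal Hadamard rotation fused around a naive 4-bit weight quantizer. The function names, the per-tensor quantizer, and the dimensions are illustrative assumptions, not this paper's or QuaRot's implementation; the point is only that rotating weights and activations with the same orthonormal matrix preserves the product exactly while spreading outliers before quantization.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester's construction
    (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # rows are orthonormal: H @ H.T == I

def quantize(x, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

d = 64
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
W[:, 3] *= 25.0                           # inject an outlier channel
x = rng.normal(size=d)

H = hadamard(d)
y_rotated = quantize(W @ H.T) @ (H @ x)   # (W @ H.T) @ (H @ x) == W @ x exactly
y_naive = quantize(W) @ x                 # quantize directly, outlier dominates scale
y_ref = W @ x
print(np.linalg.norm(y_naive - y_ref), np.linalg.norm(y_rotated - y_ref))
```

Because the rotation mixes the outlier column across all columns of W, the quantization scale shrinks and the rotated path shows a markedly smaller output error than the naive path.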
LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics
Que, Zhiqiang, Fan, Hongxiang, Loo, Marcus, Li, He, Blott, Michaela, Pierini, Maurizio, Tapper, Alexander, Luk, Wayne
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low-latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge, since online event selection in the Level-1 triggers at the CERN Large Hadron Collider experiments requires sub-microsecond latency at data rates of hundreds of terabytes per second. This paper proposes a novel outer-product-based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds designs with much better latency but also finds high-accuracy designs under given latency constraints. To facilitate this, a customizable template for this low-latency GNN hardware architecture has been designed and open-sourced, enabling the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and up to 13.1 times more power-efficient than a GPU implementation. Compared to previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, allowing the trigger to benefit from their improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
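The outer-product formulation mentioned above can be sketched in a few lines: C = A x B is accumulated as a sum of rank-1 updates, one column of A (and row of B) at a time, so all-zero columns of a structured adjacency matrix can be skipped entirely. This NumPy sketch illustrates the general technique under assumed dense inputs; it is not the paper's HLS template.

```python
import numpy as np

def outer_product_matmul(A, B):
    """Compute C = A @ B as a sum of rank-1 outer products.

    Each step consumes one column of A and one row of B, which maps
    naturally to a column-major streaming layout; columns of A that are
    all zero (common in structured adjacency matrices) are skipped.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(k):
        col = A[:, i]
        if not col.any():              # exploit structure: skip zero columns
            continue
        C += np.outer(col, B[i, :])    # rank-1 update
    return C

# sanity check against the inner-product formulation
A = np.random.default_rng(1).normal(size=(8, 8))
A[:, [2, 5]] = 0.0                     # structured sparsity in the adjacency
B = np.random.default_rng(2).normal(size=(8, 4))
assert np.allclose(outer_product_matmul(A, B), A @ B)
```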
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
He, Zhenhao, Korolija, Dario, Zhu, Yu, Ramhorst, Benjamin, Laan, Tristan, Petrica, Lucian, Blott, Michaela, Alonso, Gustavo
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
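For intuition about the kind of collective such an engine offloads, below is a minimal simulation of a ring all-reduce in NumPy. This is emphatically not the ACCL+ API; the ranks, chunking, and two phases (reduce-scatter, then all-gather) follow the standard textbook formulation of the collective, given here only to illustrate the communication pattern.

```python
import numpy as np

def ring_allreduce(data):
    """Simulate a ring all-reduce: data[r] is rank r's vector, split into
    p chunks. Returns the element-wise sum, as every rank would see it.
    A didactic model of the pattern a collective engine offloads, not an
    ACCL+ call."""
    p = len(data)
    chunks = [np.array_split(d.astype(float), p) for d in data]

    # Reduce-scatter: after p-1 steps, rank r fully owns chunk (r+1) % p.
    for step in range(p - 1):
        sends = [(r, (r - step) % p, chunks[r][(r - step) % p].copy())
                 for r in range(p)]
        for r, c, payload in sends:
            chunks[(r + 1) % p][c] += payload

    # All-gather: circulate the reduced chunks so every rank holds all of them.
    for step in range(p - 1):
        sends = [(r, (r + 1 - step) % p, chunks[r][(r + 1 - step) % p].copy())
                 for r in range(p)]
        for r, c, payload in sends:
            chunks[(r + 1) % p][c] = payload

    return np.concatenate(chunks[0])

vecs = [np.random.default_rng(i).normal(size=8) for i in range(4)]
assert np.allclose(ring_allreduce(vecs), np.sum(vecs, axis=0))
```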
Post-Training Quantization with Low-precision Minifloats and Integers on FPGAs
Aggarwal, Shivam, Pappalardo, Alessandro, Damsgaard, Hans Jakob, Franco, Giuseppe, Preußer, Thomas B., Blott, Michaela, Mitra, Tulika
Post-Training Quantization (PTQ) is a powerful technique for model compression, reducing the precision of neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point quantization (FP8) in the context of PTQ for model inference. However, the exploration of floating-point formats smaller than 8 bits, and their comparison with integer quantization, remains relatively limited. In this work, we present minifloats, reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We conduct a novel PTQ design-space exploration, comparing minifloat and integer quantization schemes across a range of 3 to 8 bits for both weights and activations. We examine the applicability of various PTQ techniques to minifloats, including weight equalization, bias correction, SmoothQuant, gradient-based learned rounding, and the GPTQ method. Our experiments validate the effectiveness of low-precision minifloats when compared to their integer counterparts across a spectrum of accuracy-precision trade-offs on a set of reference deep learning vision workloads. Finally, we evaluate our results against an FPGA-based hardware cost model, showing that integer quantization often remains the Pareto-optimal option, given its relatively smaller hardware resource footprint.
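To illustrate what a minifloat looks like numerically, here is a hedged NumPy sketch that enumerates the value grid of a toy sign + exponent + mantissa format and rounds a tensor to it. The subnormal handling, IEEE-style bias default, and absence of infinities/NaN are assumptions made for illustration, not necessarily the exact formats evaluated in the paper.

```python
import numpy as np

def minifloat_grid(exp_bits, man_bits, bias=None):
    """Enumerate all values of a toy signed minifloat with exp_bits exponent
    and man_bits mantissa bits. Exponent code 0 encodes subnormals; no
    codes are reserved for infinities or NaN (conventions vary)."""
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1       # IEEE-style default bias
    values = [0.0]
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:                       # subnormal: no implicit leading 1
                values.append((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:                            # normal: implicit leading 1
                values.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    grid = np.array(values)
    return np.unique(np.concatenate([-grid, grid]))

def round_to_grid(x, grid):
    """Round every element of x to the nearest representable grid value."""
    idx = np.searchsorted(grid, x).clip(1, len(grid) - 1)
    left, right = grid[idx - 1], grid[idx]
    return np.where(x - left < right - x, left, right)

fp4_e2m1 = minifloat_grid(exp_bits=2, man_bits=1)  # {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
x = np.random.default_rng(0).normal(size=1000)
print(np.abs(x - round_to_grid(x, fp4_e2m1)).mean())
```

Note how the grid is non-uniform: values cluster near zero and thin out toward the dynamic-range limits, which is precisely the trade-off against uniform integer grids that the design-space exploration quantifies.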
Implementing Neural Network-Based Equalizers in a Coherent Optical Transmission System Using Field-Programmable Gate Arrays
Freire, Pedro J., Srivallapanondh, Sasipim, Anderson, Michael, Spinnler, Bernhard, Bex, Thomas, Eriksson, Tobias A., Napoli, Antonio, Schairer, Wolfgang, Costa, Nelson, Blott, Michaela, Turitsyn, Sergei K., Prilepsky, Jaroslaw E.
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance, in terms of Q-factor, is presented for a bidirectional long short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, a CNN equalizer, and standard 1-StpS digital back-propagation (DBP), for both simulated and experimental propagation of a single-channel dual-polarization (SC-DP) 16QAM signal at 34 GBd over 17x70 km of LEAF fiber. The biLSTM+CNN equalizer provides a result similar to DBP and a 1.7 dB Q-factor gain over the chromatic dispersion compensation baseline on the experimental dataset. After that, we assess the Q-factor and the impact on hardware utilization when approximating the NN activation functions using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the hardware complexity of achieving 200G and 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented on an FPGA.
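As a concrete example of the LUT-based activation approximation discussed above, the sketch below replaces tanh with a nearest-entry look-up table, the cheapest of the three approximation families named in the abstract. The table size, input range, and clamping policy are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def make_tanh_lut(n_entries=64, x_max=4.0):
    """Build a uniform look-up table for tanh over [-x_max, x_max]."""
    xs = np.linspace(-x_max, x_max, n_entries)
    return xs, np.tanh(xs)

def lut_tanh(x, xs, ys):
    """Nearest-entry LUT lookup (no interpolation), mirroring a small
    fixed-point hardware table; inputs outside the range are clamped."""
    x = np.clip(x, xs[0], xs[-1])
    idx = np.round((x - xs[0]) / (xs[1] - xs[0])).astype(int)
    return ys[idx]

xs, ys = make_tanh_lut()
x = np.linspace(-6, 6, 1001)
err = np.abs(lut_tanh(x, xs, ys) - np.tanh(x))
print(f"max abs error with 64 entries: {err.max():.4f}")
```

The staircase produced by nearest-entry lookup is also what makes gradients ill-behaved if such a table is used during training, which is the gradient problem the abstract alludes to; linear interpolation between entries roughly halves the error at modest extra hardware cost.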
QONNX: Representing Arbitrary-Precision Quantized Neural Networks
Pappalardo, Alessandro, Umuroglu, Yaman, Blott, Michaela, Mitrevski, Jovan, Hawks, Ben, Tran, Nhan, Loncar, Vladimir, Summers, Sioni, Borras, Hendrik, Muhizi, Jules, Trahms, Matthew, Hsu, Shih-Chieh, Hauck, Scott, Duarte, Javier
We present extensions to the Open Neural Network Exchange (ONNX) intermediate representation format to represent arbitrary-precision quantized neural networks. We first introduce support for low-precision quantization in existing ONNX-based quantization formats by leveraging integer clipping, resulting in two new backward-compatible variants: the quantized operator format with clipping and the quantize-clip-dequantize (QCDQ) format. We then introduce a novel higher-level ONNX format called quantized ONNX (QONNX), which introduces three new operators -- Quant, BipolarQuant, and Trunc -- to represent uniform quantization. By keeping the QONNX IR high-level and flexible, we enable targeting a wider variety of platforms. We also present utilities for working with QONNX, as well as examples of its usage in the FINN and hls4ml toolchains. Finally, we introduce the QONNX model zoo to share low-precision quantized neural networks.
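For readers unfamiliar with the Quant operator's role, the following is a simplified, non-normative Python sketch of QONNX-style quantize-dequantize semantics at an arbitrary bitwidth. The parameter names and defaults here are assumptions for illustration; the QONNX specification is authoritative.

```python
import numpy as np

def quant(x, scale, zero_point, bitwidth, signed=True, narrow=False):
    """Simplified reference semantics of a QONNX-style Quant op: quantize
    to an arbitrary integer bitwidth, then dequantize back to float so the
    graph remains executable at high level. np.round gives round-half-to-even."""
    if signed:
        qmin = -(2 ** (bitwidth - 1)) + (1 if narrow else 0)
        qmax = 2 ** (bitwidth - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bitwidth - 1
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale  # dequantized output

# 3-bit signed quantization with a per-tensor scale
x = np.linspace(-1.2, 1.2, 7)
print(quant(x, scale=0.25, zero_point=0.0, bitwidth=3))
```

Parameterizing bitwidth as a tensor input rather than baking it into the operator type is what lets a single high-level op cover the arbitrary precisions that FINN and hls4ml consume.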
Applications and Techniques for Fast Machine Learning in Science
Deiana, Allison McCarn, Tran, Nhan, Agar, Joshua, Blott, Michaela, Di Guglielmo, Giuseppe, Duarte, Javier, Harris, Philip, Hauck, Scott, Liu, Mia, Neubauer, Mark S., Ngadiuba, Jennifer, Ogrenci-Memik, Seda, Pierini, Maurizio, Aarrestad, Thea, Bahr, Steffen, Becker, Jurgen, Berthold, Anne-Sophie, Bonventre, Richard J., Bravo, Tomas E. Muller, Diefenthaler, Markus, Dong, Zhen, Fritzsche, Nick, Gholami, Amir, Govorkova, Ekaterina, Hazelwood, Kyle J, Herwig, Christian, Khan, Babar, Kim, Sehoon, Klijnsma, Thomas, Liu, Yaling, Lo, Kin Ho, Nguyen, Tri, Pezzullo, Gianantonio, Rasoulinezhad, Seyedramin, Rivera, Ryan A., Scholberg, Kate, Selig, Justin, Sen, Sougata, Strukov, Dmitri, Tang, William, Thais, Savannah, Unger, Kai Lukas, Vilalta, Ricardo, Krosigk, Belinavon, Warburton, Thomas K., Flechas, Maria Acosta, Aportela, Anthony, Calvet, Thomas, Cristella, Leonardo, Diaz, Daniel, Doglioni, Caterina, Galati, Maria Domenica, Khoda, Elham E, Fahim, Farah, Giri, Davide, Hawks, Benjamin, Hoang, Duc, Holzman, Burt, Hsu, Shih-Chieh, Jindariani, Sergo, Johnson, Iris, Kansal, Raghav, Kastner, Ryan, Katsavounidis, Erik, Krupa, Jeffrey, Li, Pan, Madireddy, Sandeep, Marx, Ethan, McCormack, Patrick, Meza, Andres, Mitrevski, Jovan, Mohammed, Mohammed Attia, Mokhtar, Farouk, Moreno, Eric, Nagu, Srishti, Narayan, Rohin, Palladino, Noah, Que, Zhiqiang, Park, Sang Eon, Ramamoorthy, Subramanian, Rankin, Dylan, Rothman, Simon, Sharma, Ashish, Summers, Sioni, Vischia, Pietro, Vlimant, Jean-Roch, Weng, Olivia
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to provide ample examples of and inspiration for scientific discovery through integrated and accelerated ML solutions, followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material that can enable these breakthroughs.