field-programmable gate array
Democratizing Domain-Specific Computing
While architecture-guided optimizations and automated program transformation make it much easier to achieve high-performance DSA designs from C/C programs, the software community has introduced various DSLs for better design productivity in certain application domains. One good example is Halide,32 a widely used image-processing DSL, which has the advantageous property of decoupling the algorithm specification from performance optimization (via scheduling statements). This is very useful for image-processing applications; writing image-processing algorithms while parallelizing execution and optimizing for data locality and performance is difficult and time-consuming due to the large number of processing stages and the complex data dependency. However, the plain version of Halide only supports CPUs and GPUs. There is no way to easily synthesize the vast number of Halide programs to DSAs on FPGAs.
The Different Types Of Hardware AI Accelerators
An AI accelerator is a kind of specialised hardware accelerator or computer system created to accelerate artificial intelligence apps, particularly artificial neural networks, machine learning, robotics, and other data-intensive or sensor-driven tasks. They usually have novel designs and typically focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. As deep learning and artificial intelligence workloads grew in prominence in the last decade, specialised hardware units were designed or adapted from existing products to accelerate these tasks, and to have parallel high-throughput systems for workstations targeted at various applications, including neural network simulations. As of 2018, a typical AI integrated circuit chip contains billions of MOSFET transistors. Hardware acceleration has many advantages, the main being speed. Accelerators can greatly decrease the amount of time it takes to train and execute an AI model, and can also be used to execute special AI-based tasks that cannot be conducted on a CPU.
Flexible "Brain" for AI Cuts Energy Use by 80%
Scientists at Osaka University built a new computing device from field-programmable gate arrays (FPGA) that can be customized by the user for maximum efficiency in artificial intelligence applications. Compared with currently used rewireable hardware, the system increases circuit density by a factor of 12. Also, it is expected to reduce energy usage by 80%. This advance may lead to flexible artificial intelligence (AI) solutions that provide enhanced performance while consuming much less electricity. AI is becoming a part of everyday life for almost all consumers. However, implementing these algorithms often require a large amount of computing power, which means large electricity bills, as well as big carbon footprints.
FPGA vs GPU for Machine Learning Applications: Which one is better? - Blog - Company - Aldec
FPGAs or GPUs, that is the question. Since the popularity of using machine learning algorithms to extract and process the information from raw data, it has been a race between FPGA and GPU vendors to offer a HW platform that runs computationally intensive machine learning algorithms fast and efficiently. As Deep Learning has driven most of the advanced machine learning applications, it is regarded as the main comparison point. Even though GPU vendors have aggressively positioned their hardware as the most efficient platform for this new era, FPGAs have shown a great improvement in both power consumption and performance in Deep Neural Networks (DNNs) applications, which offer high accuracies for important image classification tasks and are therefore becoming widely adopted [1]. As there are various tradeoffs to consider, it is hard to answer with just a "Yes" or "No".
TinyCNN: A Tiny Modular CNN Accelerator for Embedded FPGA
In recent years, Convolutional Neural Network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computational-intensive and resource-consuming, and thus are hard to be integrated into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited on-chip memory size limit the performance of FPGA accelerator for CNN. In this paper, we propose a framework for designing CNN accelerator on embedded FPGA for image classification. The proposed framework provides a tool for FPGA resource-aware design space exploration of CNNs and automatically generates the hardware description of the CNN to be programmed on a target FPGA. The framework consists of three main backends; software, hardware generation, and simulation/precision adjustment. The software backend serves as an API to the designer to design the CNN and train it according to the hardware resources that are available. Using the CNN model, hardware backend generates the necessary hardware components and integrates them to generate the hardware description of the CNN. Finaly, Simulation/precision adjustment backend adjusts the inter-layer precision units to minimize the classification error. We used 16-bit fixed-point data in a CNN accelerator (FPGA) and compared it to the exactly similar software version running on an ARM processor (32-bit floating point data). We encounter about 3% accuracy loss in classification of the accelerated (FPGA) version. In return, we got up to 15.75x speedup by classifying with the accelerated version on the FPGA.
Non-structured DNN Weight Pruning Considered Harmful
Wang, Yanzhi, Ye, Shaokai, He, Zhezhi, Ma, Xiaolong, Zhang, Linfeng, Lin, Sheng, Yuan, Geng, Tan, Sia Huat, Li, Zhengang, Fan, Deliang, Qian, Xuehai, Lin, Xue, Ma, Kaisheng
Large deep neural network (DNN) models pose the key challenge to energy efficiency due to the significantly higher energy consumption of off-chip DRAM accesses than arithmetic or SRAM operations. It motivates the intensive research on model compression with two main approaches. Weight pruning leverages the redundancy in the number of weights and can be performed in a non-structured, which has higher flexibility and pruning rate but incurs index accesses due to irregular weights, or structured manner, which preserves the full matrix structure with lower pruning rate. Weight quantization leverages the redundancy in the number of bits in weights. Compared to pruning, quantization is much more hardware-friendly, and has become a "must-do" step for FPGA and ASIC implementations. This paper provides a definitive answer to the question for the first time. First, we build ADMM-NN-S by extending and enhancing ADMM-NN, a recently proposed joint weight pruning and quantization framework. Second, we develop a methodology for fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency. Our results show that ADMM-NN-S consistently outperforms the prior art: (i) it achieves 348x, 36x, and 8x overall weight pruning on LeNet-5, AlexNet, and ResNet-50, respectively, with (almost) zero accuracy loss; (ii) we demonstrate the first fully binarized (for all layers) DNNs can be lossless in accuracy in many cases. These results provide a strong baseline and credibility of our study. Based on the proposed comparison framework, with the same accuracy and quantization, the results show that non-structrued pruning is not competitive in terms of both storage and computation efficiency. Thus, we conclude that non-structured pruning is considered harmful. We urge the community not to continue the DNN inference acceleration for non-structured sparsity.