memory transfer
Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices
Daghero, Francesco, Burrello, Alessio, Poncino, Massimo, Macii, Enrico, Pagliari, Daniele Jahier
Depthwise separable convolutions are a fundamental component in efficient Deep Neural Networks, as they reduce the number of parameters and operations compared to traditional convolutions while maintaining comparable accuracy. However, their low data reuse opportunities make deploying them notoriously difficult. In this work, we perform an extensive exploration of alternatives to fuse the depthwise and pointwise kernels that constitute the separable convolutional block. Our approach aims to minimize time-consuming memory transfers by combining different data layouts. When targeting a commercial ultra-low-power device with a three-level memory hierarchy, the GreenWaves GAP8 SoC, we reduce the latency of end-to-end network execution by up to 11.40%. Furthermore, our kernels reduce activation data movements between L2 and L1 memories by up to 52.97%.
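The parameter and operation savings mentioned in this abstract can be checked with simple arithmetic. The sketch below is an illustration only, not code from the paper: the function names and the example layer shape are hypothetical, biases are ignored, and operations are counted as multiply-accumulates (MACs).

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a standard k x k convolution (biases ignored)."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # every weight is used once per output pixel
    return params, macs


def depthwise_separable_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise_params = k * k * c_in   # one k x k filter per input channel
    pointwise_params = c_in * c_out   # 1 x 1 conv that mixes channels
    params = depthwise_params + pointwise_params
    macs = params * h_out * w_out
    return params, macs


# Hypothetical layer: 32 -> 64 channels, 3 x 3 kernel, 16 x 16 output map.
std_params, std_macs = standard_conv_cost(32, 64, 3, 16, 16)
sep_params, sep_macs = depthwise_separable_cost(32, 64, 3, 16, 16)
print(std_params, sep_params)            # 18432 2336
print(round(std_params / sep_params, 2)) # 7.89
```

The reduction factor equals 1 / (1/c_out + 1/k^2), so it approaches k^2 as the number of output channels grows. The same arithmetic hints at the low data reuse the abstract refers to: in the depthwise stage each weight touches only one input channel.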
Optimized Deployment of Deep Neural Networks for Visual Pose Estimation on Nano-drones
Risso, Matteo, Daghero, Francesco, Motetti, Beatrice Alessandra, Pagliari, Daniele Jahier, Macii, Enrico, Poncino, Massimo, Burrello, Alessio
Miniaturized autonomous unmanned aerial vehicles (UAVs) are gaining popularity due to their small size, enabling new tasks such as indoor navigation or people monitoring. Nonetheless, their size and simple electronics pose severe challenges in implementing advanced onboard intelligence. This work proposes a new automatic optimization pipeline for visual pose estimation tasks using Deep Neural Networks (DNNs). The pipeline leverages two different Neural Architecture Search (NAS) algorithms to pursue a vast complexity-driven exploration in the DNNs' architectural space. The obtained networks are then deployed on an off-the-shelf nano-drone equipped with a parallel ultra-low-power System-on-Chip, leveraging a set of novel software kernels for the efficient fused execution of critical DNN layer sequences. Our results improve on the state of the art, reducing inference latency by up to 3.22x at iso-error.
SparQ Attention: Bandwidth-Efficient LLM Inference
Ribar, Luka, Chelombiev, Ivan, Hudlass-Galley, Luke, Blake, Charlie, Luschi, Carlo, Orr, Douglas
Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements by up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
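To make "selective fetching of the cached history" concrete, here is a simplified pure-Python sketch of the general idea: estimate attention scores from a few query components, then read full keys and values only for the most promising positions. It is an assumption-laden illustration, not the SparQ Attention algorithm as published; the function name `selective_attention` and the parameters `r` (query components used for the estimate) and `k` (positions fetched) are hypothetical.

```python
import math


def selective_attention(q, keys, values, r, k):
    """Single-query attention that reads full keys/values for only k cached positions."""
    # Step 1: pick the r query components with the largest magnitude.
    top_dims = sorted(range(len(q)), key=lambda i: abs(q[i]), reverse=True)[:r]

    # Step 2: approximate each score using only those components of the cached keys.
    approx = [sum(q[i] * key[i] for i in top_dims) for key in keys]

    # Step 3: keep the k positions with the highest approximate score.
    top_pos = sorted(range(len(keys)), key=lambda j: approx[j], reverse=True)[:k]
    top_pos.sort()

    # Step 4: fetch full keys and values for those positions and run exact softmax attention.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in top_pos]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    out = [sum(w * values[j][d] for w, j in zip(weights, top_pos)) / total
           for d in range(dim_v)]
    return out, top_pos


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, fetched = selective_attention(q, keys, values, r=1, k=1)
print(fetched)  # [0]
```

With `r` equal to the head dimension and `k` equal to the sequence length, the sketch reduces to ordinary dense attention. The bandwidth saving comes from reading `r` components per key in step 2 and full vectors for only `k` positions in step 4.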
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
Wan, Zhongwei, Yin, Yichun, Zhang, Wei, Shi, Jiaxin, Shang, Lifeng, Chen, Guangyong, Jiang, Xin, Liu, Qun
Recently, domain-specific PLMs have been proposed to boost the task performance of specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs with domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the general knowledge previously acquired by general PLMs, which leads to catastrophic forgetting and sub-optimal performance. To alleviate this problem, we propose a new framework, the General Memory Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM with a memory representation built from the frozen general PLM without losing any general knowledge. Specifically, we propose a new memory-augmented layer and, based on it, explore different augmentation strategies to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and different kinds of tasks (text classification, QA, NER), and the extensive results show that the proposed G-MAP can achieve SOTA results on all tasks.
Solving Machine Learning Performance Anti-Patterns: a Systematic Approach
These principles are in rough order of priority, and like all guidelines there are times they should be broken. Next we'll take a tour through some major patterns of suboptimal performance -- many of which map directly to violations of these principles. Machine learning systems show distinct patterns of resource consumption, and each of these patterns requires a different approach to improving performance. Real-world systems usually exhibit several different patterns in different parts of the inference pipeline, so we'll often need to apply several of the approaches below. For example, post-processing logic is highly prone to being CPU compute bound or synchronization bound, whereas the backbone of a vision model is often GPU compute bound.
Object Detection from 9 FPS to 650 FPS in 6 Steps
Making code run fast on GPUs requires a very different approach from making code run fast on CPUs because the hardware architecture is fundamentally different. If you come from a background of efficient coding on CPUs, you'll have to adjust some assumptions about which patterns are best. Machine learning engineers of all kinds should care about squeezing performance from their models and hardware -- not just for production purposes, but also for research and training. In research as in development, a fast iteration loop leads to faster improvement. This article is a practical deep dive into making a specific deep learning model (Nvidia's SSD300) run fast on a powerful GPU server, but the general principles apply to all GPU programming.
Generative Memory for Lifelong Reinforcement Learning
Raghavan, Aswin, Hostetler, Jesse, Chai, Sek
Our research is focused on understanding and applying biological memory transfers to new AI systems that can fundamentally improve their performance, throughout their fielded lifetime experience. We leverage current understanding of biological memory transfer to arrive at AI algorithms for memory consolidation and replay. In this paper, we propose the use of generative memory that can be recalled in batch samples to train a multi-task agent in a pseudo-rehearsal manner. We show results motivating the need for task-agnostic separation of latent space for the generative memory to address issues of catastrophic forgetting in lifelong learning.
Scientists sucked a memory out of a snail and stuck it in another snail
Image: Aplysia californica, also known as the California sea hare. Credit: Genny Anderson/CC by 4.0
A new study strongly suggests that at least some memories are stored in genetic code, and that genetic code can act like memory soup. Suck it out of one animal and stick the code in a second animal, and that second animal can remember things that only the first animal knew. That might sound like science fiction or remind some readers of debunked ideas from decades past. But it's serious science: In a new study, researchers at the University of California, Los Angeles (UCLA) extracted RNA, a genetic messenger molecule, from one snail and implanted it in another snail. In both experiments, the recipient -- either the snail or the petri-neurons -- remembered something the donor snail had experienced.