TPU

TinyCenterSpeed: Efficient Center-Based Object Detection for Autonomous Racing

Reichlin, Neil, Baumann, Nicolas, Ghignone, Edoardo, Magno, Michele

arXiv.org Artificial Intelligence

Perception within autonomous driving is nearly synonymous with Neural Networks (NNs). Yet, the domain of autonomous racing is often characterized by scaled, computationally limited robots used for cost-effectiveness and safety. For this reason, opponent detection and tracking systems typically resort to traditional computer vision techniques due to computational constraints. This paper introduces TinyCenterSpeed, a streamlined adaptation of the seminal CenterPoint method, optimized for real-time performance on 1:10 scale autonomous racing platforms. This adaptation is viable even on On-Board Computers (OBCs) powered solely by Central Processing Units (CPUs), as it incorporates the use of an external Tensor Processing Unit (TPU). We demonstrate that, compared to the Adaptive Breakpoint Detector (ABD), the current State-of-the-Art (SotA) in scaled autonomous racing, TinyCenterSpeed not only improves detection and velocity estimation by up to 61.38% but also supports multi-opponent detection and estimation. It achieves real-time performance with an inference time of just 7.88 ms on the TPU, while reducing CPU utilization 8.3-fold.
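The core idea behind center-based detectors such as CenterPoint is to predict a heatmap whose local maxima mark object centers, replacing box-based non-maximum suppression with a simple 3x3 peak check. A minimal NumPy sketch of that peak-extraction step (synthetic data; not the paper's actual code):

```python
import numpy as np

def extract_centers(heatmap, threshold=0.5):
    """Return (row, col) peaks that are local maxima above threshold.

    A peak is a cell >= all 8 neighbours (3x3 max-pooling "NMS"),
    the trick center-based detectors use instead of box-based NMS.
    """
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    # Stack the nine 3x3-shifted views and take the max per cell.
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    peaks = (heatmap >= neigh.max(axis=0)) & (heatmap > threshold)
    return [(int(r), int(c)) for r, c in np.argwhere(peaks)]

# Two synthetic "opponents" on a 32x32 bird's-eye-view grid.
hm = np.zeros((32, 32))
hm[8, 8] = 0.9
hm[20, 25] = 0.8
print(extract_centers(hm))  # two peaks: (8, 8) and (20, 25)
```

In a full detector the network would also regress a velocity vector at each peak, which is how TinyCenterSpeed obtains opponent speed estimates alongside positions.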


Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Elbtity, Mohammed, Chandarana, Peyton, Zand, Ramtin

arXiv.org Artificial Intelligence

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators, utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, such as graphics processing units (GPUs), as they are designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to a conventional TPU, with only minor area and power overheads.
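The "stationary" terminology above refers to which operand stays pinned inside each processing element (PE) while the others stream through. A small illustrative sketch of two of those loop orders for a matrix multiply (plain NumPy, not a hardware model):

```python
import numpy as np

def matmul_weight_stationary(A, W):
    """Weight-stationary: each weight W[k, j] stays fixed in its PE
    while the inputs A[:, k] stream past it (the classic TPU order)."""
    M, K = A.shape
    _, N = W.shape
    out = np.zeros((M, N))
    for k in range(K):
        for j in range(N):          # W[k, j] is pinned; inputs flow by
            out[:, j] += A[:, k] * W[k, j]
    return out

def matmul_output_stationary(A, W):
    """Output-stationary: each partial sum out[i, j] stays in its PE
    while both operands stream through."""
    M, _ = A.shape
    _, N = W.shape
    out = np.zeros((M, N))
    for i in range(M):
        for j in range(N):          # accumulator out[i, j] never moves
            out[i, j] = np.dot(A[i, :], W[:, j])
    return out

rng = np.random.default_rng(0)
A, W = rng.standard_normal((4, 5)), rng.standard_normal((5, 3))
assert np.allclose(matmul_weight_stationary(A, W), A @ W)
assert np.allclose(matmul_output_stationary(A, W), A @ W)
```

Both loop orders compute the same product; they differ in which values are reused in place, which is exactly what Flex-TPU proposes to select per layer rather than fixing at design time.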


Hardware Acceleration of Explainable Artificial Intelligence

Pan, Zhixin, Mishra, Prabhat

arXiv.org Artificial Intelligence

Machine learning (ML) is successful in achieving human-level artificial intelligence in various fields. However, it lacks the ability to explain an outcome due to its black-box nature. While recent efforts on explainable AI (XAI) have received significant attention, most existing solutions are not applicable in real-time systems, since they cast interpretability as an optimization problem, which leads to numerous iterations of time-consuming complex computations. Although there are existing hardware-based acceleration frameworks for XAI, they are implemented on FPGAs and designed for specific tasks, leading to high cost and limited flexibility. In this paper, we propose a simple yet efficient framework to accelerate various XAI algorithms with existing hardware accelerators. Specifically, this paper makes three important contributions. (1) The proposed method is the first attempt at exploring the effectiveness of the Tensor Processing Unit (TPU) to accelerate XAI. (2) Our proposed solution explores the close relationship between several existing XAI algorithms and matrix computations, and exploits the synergy between convolution and the Fourier transform, which takes full advantage of the TPU's inherent ability to accelerate matrix computations. (3) Our proposed approach can lead to real-time outcome interpretation. Extensive experimental evaluation demonstrates that the proposed approach, deployed on a TPU, can provide drastic improvement in interpretation time (39x on average) as well as energy efficiency (69x on average) compared to existing acceleration techniques.
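The "synergy between convolution and the Fourier transform" mentioned above is the convolution theorem: a convolution in the time/space domain becomes an element-wise product in the frequency domain, which maps naturally onto hardware built for dense tensor arithmetic. A minimal 1-D sketch showing the equivalence (illustrative only; the paper's actual kernels are not shown):

```python
import numpy as np

def conv_direct(x, h):
    """Full linear convolution computed from the definition."""
    n = len(x) + len(h) - 1
    y = np.zeros(n)
    for i, xi in enumerate(x):
        y[i:i + len(h)] += xi * h
    return y

def conv_fft(x, h):
    """Same convolution via the convolution theorem:
    conv(x, h) = IFFT(FFT(x) * FFT(h)), with both signals
    zero-padded to the full output length first."""
    n = len(x) + len(h) - 1
    return np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)))

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0, 0.5])
assert np.allclose(conv_direct(x, h), conv_fft(x, h))
```

For large inputs the FFT route replaces an O(n^2) sliding-window loop with O(n log n) transforms plus an element-wise product, which is the kind of regular, batched arithmetic a TPU accelerates well.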


TPU-MLIR: A Compiler For TPU Using MLIR

Hu, Pengchao, Lu, Man, Wang, Lei, Jiang, Guoyue

arXiv.org Artificial Intelligence

Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor operation (TOP) dialect that encodes the deep learning graph semantics and is independent of the deep learning framework, and 2. a TPU kernel dialect that provides standard kernel computations on the TPU. An NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimizations for the TPU and generate machine code. The paper also presents a verification procedure to ensure the correctness of each transform stage.
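The two-dialect design separates "what the model computes" (TOP, framework-independent) from "how a particular chip runs it" (TPU kernels). A toy Python sketch of that lowering step, with hypothetical op and chip names (the real dialects are MLIR dialects, not Python dictionaries):

```python
# Hypothetical TOP -> TPU op mapping; op and chip names are invented
# for illustration and are not the actual TPU-MLIR identifiers.
KERNEL_TABLE = {
    "top.Conv":   "tpu.conv2d",
    "top.MatMul": "tpu.matmul",
    "top.Relu":   "tpu.activation",
}

def lower_top_to_tpu(top_op, chip_config):
    """Map a framework-independent TOP op to a chip-specific kernel op,
    attaching the target chip so later passes can pick tilings, etc."""
    kernel = KERNEL_TABLE[top_op["name"]]
    attrs = dict(top_op.get("attrs", {}))
    attrs["chip"] = chip_config["chip"]
    return {"name": kernel, "attrs": attrs}

# A two-op "graph" lowered for one target configuration.
graph = [{"name": "top.Conv", "attrs": {"stride": 1}},
         {"name": "top.Relu"}]
lowered = [lower_top_to_tpu(op, {"chip": "tpu_v1"}) for op in graph]
```

The point of the split is that the same TOP-level graph can be lowered to different TPU variants just by changing the chip configuration, without touching the framework-facing front end.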


Why Google's new AI chip is a big deal

#artificialintelligence

The Google team has developed a new AI model that can design complex chips in just hours, an incredibly difficult task that usually takes human engineers months to accomplish. Let's look into what this new artificial intelligence microchip is and the potential impact it could make in the technology industry. A microchip is a small electronic device that processes and stores electronic data. It consists of an integrated circuit fabricated on a very small piece of silicon.


Update Alert: TensorFlow 2.8

#artificialintelligence

Google released TensorFlow 2.8 yesterday, which adds a few major features and improvements, along with many bug fixes and security updates. The main focus of this release is extending the functionality of TensorFlow Lite. Highlights include broader TFLite support for TensorFlow operations; an experimental API that configures TensorFlow ops to run deterministically; a PluggableDevice architecture that offers a plugin mechanism for registering devices with TensorFlow without the need to change TensorFlow code; and more. You can view the full list of changes on the TensorFlow GitHub page (and download and install the latest version): TensorFlow 2.8.0. Let's take a closer look at some of these features. TensorFlow Lite (TFLite) is an open-source framework included with TensorFlow (essentially a lightweight version of TensorFlow) and is intended for mobile and IoT devices.