TPUs


Why Google's custom AI chips are shaking up the tech industry

New Scientist

Ironwood is Google's latest tensor processing unit

Nvidia's position as the dominant supplier of AI chips may be under threat from a specialised chip pioneered by Google, with reports suggesting companies like Meta and Anthropic are looking to spend billions on Google's tensor processing units.

The success of the artificial intelligence industry has been built in large part on graphics processing units (GPUs), a kind of computer chip that can perform many calculations in parallel, rather than one after the other like the central processing units (CPUs) that power most computers.

GPUs were originally developed, as the name suggests, to assist with computer graphics and gaming. "If I have a lot of pixels in a space and I need to do a rotation of this to calculate a new camera view, this is an operation that can be done in parallel, for many different pixels," says Francesco Conti at the University of Bologna in Italy. This ability to do calculations in parallel happened to be useful for training and running AI models, which often rely on calculations involving vast grids of numbers performed at the same time, called matrix multiplication.
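The parallelism described above can be seen in a minimal NumPy sketch: every element of a matrix product is an independent dot product, so a GPU or TPU is free to compute all of them at once, whereas a sequential loop computes them one by one.

```python
import numpy as np

# A camera-view rotation and a neural-network layer reduce to the same
# primitive: matrix multiplication, where each output element is an
# independent dot product that can be computed in parallel.
A = np.arange(6, dtype=np.float64).reshape(2, 3)   # 2x3 matrix
B = np.arange(12, dtype=np.float64).reshape(3, 4)  # 3x4 matrix

# Naive sequential triple loop: what a single CPU core would do,
# one multiply-accumulate after another.
C = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        for k in range(3):
            C[i, j] += A[i, k] * B[k, j]

# Vectorized form: all 8 output elements have no dependency on each
# other, so parallel hardware can compute them simultaneously.
assert np.allclose(C, A @ B)
```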


Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs

Zhu, Zhantong, Li, Hongou, Ren, Wenjie, Wu, Meng, Ye, Le, Huang, Ru, Jia, Tianyu

arXiv.org Artificial Intelligence

With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Compared to the baseline TPUv4i architecture, different design choices achieve up to 44.2% and 33.8% performance improvements for large language model and diffusion transformer inference, respectively, and a 27.3% reduction in MXU energy consumption. Generative models, such as large language models (LLMs) and diffusion models (DMs), have exhibited exceptional performance in generating content across various modalities. For example, LLMs have dominated NLP tasks, powering applications like ChatGPT [1].
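The systolic arrays that CIM would replace can be illustrated with a toy model. The sketch below simulates a weight-stationary array one cycle at a time: each cycle, one slice of the activations streams past the stationary weights and every processing element performs one multiply-accumulate. This is a simplified illustration of the general idea, not the actual MXU design.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy weight-stationary systolic matmul: weights B stay resident in
    the array, activations A stream through, and partial sums accumulate
    cycle by cycle. A simplified model, not Google's MXU design."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    # Cycle t streams activation column A[:, t] past stationary weight
    # row B[t, :]; every PE adds one product to its running partial sum.
    for t in range(k):
        C += np.outer(A[:, t], B[t, :])
    return C

rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))
# After k cycles the accumulated partial sums equal the full product.
assert np.allclose(systolic_matmul(A, B), A @ B)
```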


Life-Cycle Emissions of AI Hardware: A Cradle-To-Grave Approach and Generational Trends

Schneider, Ian, Xu, Hui, Benecke, Stephan, Patterson, David, Huang, Keguo, Ranganathan, Parthasarathy, Elsworth, Cooper

arXiv.org Artificial Intelligence

Specialized hardware accelerators aid the rapid advancement of artificial intelligence (AI), and their efficiency impacts AI's environmental sustainability. This study presents the first comprehensive published life-cycle assessment (LCA) of an AI accelerator's greenhouse gas emissions, including the first published manufacturing emissions for an AI accelerator. Our analysis of five Tensor Processing Units (TPUs) encompasses all stages of the hardware lifespan - from raw material extraction, manufacturing, and disposal, to energy consumption during development, deployment, and serving of AI models. Using first-party data, it offers the most comprehensive evaluation to date of AI hardware's environmental impact. We include detailed descriptions of our LCA to act as a tutorial, road map, and inspiration for other computer engineers to perform similar LCAs to help us all understand the environmental impacts of our chips and of AI. A byproduct of this study is the new metric compute carbon intensity (CCI) that is helpful in evaluating AI hardware sustainability and in estimating the carbon footprint of training and inference. This study shows that CCI improves 3x from TPU v4i to TPU v6e. Moreover, while this paper's focus is on hardware, software advancements leverage and amplify these gains.
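The intuition behind an emissions-per-compute metric like CCI can be sketched as a back-of-envelope calculation. The formula and all numbers below are illustrative assumptions for exposition, not the paper's definition or data.

```python
def carbon_per_compute(total_emissions_kgco2e, delivered_compute):
    """Illustrative emissions-per-compute ratio: lifetime greenhouse gas
    emissions divided by useful compute delivered. The exact definition
    of CCI and the figures below are assumptions, not the paper's data."""
    return total_emissions_kgco2e / delivered_compute

# Hypothetical accelerator generation: embodied (manufacturing) plus
# operational emissions over its lifetime, in kgCO2e, against the
# compute it delivers (arbitrary units).
old_gen = carbon_per_compute(150 + 350, 2.0)

# A newer generation with similar lifetime emissions but 3x the
# delivered compute shows the kind of ~3x improvement the paper
# reports between TPU v4i and TPU v6e.
new_gen = carbon_per_compute(150 + 350, 6.0)
assert abs(old_gen / new_gen - 3.0) < 1e-9
```

The key design point is the denominator: dividing by delivered compute lets generations with very different performance be compared on sustainability directly.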


Scalable Machine Learning Training Infrastructure for Online Ads Recommendation and Auction Scoring Modeling at Google

Kurian, George, Sardashti, Somayeh, Sims, Ryan, Berger, Felix, Holt, Gary, Li, Yang, Willcock, Jeremiah, Wang, Kaiyuan, Quiroz, Herve, Salem, Abdulrahman, Grady, Julian

arXiv.org Artificial Intelligence

Large-scale Ads recommendation and auction scoring models at Google scale demand immense computational resources. While specialized hardware like TPUs have improved linear algebra computations, bottlenecks persist in large-scale systems. This paper proposes solutions for three critical challenges that must be addressed for efficient end-to-end execution in a widely used production infrastructure: (1) Input Generation and Ingestion Pipeline: Efficiently transforming raw features (e.g., "search query") into numerical inputs and streaming them to TPUs; (2) Large Embedding Tables: Optimizing conversion of sparse features into dense floating-point vectors for neural network consumption; (3) Interruptions and Error Handling: Minimizing resource wastage in large-scale shared datacenters. To tackle these challenges, we propose a shared input generation technique to reduce computational load of input generation by amortizing costs across many models. Furthermore, we propose partitioning, pipelining, and RPC (Remote Procedure Call) coalescing software techniques to optimize embedding operations. To maintain efficiency at scale, we describe novel preemption notice and training hold mechanisms that minimize resource wastage, and ensure prompt error resolution. These techniques have demonstrated significant improvement in Google production, achieving a 116% performance boost and an 18% reduction in training costs across representative models.
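The RPC coalescing idea mentioned above, batching many small embedding-row lookups into one call per shard to amortize per-call overhead, can be sketched as follows. All names here are hypothetical stand-ins, not Google's infrastructure.

```python
from collections import defaultdict

def coalesced_lookup(requests, fetch_batch):
    """Coalesce many small embedding-row lookups into one RPC per shard.

    `requests` is a list of (shard_id, row_id) pairs; `fetch_batch(shard,
    rows)` stands in for a single batched RPC returning one vector per
    row. A hypothetical sketch of the coalescing idea only.
    """
    by_shard = defaultdict(list)
    for shard, row in requests:
        by_shard[shard].append(row)
    results = {}
    for shard, rows in by_shard.items():
        # One RPC per shard instead of one per row: per-call overhead
        # (serialization, network round trip) is paid once per batch.
        for row, vec in zip(rows, fetch_batch(shard, rows)):
            results[(shard, row)] = vec
    return results

# Toy backend: the "embedding" for a row is just [row, row].
fetch = lambda shard, rows: [[r, r] for r in rows]
out = coalesced_lookup([(0, 1), (1, 5), (0, 3)], fetch)
assert out[(0, 3)] == [3, 3]
```

With the three requests above, the two rows on shard 0 travel in a single batched call rather than two separate ones.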


On-Device LLMs for SMEs: Challenges and Opportunities

Yee, Jeremy Stephen Gabriel, Ng, Pai Chet, Wang, Zhengkui, McLoughlin, Ian, Ng, Aik Beng, See, Simon

arXiv.org Artificial Intelligence

This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights, contributing significantly to the community by enhancing the technological resilience of SMEs in integrating LLMs.


Reviews: Task-Driven Convolutional Recurrent Models of the Visual System

Neural Information Processing Systems

Post author feedback: I am very impressed by the fits at the bottom of the response. There was some discussion amongst the reviewers concerning the relationship between this and what is known about the actual circuits (e.g., inputs arrive to layers 4 and 5, then from layer 4 signals go to layers 2/3, etc.). It would be useful for the authors to relate this to those facts. Also, we discussed whether your model actually fits the data about the quantity of feedback vs. feedforward connections (as much or more feedback as feedforward). It would be useful to inform the reader as to whether your model accounts for this as well.


A Partial Replication of MaskFormer in TensorFlow on TPUs for the TensorFlow Model Garden

Purohit, Vishal, Jiang, Wenxin, Ravikiran, Akshath R., Davis, James C.

arXiv.org Artificial Intelligence

This paper undertakes the task of replicating the MaskFormer model -- a universal image segmentation model -- originally developed using the PyTorch framework, within the TensorFlow ecosystem, specifically optimized for execution on Tensor Processing Units (TPUs). Our implementation exploits the modular constructs available within the TensorFlow Model Garden (TFMG), encompassing elements such as the data loader, training orchestrator, and various architectural components, tailored and adapted to meet the specifications of the MaskFormer model. We address key challenges encountered during the replication, including non-convergence issues, slow training, adaptation of loss functions, and the integration of TPU-specific functionalities. We verify our reproduced implementation and present qualitative results on the COCO dataset. Although our implementation meets some of the objectives for end-to-end reproducibility, we encountered challenges in replicating the PyTorch version of MaskFormer in TensorFlow. This replication process is not straightforward and requires substantial engineering efforts.


Exploration of TPUs for AI Applications

Carrión, Diego Sanmartín, Prohaska, Vera

arXiv.org Artificial Intelligence

Tensor Processing Units (TPUs) are specialized hardware accelerators for deep learning developed by Google. This paper aims to explore TPUs in cloud and edge computing, focusing on their applications in AI. We provide an overview of TPUs, their general architecture, specifically their design in relation to neural networks, compilation techniques and supporting frameworks. Furthermore, we provide a comparative analysis of Cloud and Edge TPU performance against counterpart chip architectures. Our results show that TPUs can provide significant performance improvements in both cloud and edge computing. Additionally, this paper underscores the imperative need for further research in optimization techniques for efficient deployment of AI architectures on the Edge TPU and benchmarking standards for a more robust comparative analysis in edge computing scenarios. The primary motivation behind this push for research is that efficient AI acceleration, facilitated by TPUs, can lead to substantial savings in terms of time, money, and environmental resources.


HUGE: Huge Unsupervised Graph Embeddings with TPUs

Mayer, Brandon, Tsitsulin, Anton, Fichtenberger, Hendrik, Halcrow, Jonathan, Perozzi, Bryan

arXiv.org Artificial Intelligence

Graphs are a representation of structured data that captures the relationships between sets of objects. With the ubiquity of available network data, there is increasing industrial and academic need to quickly analyze graphs with billions of nodes and trillions of edges. A common first step for network understanding is graph embedding, the process of creating a continuous representation of nodes in a graph. A continuous representation is often more amenable, especially at scale, for solving downstream machine learning tasks such as classification, link prediction, and clustering. A high-performance graph embedding architecture leveraging Tensor Processing Units (TPUs) with configurable amounts of high-bandwidth memory is presented that simplifies the graph embedding problem and can scale to graphs with billions of nodes and trillions of edges. We verify the embedding space quality.

Figure 1: HUGE can learn representations on extremely large graphs (billions of nodes) at Google.
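The general idea of graph embedding, learning one vector per node so that connected nodes end up with similar vectors, can be shown with a minimal dot-product model. This is a toy sketch of the concept only, not the HUGE system's algorithm.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_node_embeddings(edges, num_nodes, dim=8, epochs=200, lr=0.05):
    """Minimal graph embedding: for each edge, pull the two endpoint
    vectors together; push one randomly sampled node pair apart.
    A toy illustration, not the HUGE system's actual method."""
    rng = np.random.default_rng(0)
    emb = rng.normal(scale=0.1, size=(num_nodes, dim))
    for _ in range(epochs):
        for u, v in edges:
            # Positive pair: gradient of -log(sigmoid(u . v)) increases
            # the dot product of connected nodes.
            g = 1.0 - sigmoid(emb[u] @ emb[v])
            emb[u] += lr * g * emb[v]
            emb[v] += lr * g * emb[u]
            # Negative sample: decrease similarity to a random node.
            n = int(rng.integers(num_nodes))
            g = -sigmoid(emb[u] @ emb[n])
            emb[u] += lr * g * emb[n]
            emb[n] += lr * g * emb[u]
    return emb

edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
emb = train_node_embeddings(edges, num_nodes=5)
```

The resulting vectors can then feed downstream tasks such as link prediction (score a candidate edge by the dot product of its endpoints) or node classification.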


Google touts AI supercomputer; Nvidia tops MLPerf 3.0 tests

#artificialintelligence

The war of words among AI supercomputer vendors escalated this week with Google claiming that its TPU-based system is faster and more efficient than Nvidia's A100-based entry, according to its own testing. Nvidia countered that its H100 system is faster based on testing conducted by the independent MLCommons using MLPerf 3.0. Google researchers reported that its Tensor Processing Unit-based supercomputer v4 is 1.2 to 1.7 times faster than Nvidia's 3-year-old A100 system and uses between 1.3 to 1.9 times less power. The MLPerf 3.0 benchmarks measured Nvidia's newer H100 against systems entered by 25 organizations, but Google's TPU-based v4 system was not one of them. A direct system-to-system comparison of the two companies' latest systems would have to be conducted by an independent organization running a variety of AI-based workloads for any benchmarks to be definitive, analysts said.