 grayskull


Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

arXiv.org Artificial Intelligence

The increasing demand for generative AI services built on Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, which are fundamental operations in LLM computations. We present a detailed characterization of Grayskull's execution model and of the impact of grid size, matrix dimensions, data formats, and numerical precision on computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.
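As a rough illustration of how a throughput-per-watt figure like the one quoted above can be derived, the sketch below counts the floating-point operations of a dense matrix multiplication and divides the resulting throughput by the average power draw. It assumes the usual convention of 2·M·N·K operations for an M×K by K×N product; the function names and the example inputs are hypothetical and are not taken from the paper.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) @ (k x n) product.

    Assumes the common convention of one multiply and one add
    per inner-product term, i.e. 2 * m * n * k operations.
    """
    return 2 * m * n * k


def tflops_per_watt(m, n, k, seconds, watts):
    """Throughput in TFLOP/s divided by average power in watts.

    `seconds` is the measured kernel runtime and `watts` the average
    power draw during the run; both are supplied by the caller.
    """
    tflops = matmul_flops(m, n, k) / seconds / 1e12
    return tflops / watts


# Hypothetical measurement, for illustration only (not from the paper):
# a 4096 x 4096 x 4096 matmul that takes 5 ms at an average of 50 W.
example = tflops_per_watt(4096, 4096, 4096, seconds=5e-3, watts=50.0)
print(f"{example:.3f} TFLOP/s per watt")
```

The same helper can be applied to any device, provided runtime and power are measured over the same interval.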


Attention in SRAM on Tenstorrent Grayskull

arXiv.org Artificial Intelligence

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling, and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10 \times$, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
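The three steps that the abstract above describes as fused (matrix multiplication, score scaling, and Softmax) can be sketched on the CPU as follows. This is a minimal plain-Python reference, not the authors' Grayskull kernel or their CPU baseline; it assumes the standard scaling factor of 1/sqrt(d) for key dimension d, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Compute softmax(Q K^T / sqrt(d)) row by row.

    `queries` and `keys` are lists of n vectors of length d.
    The result holds n * n weights, which is where the quadratic
    time and memory cost in sequence length comes from.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        # Matrix multiplication and scaling for one query row.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Softmax turns the scaled scores into attention weights.
        weights.append(softmax(scores))
    return weights


# Tiny example with sequence length 2 and dimension 2.
w = attention_weights([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Each row of `w` sums to 1, and each query places the larger weight on the key it is most aligned with.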


Another deep learning processor appears in the ring: Grayskull from Tenstorrent

#artificialintelligence

Tenstorrent describes the technology behind the processor as: "The first conditional execution architecture for artificial intelligence facilitating scalable deep learning. Tenstorrent has taken an approach that dynamically eliminates unnecessary computation, thus breaking the direct link between model size growth and compute/memory bandwidth requirements." The company adds: "Conditional computation enables adaptation to both inference and training of a model to the exact input that was presented, like adjusting NLP model computations to the exact length of the text presented, and dynamically pruning portions of the model based on input characteristics." The processor has eight channels of LPDDR4, supporting up to 16 Gbyte of external DRAM, and 16 lanes of PCI-E Gen 4. Each Tensix core has a packet processor, a programmable SIMD and maths computation block, five single-issue RISC cores and 1 Mbyte of RAM. "The array of Tensix cores is stitched together with a double 2D torus network-on-chip, which facilitates multi-cast flexibility, along with minimal software burden for scheduling coarse-grain data transfers," according to the company.
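To make the 2D torus topology mentioned above concrete, the sketch below computes the four neighbors of a core in a grid whose edges wrap around. It only illustrates the general torus idea; the grid dimensions in the example are hypothetical, and nothing here reflects Tenstorrent's actual routing, core numbering, or the "double" aspect of its network.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the four neighbors of (row, col) in a rows x cols 2D torus.

    Unlike a plain mesh, indices wrap around at the edges, so every
    node has exactly four neighbors.
    """
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


# Hypothetical 3 x 4 grid: the corner node wraps to the opposite edges.
corner = torus_neighbors(0, 0, rows=3, cols=4)
```

For the corner node, `corner["north"]` is `(2, 0)` and `corner["west"]` is `(0, 3)`, which shows the wraparound links that a mesh would lack.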