graphcore
Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
John, Chelsea Maria, Nassyr, Stepan, Penke, Carolin, Herten, Andreas
Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany

Abstract -- The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.

INTRODUCTION

Fueled by the growing interest in training ever larger deep neural networks, such as large language models and other foundation models, the demand for hardware specialized for these workloads has grown immensely. Graphics processing units (GPUs) have evolved from their origins in computer graphics to become the primary computational engines of the AI revolution. While the central processing unit (CPU) controls a program's execution flow, it offloads compute-intensive, highly parallel tasks to the GPU (the accelerator). NVIDIA, a pioneer in this space, has emerged as the dominant player in the market as of 2024, spearheading current hardware developments. Other vendors, such as AMD and Intel, also provide GPUs aimed at accelerating model training and inference. Another promising class of AI accelerators is based on the idea of distributed local per-compute-unit memory together with on-chip message passing, in contrast to the shared memory hierarchy typical of classical CPUs and GPUs.

Performance characteristics not only vary between generations and vendors, but depend on the node or cluster configuration in which the accelerator is embedded, including CPU, memory, and interconnect. When evaluating and comparing these heterogeneous hardware options, e.g. for purchase decisions in an academic or industrial setting, it is not sufficient to compare hardware characteristics such as number of cores, thermal design power (TDP), theoretical bandwidth, or peak performance in FLOP/s. Their effect on workload performance is not straightforward, and the accelerator architectures might barely be comparable. Performance data reflecting the actual intended workloads, collected on various competing systems independently of vendor interests, offer highly valuable information. Power consumption is one such important metric in this regard.
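The jpwr tool itself is not listed in this excerpt. As a minimal sketch of the underlying approach -- polling per-device power while a workload runs and integrating over time to obtain energy -- the following uses NVIDIA's NVML Python bindings; the function name, sampling interval, and threading scheme are illustrative assumptions, not jpwr's actual design.

```python
# Sketch: estimate the energy of a workload by sampling GPU power via NVML.
# Assumes an NVIDIA GPU and the pynvml package; jpwr's design may differ.
import threading
import time

import pynvml

def measure_energy(workload, interval=0.1, device_index=0):
    """Run workload() while sampling device power; return (result, joules)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []  # (timestamp in s, power in W)
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            milliwatts = pynvml.nvmlDeviceGetPowerUsage(handle)
            samples.append((time.monotonic(), milliwatts / 1000.0))
            time.sleep(interval)

    thread = threading.Thread(target=sampler)
    thread.start()
    try:
        result = workload()
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()

    # Trapezoidal rule: integrate sampled power over time to get joules.
    joules = sum(0.5 * (p0 + p1) * (t1 - t0)
                 for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
    return result, joules
```

The same polling pattern transfers to other vendors' counters (e.g. AMD's ROCm SMI or Graphcore's tooling), which is what makes an energy metric comparable across the accelerators the paper targets.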
SoftBank buys struggling U.K. chip startup Graphcore in AI race
SoftBank Group has acquired British semiconductor startup Graphcore, as the Japanese firm seeks to strengthen its investments in chips and artificial intelligence (AI). The companies announced the deal on Friday without disclosing financial terms. Bristol-based Graphcore will operate as a SoftBank subsidiary and keep its management team, Nigel Toon, Graphcore's chief executive officer, told reporters in a briefing. It's the second U.K. semiconductor company that SoftBank's snapped up, and follows its 2016 takeover of Cambridge-based Arm Holdings, the chip designer whose technology is found in almost all of the world's smartphones. Graphcore was frequently held up as a champion of the U.K. tech industry -- it participated in the country's inaugural AI safety summit last year.
Graphcore Was the UK's AI Champion--Now It's Scrambling to Stay Afloat
Last month, the UK government announced the home for its new exascale supercomputer, designed to give the country an edge in the global artificial intelligence race. The £900 million ($1.1 billion) project would be built in Bristol, a city in the west of England famed for its industrial heritage, and the machine itself would be named after the legendary local engineer, Isambard Kingdom Brunel. The Isambard-AI project should have been a big moment for another Bristolian export--Graphcore, one of the UK's few large-scale chipmakers specializing in designing hardware for AI. Valued at $2.5 billion after its last funding round in 2020, the company is trying to offer an alternative to the US giant Nvidia, which dominates the market. With AI fast becoming an issue of geopolitical as well as commercial importance, and countries--including the UK--spending hundreds of millions of dollars on building strategic reserves of chips and investing in massive supercomputers, companies like Graphcore should be poised to benefit.
PopSparse: Accelerated block sparse matrix multiplication on IPU
Li, Zhiyi, Orr, Douglas, Ohan, Valeriu, Da costa, Godfrey, Murray, Tom, Sanders, Adam, Beker, Deniz, Masters, Dominic
Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such as NVIDIA GPUs using low precision number formats. In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs as well as any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run. Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU across a range of sparsity levels for large matrix and block sizes. Furthermore, static sparsity in general outperforms dynamic sparsity. While previous work on GPAs has shown speedups only for very high sparsity (typically 99% and above), the present work demonstrates that our static sparse implementation outperforms equivalent dense calculations in FP16 at lower sparsity (around 90%). IPU code is available to view and run at ipu.dev/sparsity-benchmarks; GPU code will be made available shortly. The topic of sparsity has gained significant attention in the field of deep learning research due to its potential for increased computational efficiency, reduced model size, and closer alignment with brain-like computation. The notion of sparsity in deep learning most commonly refers to the idea of sparsifying the model weights with the aim of reducing the associated storage and compute costs.
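PopSparse's API is not reproduced in this excerpt. Purely to illustrate the block-sparsity idea it builds on -- storing only the nonzero blocks of a matrix and accumulating one small dense product per block -- here is a minimal NumPy sketch; the data layout, names, and example are assumptions for exposition, not PopSparse code.

```python
# Illustrative block-sparse matmul: only nonzero blocks of A are stored,
# each contributing one small dense product. Not PopSparse's implementation.
import numpy as np

def block_sparse_matmul(blocks, block_coords, shape, block_size, B):
    """Compute A @ B where A is given as a list of dense nonzero blocks.

    blocks:       list of (block_size, block_size) dense arrays
    block_coords: list of (block_row, block_col) positions of those blocks
    shape:        (rows, cols) of the full, mostly-zero matrix A
    """
    rows, _ = shape
    out = np.zeros((rows, B.shape[1]), dtype=B.dtype)
    for blk, (br, bc) in zip(blocks, block_coords):
        r0, c0 = br * block_size, bc * block_size
        out[r0:r0 + block_size] += blk @ B[c0:c0 + block_size]
    return out

# Example: 4x4 block-diagonal A with two nonzero 2x2 blocks (50% block sparsity).
bs = 2
blocks = [np.eye(bs), 2.0 * np.eye(bs)]
coords = [(0, 0), (1, 1)]
B = np.arange(8.0).reshape(4, 2)
y = block_sparse_matmul(blocks, coords, (4, 4), bs, B)
```

With static sparsity, the block coordinates are known at compile time, so a loop like the one above can be fully specialized ahead of time; with dynamic sparsity, the coordinates arrive at run time, which is part of why static patterns tend to be faster.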
GPS++: An Optimised Hybrid MPNN/Transformer for Molecular Property Prediction
Masters, Dominic, Dean, Josef, Klaser, Kerstin, Li, Zhiyi, Maddrell-Mander, Sam, Sanders, Adam, Helal, Hatem, Beker, Deniz, Rampášek, Ladislav, Beaini, Dominique
This technical report presents GPS++, the first-place solution to the Open Graph Benchmark Large-Scale Challenge (OGB-LSC 2022) for the PCQM4Mv2 molecular property prediction task. Our approach implements several key principles from the prior literature. At its core our GPS++ method is a hybrid MPNN/Transformer model that incorporates 3D atom positions and an auxiliary denoising task. The effectiveness of GPS++ is demonstrated by achieving 0.0719 mean absolute error on the independent test-challenge PCQM4Mv2 split. Thanks to Graphcore IPU acceleration, GPS++ scales to deep architectures (16 layers), training at 3 minutes per epoch, and large ensemble (112 models), completing the final predictions in 1 hour 32 minutes, well under the 4 hour inference budget allocated. Our implementation is publicly available at: https://github.com/graphcore/ogb-lsc-pcqm4mv2.
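No model code is included in this excerpt. As a minimal sketch only, a hybrid MPNN/Transformer layer of the kind the abstract describes can sum a local message-passing update with global self-attention over all nodes; the PyTorch module below, including its names, sum aggregation, and residual/normalization placement, is an illustrative assumption, not the GPS++ implementation.

```python
# Sketch of one hybrid MPNN/Transformer layer: local message passing plus
# global self-attention, combined via a residual connection. Illustrative
# only; not the GPS++ architecture.
import torch
import torch.nn as nn

class HybridLayer(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)  # message from (sender, receiver) pair
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, edge_index):
        # x: (num_nodes, dim); edge_index: (2, num_edges) of [src, dst] indices.
        src, dst = edge_index
        messages = self.msg(torch.cat([x[src], x[dst]], dim=-1))
        local = torch.zeros_like(x).index_add_(0, dst, messages)  # sum aggregation
        global_out, _ = self.attn(x[None], x[None], x[None])      # all-pairs attention
        return self.norm(x + local + global_out[0])
```

Stacking 16 such layers matches the depth the report cites; the real model additionally incorporates 3D atom positions and a denoising objective, which this sketch omits.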
In latest benchmark test of AI, it's mostly Nvidia competing against Nvidia
For lack of rich competition, some of Nvidia's most significant results in the latest MLPerf were against itself, comparing its newest GPU, H100 "Hopper," to its existing product, the A100. Although chip giant Nvidia tends to cast a long shadow over the world of artificial intelligence, its ability to simply drive competition out of the market may be increasing, if the latest benchmark test results are any indication.
BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion
Cattaneo, Alberto, Justus, Daniel, Mellor, Harry, Orr, Douglas, Maloberti, Jerome, Liu, Zhenying, Farnsworth, Thorin, Fitzgibbon, Andrew, Banaszewski, Blazej, Luschi, Carlo
We present the award-winning submission to the WikiKG90Mv2 track of OGB-LSC@NeurIPS 2022. The task is link prediction on the large-scale knowledge graph WikiKG90Mv2, consisting of 90M+ nodes and 600M+ edges. Our solution uses a diverse ensemble of 85 Knowledge Graph Embedding models combining five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx) and two different loss functions (log-sigmoid, sampled softmax cross-entropy). Each individual model is trained in parallel on a Graphcore Bow Pod16 using BESS (Balanced Entity Sampling and Sharing), a new distribution framework for KGE training and inference based on balanced collective communications between workers. Our final model achieves a validation MRR of 0.2922 and a test-challenge MRR of 0.2562, winning first place in the competition. The code is publicly available at: https://github.com/graphcore/distributed-kge-poplar/tree/2022-ogb-submission.
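Of the five scoring functions named above, two have especially compact standard definitions that make the ensemble's diversity concrete. The sketch below uses the textbook forms of TransE and DistMult on a triple (h, r, t) of head, relation, and tail embeddings; it is not the BESS or Poplar implementation.

```python
# Textbook KGE scoring functions from the ensemble; higher score = more
# plausible triple. Standard definitions, not the BESS/Poplar code.
import numpy as np

def transe_score(h, r, t):
    # TransE: a triple is plausible when head + relation is close to tail.
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    # DistMult: trilinear product <h, r, t>.
    return float(np.sum(h * r * t))

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 64))
print(transe_score(h, r, t), distmult_score(h, r, t))
```

Because each function scores triples independently, the per-model predictions can be combined after the fact, which is what makes it natural to train the 85 models in parallel across workers.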
Pienso and Graphcore empower business with deeper, faster AI insights
Graphcore is continuing to build out its AI applications and services ecosystem, launching a new partnership with AI platform company Pienso to make its powerful text analysis solution available on IPUs in the cloud. Pienso uses natural language processing to help businesses extract actionable insights from written text such as comments posted on social media, transcripts of customer service phone calls, news articles and documents. Pienso on Graphcore is aimed at enterprise users such as media and entertainment companies, consumer internet businesses (including social networks and e-commerce), telecoms providers, and anyone trying to get high-quality, high-speed insights from large amounts of written data. No coding or ML skills are needed to build and run models in Pienso, meaning it can be used by subject matter experts and strategic decision makers within a business, removing reliance on in-demand AI engineers. Thanks to the IPU's designed-for-AI architecture and world-leading performance in natural language processing, Pienso runs considerably faster and with finer granularity and precision on IPUs than on other compute platforms -- a performance gain that makes an already powerful solution truly transformative for its users.
Graphcore talks Scaling up AI on Weights and Biases Podcast
Machine intelligence is a unique computational workload with distinctly different characteristics to HPC algorithms or graphics programs. With the slowing down of Moore's Law and model sizes on the rise, there is a need for specialised machine learning hardware designed to run AI workloads efficiently. Phil Brown, Graphcore's Director of Applications, recently spoke to Founder of Weights & Biases, Lukas Biewald, about the role of AI processors such as the IPU in driving forward progress in machine intelligence, from enabling sparsity to accelerating BERT. Pursuing new approaches to machine learning can be a challenge, particularly once AI workloads move from pilot to production. At scale, even a slight drop in performance can be costly.
Introducing Graphcloud: Graphcore's MK2 IPU-POD AI cloud service with Cirrascale
Today, Graphcore is proud to take the next step in our commitment to helping customers accelerate their innovation and harness the power of AI at scale. Together with Cirrascale Cloud Services, we have built something totally new for AI in the cloud, with the first publicly available Mk2 IPU-POD scale-out cluster, offering a simple way to add compute capacity on-demand, without the need to own and operate a datacentre. We recognise that the tremendous opportunity offered by AI brings with it a unique set of computing challenges; model size is growing rapidly, and the bar for accuracy is constantly being raised. If customers are to take full advantage of the latest innovations, they need a tightly integrated hardware and software system built specifically for artificial intelligence. Graphcloud is a secure and reliable IPU-POD family cloud service that allows customers to access the power of Graphcore's Intelligence Processing Unit (IPU), as they scale from experimentation, proof of concept and pilot projects to larger production systems.