The following paper, "Simba: Scaling Deep-Learning Inference with Chiplet-Based Architecture," by Shao et al., presents a scalable deep learning accelerator architecture that tackles issues ranging from chip integration technology to workload partitioning and non-uniform latency effects on deep neural network performance. Through a hardware prototype, the authors present a timely study of cross-layer issues that will inform next-generation deep learning hardware, software, and neural network architectures. Chip vendors face significant challenges: the continued slowing of Moore's Law is lengthening the time between new technology nodes, manufacturing costs for silicon are skyrocketing, and Dennard scaling has ended. In the absence of device scaling, domain specialization provides an opportunity for architects to deliver more performance and greater energy efficiency. However, domain specialization is an expensive proposition for chip manufacturers.
Most people don't realize that business analysts (BAs) are part of the data science team. Yet their contribution is the most critical part of machine learning operations. They act as translators between the business stakeholders and the technical team, and they specialize in speaking the language of both worlds. BAs help the technical team break down business problems into actionable machine learning problems.
Like x86 and ARM, RISC-V is an instruction set architecture (ISA). Unlike x86 and ARM, it is a free and open standard that anyone can use without getting locked into someone else's processor designs or paying costly license fees. Apple's recent move to redesign its Mac computers around chips that it fabricates for itself, replacing Intel, has cast a new spotlight on a class of processor that there's a very good chance you own right now. RISC-V (pronounced RISC-5) is the brainchild of UC Berkeley professors David Patterson and Krste Asanović. Patterson has a talent for catchy acronyms and architectures, as a developer of RISC (Reduced Instruction Set Computing) and RAID (Redundant Array of Inexpensive Disks) in the 1980s.
Most of the computing effort in deep learning inference goes into mathematical operations that fall into four groups: convolutions, activations, pooling, and normalization. All four share characteristics that make them well suited to special-purpose hardware implementation: their memory access patterns are highly predictable, and they are readily parallelized. Designing custom hardware accelerators for deep learning is clearly popular, but achieving state-of-the-art performance and efficiency with a new design is a complex and challenging problem. To help developers advance the adoption of efficient AI inference in custom hardware designs, NVIDIA open-sourced the hardware design of the NVIDIA Deep Learning Accelerator (NVDLA) in 2017. NVDLA is both scalable and highly configurable: its modular design maintains flexibility and simplifies integration, and it promotes a standardized, open architecture to address the computational demands of inference.
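To make the four operation groups concrete, here is a minimal pure-Python sketch on one-dimensional inputs. The function names (`conv1d`, `relu`, `max_pool`, `normalize`) are illustrative assumptions, not part of NVDLA or any library, and the convolution is written as a sliding dot product (cross-correlation), which is an assumption about the convention in use. The point is only that every output element reads a fixed, known window of inputs, which is what makes the access pattern predictable and the loop iterations independent.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product; output i reads signal[i : i + len(kernel)]."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


def relu(xs):
    """Element-wise activation; each output depends on exactly one input."""
    return [max(0.0, x) for x in xs]


def max_pool(xs, width):
    """Non-overlapping max pooling; output k reads xs[k*width : (k+1)*width]."""
    return [max(xs[i:i + width]) for i in range(0, len(xs) - width + 1, width)]


def normalize(xs, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]


features = conv1d([1.0, 3.0, 2.0, 5.0, 4.0, 6.0], [1.0, -1.0])
activated = relu(features)
pooled = max_pool(activated, 2)
```

Because no iteration of any of these loops depends on another iteration's result, a hardware design could in principle compute many output elements at once; how a real accelerator schedules them is outside the scope of this sketch.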
We began our Turing Lecture June 4, 2018 [11] with a review of computer architecture since the 1960s. In addition to that review, here, we highlight current challenges and identify future opportunities, projecting another golden age for the field of computer architecture in the next decade, much like the 1980s when we did the research that led to our award, delivering gains in cost, energy, and security, as well as performance.
"Those who cannot remember the past are condemned to repeat it."--George
Software talks to hardware through a vocabulary called an instruction set architecture (ISA). By the early 1960s, IBM had four incompatible lines of computers, each with its own ISA, software stack, I/O system, and market niche--targeting small business, large business, scientific, and real time, respectively. IBM engineers, including ACM A.M. Turing Award laureate Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four of these ISA bases. They needed a technical solution for how computers as inexpensive as those with 8-bit data paths and as fast as those with 64-bit data paths could share a single ISA. The data paths are the "brawn" of the processor in that they perform the arithmetic but are relatively easy to "widen" or "narrow." The greatest challenge for computer designers then and now is the "brains" of the processor--the control hardware. Inspired by software programming, computing pioneer and Turing laureate Maurice Wilkes proposed how to simplify control. Control was specified as a two-dimensional array he called a "control store." Each column of the array corresponded to one control line, each row was a microinstruction, and writing microinstructions was called microprogramming [39]. A control store contains an ISA interpreter written using microinstructions, so execution of a conventional instruction takes several microinstructions. The control store was implemented through memory, which was much less costly than logic gates.
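The control-store idea can be illustrated with a small Python sketch. Everything here is a hypothetical toy: the control lines, the three-instruction ISA (`LOAD`, `ADD`, `SUB`), and the accumulator data path are assumptions chosen for brevity, not Wilkes's design or any System/360 model. What it does mirror from the description above is the layout: a two-dimensional array in which each column is one control line and each row is one microinstruction, with each conventional instruction interpreted as several rows.

```python
# Columns of the control store: one per control line (hypothetical names).
CONTROL_LINES = ("ACC_CLR", "TMP_IN", "ALU_ADD", "ALU_SUB", "ACC_IN", "END")

# Rows of the control store: one per microinstruction.
CONTROL_STORE = [
    # ACC_CLR TMP_IN ALU_ADD ALU_SUB ACC_IN END
    (1, 1, 0, 0, 0, 0),  # row 0: LOAD, step 1: clear acc, latch operand
    (0, 0, 1, 0, 1, 1),  # row 1: LOAD, step 2: acc <- acc + tmp
    (0, 1, 0, 0, 0, 0),  # row 2: ADD,  step 1: latch operand
    (0, 0, 1, 0, 1, 1),  # row 3: ADD,  step 2: acc <- acc + tmp
    (0, 1, 0, 0, 0, 0),  # row 4: SUB,  step 1: latch operand
    (0, 0, 0, 1, 1, 1),  # row 5: SUB,  step 2: acc <- acc - tmp
]

# Each ISA-level opcode maps to the row where its microprogram starts.
ENTRY_POINT = {"LOAD": 0, "ADD": 2, "SUB": 4}


def run(program):
    """Interpret (opcode, operand) pairs; return (accumulator, microinstructions executed)."""
    acc = tmp = 0
    micro_steps = 0
    for opcode, operand in program:
        row = ENTRY_POINT[opcode]
        while True:
            lines = dict(zip(CONTROL_LINES, CONTROL_STORE[row]))
            if lines["ACC_CLR"]:
                acc = 0
            if lines["TMP_IN"]:
                tmp = operand
            if lines["ALU_ADD"]:
                alu_out = acc + tmp
            elif lines["ALU_SUB"]:
                alu_out = acc - tmp
            else:
                alu_out = 0
            if lines["ACC_IN"]:
                acc = alu_out
            micro_steps += 1
            if lines["END"]:
                break
            row += 1  # fall through to the next microinstruction
    return acc, micro_steps


result = run([("LOAD", 5), ("ADD", 3), ("SUB", 2)])
```

In this toy, three conventional instructions take six microinstructions, and changing the ISA means editing rows of a table rather than rewiring logic, which is the flexibility the passage attributes to microprogramming.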
The table here lists four models of the new System/360 ISA IBM announced April 7, 1964. The data paths vary by a factor of 8, memory capacity by a factor of 16, clock rate by nearly 4, performance by 50, and cost by nearly 6.