Gope, Dibakar, Beu, Jesse, Mattina, Matthew

Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads, including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operand, they cannot exploit the narrower bit-width of one of the operands. We propose a new SIMD matrix multiplication instruction that takes mixed-precision inputs (8- and 4-bit operands) and accumulates product values into narrower 16-bit output accumulators. This allows a SIMD operation at 128-bit vector width to process a greater number of data elements per instruction, improving processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-size SIMD instruction offers a 2x improvement in matrix multiplication throughput over existing symmetric-operand-size instructions while causing negligible (0.05%) overflow from 16-bit accumulators for representative machine learning workloads. The asymmetric-operand-size instruction can not only improve matrix multiplication throughput in CPUs, but can also support multiply-and-accumulate (MAC) operations between 8- and 4-bit operands in state-of-the-art DNN hardware accelerators (e.g., the systolic array microarchitecture in Google TPU) and offer a similar improvement in matrix multiply performance seamlessly, without violating the various implementation constraints. We demonstrate how a systolic array architecture designed for symmetric-operand-size instructions could be modified to support an asymmetric-operand-size instruction.
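The arithmetic described above can be emulated in a few lines of scalar code. The following is a minimal sketch, not the proposed instruction itself: the function names `wrap_int16` and `mac_8x4` are hypothetical, and it assumes signed operands and an accumulator that wraps on overflow, neither of which the abstract specifies. It multiplies 8-bit by 4-bit values, accumulates the products into a 16-bit value, and reports whether the accumulator overflowed.

```python
def wrap_int16(value):
    """Wrap an integer into the signed 16-bit range [-32768, 32767]."""
    return (value + 0x8000) % 0x10000 - 0x8000


def mac_8x4(a8, b4):
    """Multiply-accumulate int8 values with int4 values into a 16-bit accumulator.

    Assumption: operands are signed and the accumulator wraps on overflow.
    Returns (accumulator, overflowed).
    """
    if len(a8) != len(b4):
        raise ValueError("operand lengths differ")
    acc = 0
    overflowed = False
    for a, b in zip(a8, b4):
        if not -128 <= a <= 127:
            raise ValueError(f"{a} does not fit in a signed 8-bit operand")
        if not -8 <= b <= 7:
            raise ValueError(f"{b} does not fit in a signed 4-bit operand")
        exact = acc + a * b          # result with unlimited precision
        acc = wrap_int16(exact)      # what a 16-bit accumulator would hold
        overflowed = overflowed or acc != exact
    return acc, overflowed


print(mac_8x4([100, -50, 3], [7, -8, 2]))   # (1106, False)
print(mac_8x4([127] * 40, [7] * 40))        # (-29976, True)
```

Under these assumptions the largest product magnitude is (-128) × (-8) = 1024, so 31 worst-case products still fit in a signed 16-bit accumulator (31744) while 32 do not (32768 > 32767).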

I don't understand what "optimizing code for GPU" means. I have read that GPUs do matrix multiplication much faster than CPUs. So does it just mean that instead of calculating something sequentially, we try to turn it into a large matrix multiplication? Can you refer me to articles showing the difference with examples?

Highlights: In this post we are going to talk about vectors. They are the fundamental building blocks of Linear Algebra. We will give an intuitive definition of what vectors are, where we use them, and how we add them and multiply them by scalars. We provide code examples to demonstrate how to work with vectors in Python. So, what exactly is a vector?
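One simple way to picture it in code is as an ordered list of numbers. The sketch below is illustrative only and is not taken from the post: the helper names `add` and `scale` are hypothetical, and it uses plain Python lists rather than any particular library. It shows the two operations mentioned above, vector addition and multiplication by a scalar.

```python
def add(u, v):
    """Add two vectors of the same length, component by component."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    """Multiply every component of vector v by the scalar c."""
    return [c * a for a in v]


u = [1, 2]
v = [3, -1]
print(add(u, v))     # [4, 1]
print(scale(2, u))   # [2, 4]
```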

Zhang, Aston, Tay, Yi, Zhang, Shuai, Chan, Alvin, Luu, Anh Tuan, Hui, Siu Cheung, Fu, Jie

Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with Quaternions" (4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of Quaternions, both save parameters, needing only 1/4 of the learnable parameters, and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space exists only at a few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility by using an arbitrary $1/n$ of the learnable parameters of the fully-connected layer counterpart. Experiments applying the approach to LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement demonstrate the architectural flexibility and effectiveness of the proposed approach.
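To make the idea of a parameterized multiplication rule concrete, here is a minimal sketch. It assumes the weight matrix is built as a sum of Kronecker products, $W = \sum_{i=1}^{n} A_i \otimes S_i$, where the $n \times n$ matrices $A_i$ encode the multiplication rule and the smaller matrices $S_i$ hold the weights; the abstract does not spell out this construction, so treat it as an assumption. All function names are hypothetical, and the rule matrices here are fixed by hand to reproduce complex multiplication (n = 2), whereas the proposed method would learn them from data.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def phm_weight(rules, blocks):
    """Build W as the sum of kron(A_i, S_i) over all rule/block pairs."""
    terms = [kron(a, s) for a, s in zip(rules, blocks)]
    rows, cols = len(terms[0]), len(terms[0][0])
    return [[sum(t[r][c] for t in terms) for c in range(cols)]
            for r in range(rows)]


def matvec(w, x):
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def phm_param_count(k, d, n):
    """Parameters under the assumed construction: n blocks of (k/n) x (d/n) plus n rules of n x n."""
    return k * d // n + n ** 3


# Hand-written rules for n = 2 that reproduce complex multiplication.
rules = [[[1, 0], [0, 1]], [[0, -1], [1, 0]]]
blocks = [[[3]], [[4]]]            # the weight 3 + 4i

w = phm_weight(rules, blocks)
print(w)                           # [[3, -4], [4, 3]]
print(matvec(w, [1, 2]))           # [-5, 10], i.e. (3 + 4i)(1 + 2i) = -5 + 10i
print(phm_param_count(512, 512, 4))  # 65600, versus 262144 for a dense 512 x 512 layer
```

For large layers the $n^3$ term for the rules is small next to the $kd/n$ term for the blocks, which is where the roughly $1/n$ parameter count comes from.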

Learning multiplication tables is really important. Without multiplication tables, Math becomes quite difficult to understand. This is why children are made to learn times tables in their early classes. How well a child learns the multiplication tables has a direct effect on how they learn and progress in Math. This course teaches two of the fastest Vedic Math techniques for calculating times tables up to 1000 mentally, without a calculator.