Training Deep Neural Networks with 8-bit Floating Point Numbers
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan
Neural Information Processing Systems
Firstly, when all of the operands (i.e., weights, activations, errors, and gradients) for general matrix multiplication (GEMM) and convolution computations are reduced to 8 bits, most DNNs suffer noticeable accuracy degradation (e.g., Figure 1(a)).
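To make the 8-bit representation concrete, the FP8 format used in this setting allocates 1 sign bit, 5 exponent bits, and 2 mantissa bits. Below is a minimal NumPy sketch, not the authors' code, of round-to-nearest quantization onto such a 1-5-2 grid; the function name and saturation behavior are illustrative assumptions.

```python
import numpy as np

def quantize_fp8_152(x, exp_bits=5, man_bits=2):
    """Illustrative helper: round values to the nearest number
    representable in a 1-5-2 FP8 format (bias 15), saturating at
    the largest representable magnitude."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1           # 15 for 5 exponent bits
    max_exp = (2 ** exp_bits - 2) - bias     # largest normal exponent: +15
    min_exp = 1 - bias                       # smallest normal exponent: -14
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-value binade, clamped so subnormals share the minimum exponent.
    e = np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp)))
    e = np.clip(e, min_exp, max_exp)
    # Spacing between adjacent representable values in that binade.
    scale = 2.0 ** (e - man_bits)
    q = np.round(mag / scale) * scale        # round-to-nearest on the grid
    # Saturate instead of overflowing to infinity.
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    q = np.minimum(q, max_val)
    return sign * q
```

With only 2 mantissa bits there are just four representable values per binade, which is why naively quantizing all GEMM operands to this format loses accuracy: for example, 1.2 rounds to 1.25, and everything above roughly 5.7e4 saturates.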