Scaling Supervised Local Learning with Augmented Auxiliary Networks

Chenxiang Ma, Jibin Wu, Chenyang Si, Kay Chen Tan

arXiv.org Artificial Intelligence 

Deep neural networks are typically trained with global error signals that are backpropagated (BP) end-to-end, a procedure that is not only biologically implausible but also suffers from the update-locking problem and incurs substantial memory consumption. Local learning, which updates each layer independently through a gradient-isolated auxiliary network, offers a promising alternative to address these problems. However, existing local learning methods exhibit a large accuracy gap relative to their BP-trained counterparts, particularly for large-scale networks. This gap stems from the weak coupling between local layers and their subsequent network layers, as no gradients are communicated across layers. To tackle this issue, we put forward an augmented local learning method, dubbed AugLocal. AugLocal constructs each hidden layer's auxiliary network by uniformly selecting a small subset of its subsequent network layers, thereby strengthening their synergy. We further propose to linearly reduce the depth of auxiliary networks as the hidden layer goes deeper, ensuring sufficient network capacity while reducing the computational cost of auxiliary networks. Our extensive experiments on four image classification datasets (CIFAR-10, SVHN, STL-10, and ImageNet) demonstrate that AugLocal scales effectively to tens of local layers, achieving accuracy comparable to BP-trained networks while reducing GPU memory usage by around 40%. The proposed AugLocal method therefore opens up a myriad of opportunities for training high-performance deep neural networks on resource-constrained platforms.

Artificial neural networks (ANNs) have achieved remarkable performance in pattern recognition tasks by increasing their depth (Krizhevsky et al., 2012; LeCun et al., 2015; He et al., 2016; Huang et al., 2017). However, these deep ANNs are trained end-to-end with the backpropagation algorithm (BP) (Rumelhart et al., 1985), which faces several limitations. One major criticism of BP is its biological implausibility (Crick, 1989; Lillicrap et al., 2020), as it relies on a global objective optimized by backpropagating error signals across layers. This stands in contrast to biological neural networks, which predominantly learn from local signals (Hebb, 1949; Caporale & Dan, 2008; Bengio et al., 2015).
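To make the two design rules in the abstract concrete, the following is a minimal, hypothetical Python sketch of how an auxiliary network's depth could decay linearly with the hidden layer's position, and how a uniform subset of subsequent layers could be selected to build it. The function names (auxiliary_depth, select_auxiliary_layers) and the default depth values are illustrative assumptions, not taken from the paper's implementation.

    # Hypothetical sketch of the two scheduling rules described in the abstract:
    # (1) linear decay of auxiliary-network depth with hidden-layer index, and
    # (2) uniform selection of layer indices from the subsequent network layers.
    # Names and defaults are illustrative, not from the paper's code.

    def auxiliary_depth(layer_idx, num_layers, max_depth, min_depth=1):
        """Linearly reduce auxiliary depth as the hidden layer goes deeper."""
        frac = layer_idx / max(num_layers - 1, 1)  # 0 at first layer, 1 at last
        return max(min_depth, round(max_depth - frac * (max_depth - min_depth)))

    def select_auxiliary_layers(layer_idx, num_layers, depth):
        """Uniformly pick `depth` indices from the layers after `layer_idx`."""
        remaining = list(range(layer_idx + 1, num_layers))
        if len(remaining) <= depth:
            return remaining
        step = (len(remaining) - 1) / (depth - 1) if depth > 1 else 0.0
        return [remaining[round(i * step)] for i in range(depth)]

    # Example: a 16-layer network with auxiliary depth decaying from 4 to 1.
    for l in range(0, 16, 5):
        d = auxiliary_depth(l, num_layers=16, max_depth=4)
        print(l, d, select_auxiliary_layers(l, num_layers=16, depth=d))

Under these assumptions, the first hidden layer gets the deepest auxiliary network (indices spread evenly up to the final layer), while the deepest hidden layers fall back to a single-layer auxiliary network, matching the abstract's stated trade-off between network capacity and auxiliary-network cost.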