to

### Extreme Memorization via Scale of Initialization

We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with $\sin$ activation being the most extreme. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization could cause the representations and gradients to be increasingly misaligned across examples in the same class. We further demonstrate that a similar misalignment phenomenon occurs in other scenarios affecting generalization performance, such as changes to the architecture or data distribution.

### Efficient Convolutional Neural Network Training with Direct Feedback Alignment

There were many algorithms to substitute the back-propagation (BP) in the deep neural network (DNN) training. However, they could not become popular because their training accuracy and the computational efficiency were worse than BP. One of them was direct feedback alignment (DFA), but it showed low training performance especially for the convolutional neural network (CNN). In this paper, we overcome the limitation of the DFA algorithm by combining with the conventional BP during the CNN training. To improve the training stability, we also suggest the feedback weight initialization method by analyzing the patterns of the fixed random matrices in the DFA. Finally, we propose the new training algorithm, binary direct feedback alignment (BDFA) to minimize the computational cost while maintaining the training accuracy compared with the DFA. In our experiments, we use the CIFAR-10 and CIFAR-100 dataset to simulate the CNN learning from the scratch and apply the BDFA to the online learning based object tracking application to examine the training in the small dataset environment. Our proposed algorithms show better performance than conventional BP in both two different training tasks especially when the dataset is small.

### Sparse Transfer Learning via Winning Lottery Tickets

The recently proposed Lottery Ticket Hypothesis of Frankle and Carbin (2019) suggests that the performance of over-parameterized deep networks is due to the random initialization seeding the network with a small fraction of favorable weights. These weights retain their dominant status throughout training -- in a very real sense, this sub-network "won the lottery" during initialization. The authors find sub-networks via unstructured magnitude pruning with 85-95% of parameters removed that train to the same accuracy as the original network at a similar speed, which they call winning tickets. In this paper, we extend the Lottery Ticket Hypothesis to a variety of transfer learning tasks. We show that sparse sub-networks with approximately 90-95% of weights removed achieve (and often exceed) the accuracy of the original dense network in several realistic settings. We experimentally validate this by transferring the sparse representation found via pruning on CIFAR-10 to SmallNORB and FashionMNIST for object recognition tasks.

### Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

Hardware accelerations of deep learning systems have been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed, which is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts the general block-circulant matrices to achieve a fine-grained tradeoff of accuracy and compression ratio. It applies to both fully-connected and convolutional layers and contains a mathematically rigorous proof of the effectiveness of the method. The proposed algorithm reduces computational complexity per layer from O(n 2 ) to O(n log n) and storage complexity from O(n 2 ) to O(n), both for training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource re-using, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least 152X speedup and 71X energy efficiency gain compared with IBM TrueNorth processor under the same test accuracy. It achieves at least 31X energy efficiency gain compared with the reference FPGA-based work.

### 2L-3W: 2-Level 3-Way Hardware-Software Co-Verification for the Mapping of Deep Learning Architecture (DLA) onto FPGA Boards

FPGAs have become a popular choice for deploying deep learning architectures (DLA). There are many researchers that have explored the deployment and mapping of DLA on FPGA. However, there has been a growing need to do design-time hardware-software co-verification of these deployments. To the best of our knowledge this is the first work that proposes a 2-Level 3-Way (2L-3W) hardware-software co-verification methodology and provides a step-by-step guide for the successful mapping, deployment and verification of DLA on FPGA boards. The 2-Level verification is to make sure the implementation in each stage (software and hardware) are following the desired behavior. The 3-Way co-verification provides a cross-paradigm (software, design and hardware) layer-by-layer parameter check to assure the correct implementation and mapping of the DLA onto FPGA boards. The proposed 2L-3W co-verification methodology has been evaluated over several test cases. In each case, the prediction and layer-by-layer output of the DLA deployed on PYNQ FPGA board (hardware) alongside with the intermediate design results of the layer-by-layer output of the DLA implemented on Vivado HLS and the prediction and layer-by-layer output of the software level (Caffe deep learning framework) are compared to obtain a layer-by-layer similarity score. The comparison is achieved using a completely automated Python script. The comparison provides a layer-by-layer similarity score that informs us the degree of success of the DLA mapping to the FPGA or help identify in design time the layer to be debugged in the case of unsuccessful mapping. We demonstrated our technique on LeNet DLA and Caffe inspired Cifar-10 DLA and the co-verification results yielded layer-by-layer similarity scores of 99\% accuracy.