We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with $\sin$ activation being the most extreme. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization could cause the representations and gradients to be increasingly misaligned across examples in the same class. We further demonstrate that a similar misalignment phenomenon occurs in other scenarios affecting generalization performance, such as changes to the architecture or data distribution.
In neural networks, the backpropagation algorithm is important to train the network. But it also raises another problem. That is the problem of vanishing gradients. In this article, we are going to dive into the topic of vanishing gradients in neural networks and why it is a problem while training. The backpropagation algorithm is used to update the weights in the network with a Gradient Descent step.
Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization.
Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
DenseNet(Densely Connected Convolutional Networks) is one of the latest neural networks for visual object recognition that obtains state of the art results on many datasets. It's quite similar to ResNet but has some fundamental differences. This post assumes previous knowledge of neural networks(NN) and convolutions(convs). If you know how DenseNets works and interested only in tensorflow implementation feel free to jump to the second chapter or check the source code on GitHub. If you not familiar with any topics but want to get some knowledge -- I highly advise you this CS231n Stanford classes.