Perceptrons
[D] ReLU activated feed-forward network learns from back. Why? • r/MachineLearning
I've been spending some time looking at the convergence behavior of different neural networks trained on MNIST data with cross-entropy loss. I started by training deeper and deeper networks using sigmoid-type activations until the learning efficiency got too low, then switched to ReLU activations. After switching to ReLU activations, my network converged without too many problems, but I noticed that the learning rates exhibited an interesting pattern. In particular, it takes a complete epoch before the loss begins to fall. My weights and biases are initialized uniformly, with weights initialized between -0.1 and 0.1.
Attention-based Graph Neural Network for Semi-supervised Learning
Thekumparampil, Kiran K., Wang, Chong, Oh, Sewoong, Li, Li-Jia
Recently popularized graph neural networks achieve state-of-the-art accuracy on a number of standard benchmark datasets for graph-based semi-supervised learning, improving significantly over existing approaches. These architectures alternate between a propagation layer that aggregates the hidden states of the local neighborhood and a fully-connected layer. Perhaps surprisingly, we show that a linear model that removes all the intermediate fully-connected layers is still able to achieve performance comparable to the state-of-the-art models. This significantly reduces the number of parameters, which is critical for semi-supervised learning where the number of labeled examples is small. This in turn leaves room for designing more innovative propagation layers. Based on this insight, we propose a novel graph neural network that removes all the intermediate fully-connected layers and replaces the propagation layers with attention mechanisms that respect the structure of the graph. The attention mechanism allows us to learn a dynamic and adaptive local summary of the neighborhood to achieve more accurate predictions. In a number of experiments on benchmark citation network datasets, we demonstrate that our approach outperforms competing methods. By examining the attention weights among neighbors, we show that our model provides some interesting insights on how neighbors influence each other.
Perceptron learning algorithm doesn't work
However, the program runs into an infinite loop and the weights grow very large. What should I do to debug my program? If you can point out what's going wrong, that would also be appreciated. What I'm doing here is first generating some data points at random and assigning labels to them according to a linear target function, then using perceptron learning to learn this linear function.
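The poster's code is not shown here, so the following is not a diagnosis. It is a minimal sketch of the setup described above, written in Python with NumPy. The function names `make_data` and `perceptron`, the ±1 label convention, the leading bias column, and the `max_updates` cap are all assumptions made for illustration.

```python
import numpy as np

def make_data(n, rng):
    """Sample n points in [-1, 1]^2 and label them with a random linear target."""
    # The first column is a constant 1 so that the bias is learned as w[0].
    X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, 2))])
    w_target = rng.normal(size=3)            # hypothetical target function
    y = np.where(X @ w_target >= 0, 1, -1)   # labels in {-1, +1}
    return X, y

def perceptron(X, y, max_updates=100_000):
    """Run the perceptron update rule until no point is misclassified."""
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        pred = np.where(X @ w >= 0, 1, -1)
        wrong = np.flatnonzero(pred != y)
        if wrong.size == 0:
            return w, updates                # converged
        i = wrong[0]
        w += y[i] * X[i]                     # standard update: w <- w + y_i * x_i
    # The cap turns a silent infinite loop into a visible failure.
    raise RuntimeError("no convergence within max_updates")
```

Calling `perceptron(*make_data(50, np.random.default_rng(0)))` runs the whole experiment. When a loop like this never ends, common things to check are whether the labels are actually ±1, whether a bias term is included, and whether the labels really come from a linear function of the same features the learner sees.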
Structured Control Nets for Deep Reinforcement Learning
Srouji, Mario, Zhang, Jian, Salakhutdinov, Ruslan
In recent years, Deep Reinforcement Learning has made impressive advances in solving several important benchmark problems for sequential decision making. Many control applications use a generic multilayer perceptron (MLP) for non-vision parts of the policy network. In this work, we propose a new neural network architecture for the policy network representation that is simple yet effective. The proposed Structured Control Net (SCN) splits the generic MLP into two separate sub-modules: a nonlinear control module and a linear control module. Intuitively, the nonlinear control is for forward-looking and global control, while the linear control stabilizes the local dynamics around the residual of global control. We hypothesize that this will bring together the benefits of both linear and nonlinear policies: improve training sample efficiency, final episodic reward, and generalization of learned policy, while requiring a smaller network and being generally applicable to different training methods. We validated our hypothesis with competitive results on simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban driving environment, with various ablation and generalization tests, trained with multiple black-box and policy gradient training methods. The proposed architecture has the potential to improve upon broader control tasks by incorporating problem specific priors into the architecture. As a case study, we demonstrate much improved performance for locomotion tasks by emulating the biological central pattern generators (CPGs) as the nonlinear part of the architecture.
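As a rough illustration of the split described in this abstract, the sketch below computes an action from a linear module and a small nonlinear module. The abstract does not say how the two outputs are combined, so summing them is an assumption here, as are the function name `scn_action`, the single tanh hidden layer, and all parameter names.

```python
import numpy as np

def scn_action(s, K, b, W1, b1, W2):
    """Combine a linear and a nonlinear control term for state s (assumed: by summation)."""
    u_linear = K @ s + b                        # linear control module
    u_nonlinear = W2 @ np.tanh(W1 @ s + b1)     # hypothetical one-hidden-layer nonlinear module
    return u_linear + u_nonlinear
```

With `W2` set to zeros, the policy reduces to the purely linear controller `K @ s + b`, which makes it easy to compare the two parts in isolation.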
The Birth of AI and The First AI Hype Cycle
Every decade seems to have its technological buzzwords: we had personal computers in the 1980s; the Internet and World Wide Web in the 1990s; smartphones and social media in the 2000s; and Artificial Intelligence (AI) and Machine Learning in this decade. While AI is among today's most popular topics, a commonly forgotten fact is that it was actually born in 1950 and went through a hype cycle between 1956 and 1982. The purpose of this article is to highlight some of the achievements that took place during the boom phase of this cycle and explain what led to its bust phase. The lessons to be learned from this hype cycle should not be overlooked – its successes formed the archetypes for machine learning algorithms used today, and its shortcomings indicated the dangers of overenthusiasm in promising fields of research and development. Although the first computers were developed during World War II [1,2], what seemed to truly spark the field of AI was a question proposed by Alan Turing in 1950 [3]: can a machine imitate human intelligence?
Generating Neural Networks with Neural Networks
Hypernetworks are neural networks that transform a random input vector into weights for a specified target neural network. We formulate the hypernetwork training objective as a compromise between accuracy and diversity, where the diversity takes into account trivial symmetry transformations of the target network. We show that this formulation naturally arises as a relaxation of an optimistic probability distribution objective for the generated networks, and we explain how it is related to variational inference. We use multi-layered perceptrons to form the mapping from the low dimensional input random vector to the high dimensional weight space, and demonstrate how to reduce the number of parameters in this mapping by weight sharing. We perform experiments on a four layer convolutional target network which classifies MNIST images, and show that the generated weights are diverse and have interesting distributions.
Neural Granger Causality for Nonlinear Time Series
Tank, Alex, Covert, Ian, Foti, Nicholas, Shojaie, Ali, Fox, Emily
While most classical approaches to Granger causality detection assume linear dynamics, many interactions in applied domains, like neuroscience and genomics, are inherently nonlinear. In these cases, using linear models may lead to inconsistent estimation of Granger causal interactions. We propose a class of nonlinear methods by applying structured multilayer perceptrons (MLPs) or recurrent neural networks (RNNs) combined with sparsity-inducing penalties on the weights. By encouraging specific sets of weights to be zero (in particular through the use of convex group-lasso penalties), we can extract the Granger causal structure. To further contrast with traditional approaches, our framework naturally enables us to efficiently capture long-range dependencies between series either via our RNNs or through an automatic lag selection in the MLP. We show that our neural Granger causality methods outperform state-of-the-art nonlinear Granger causality methods on the DREAM3 challenge data. This data consists of nonlinear gene expression and regulation time courses with only a limited number of time points. The successes we show in this challenging dataset provide a powerful example of how deep learning can be useful in cases that go beyond prediction on large datasets. We likewise demonstrate our methods in detecting nonlinear interactions in a human motion capture dataset.
Artificial Neural Networks – Part 2: MLP Implementation for XOr
As promised in part one, this second part details a Java implementation of a multilayer perceptron (MLP) for the XOr problem. Actually, as you will see, the core classes are designed to implement any MLP with a single hidden layer. First, it will help to give a quick overview of how MLP networks can be used to make predictions for the XOr problem. For a more detailed explanation, please review part one of this post. The image at the top of this article depicts the architecture for a multilayer perceptron network designed specifically to solve the XOr problem.
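To make the idea of an MLP predicting XOr concrete, here is a minimal forward-pass sketch. It is written in Python rather than the article's Java, and it is not the article's implementation: the weights are hand-picked rather than trained, the activation is a simple step function, and the names `W_hidden`, `b_hidden`, `w_out`, `b_out` and `xor_mlp` are hypothetical.

```python
import numpy as np

# Two inputs, one hidden layer with two units, one output unit.
W_hidden = np.array([[1.0, 1.0],    # weights into hidden unit 1 (acts like OR)
                     [1.0, 1.0]])   # weights into hidden unit 2 (acts like AND)
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])       # output fires for "OR and not AND"
b_out = -0.5

def xor_mlp(x):
    """Forward pass with a step activation; x is a length-2 array of 0/1 values."""
    h = (W_hidden @ x + b_hidden > 0).astype(float)   # hidden activations
    return int(w_out @ h + b_out > 0)                 # output is 0 or 1
```

For example, `[xor_mlp(np.array(p)) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]` evaluates the network on all four input pairs. The hidden layer is what makes this possible: a single perceptron cannot separate the XOr classes with one line.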