In this paper, we propose a generic technique to model temporal dependencies and sequences using a combination of a recurrent neural network (RNN) and a Deep Belief Network (DBN). Our technique, RNN-DBN, combines the memory state of the RNN, which supplies temporal information, with a multi-layer DBN, which provides a high-level representation of the data. This makes RNN-DBNs ideal for sequence generation. Further, the use of a DBN in conjunction with the RNN makes this model capable of significantly more complex data representation than an RBM. We apply this technique to the task of polyphonic music generation.
Let me open this article with a question: "working love learning we on deep". Did this make any sense to you? Not really. Now read this one: "We love working on deep learning". A little jumble in the words made the sentence incoherent. So can we expect a neural network to make sense of it? If the human brain is confused about what it means, I am sure a neural network is going to have a tough time deciphering such text.
We have recently shown that when initialized with "small" weights, many connectionist models with feedback connections are inherently biased towards Markov models, i.e. even prior to any training, the dynamics of the models can be readily used to extract finite memory machines (Tiňo, Čerňanský, & Beňušková 2004; Hammer & Tiňo 2003). In this study we briefly outline the core arguments for such claims and generalize the results to recursive neural networks capable of processing ordered trees. In the early stages of learning, the compositional organization of recursive activations has a Markovian structure: trees sharing a top subtree are mapped close to each other. The deeper the shared subtree, the closer together the trees are mapped.
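The Markovian bias for the plain sequence case can be illustrated with a toy example. The sketch below is not taken from the cited work: it assumes a one-dimensional tanh recurrent unit, and the recurrent weight, the symbol encodings, and the function names are hypothetical choices made only for illustration. Because the recurrent weight is small, each update is a contraction, so two sequences with different histories are mapped closer together with every additional symbol of shared recent history.

```python
import math
from typing import Dict, List

# Assumed values for illustration only (not from the cited papers).
RECURRENT_WEIGHT = 0.3  # "small" weight: |w| < 1 makes each update a contraction
INPUT_WEIGHTS: Dict[str, float] = {"a": -0.5, "b": 0.5}  # hypothetical symbol encodings


def encode(sequence: str, state: float = 0.0) -> float:
    """Run an untrained scalar tanh recurrent unit over a symbol sequence."""
    for symbol in sequence:
        state = math.tanh(RECURRENT_WEIGHT * state + INPUT_WEIGHTS[symbol])
    return state


def suffix_distances(prefix_a: str, prefix_b: str, shared: str) -> List[float]:
    """Distance between the two final states after k shared trailing symbols.

    Entry k compares prefix_a + shared[:k] with prefix_b + shared[:k],
    so the two sequences agree on exactly their last k symbols.
    """
    return [
        abs(encode(prefix_a + shared[:k]) - encode(prefix_b + shared[:k]))
        for k in range(len(shared) + 1)
    ]


if __name__ == "__main__":
    for k, dist in enumerate(suffix_distances("aaaa", "bbbb", "abba")):
        print(f"shared suffix length {k}: distance {dist:.6f}")
```

Since tanh is 1-Lipschitz, the distance after k shared symbols is bounded by 0.3^k times the initial distance, which is the finite-memory behaviour described above. The recursive (tree-structured) case generalized in the study is not covered by this sketch.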
Sophisticated gated recurrent neural network architectures like LSTMs and GRUs have been shown to be highly effective in a myriad of applications. We develop an un-gated unit, the statistical recurrent unit (SRU), that is able to learn long-term dependencies in data by only keeping moving averages of statistics. The SRU's architecture is simple, un-gated, and contains a comparable number of parameters to LSTMs; yet, SRUs compare favorably with more sophisticated LSTM and GRU alternatives, often outperforming one or both in various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an unbiased manner by optimizing respective architectures' hyperparameters in a Bayesian optimization scheme for both synthetic and real-world tasks.
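To make the idea of "keeping moving averages of statistics" concrete, the following is a minimal sketch of exponential moving averages maintained at several decay rates. It illustrates the averaging mechanism only and is not the full SRU: the statistics here are plain scalars supplied by the caller, whereas the actual unit computes its statistics with learned weights. The function names and the decay rates are assumptions chosen for illustration.

```python
from typing import List, Sequence


def update_averages(
    averages: Sequence[float], stat: float, alphas: Sequence[float]
) -> List[float]:
    """One step of exponential moving averages, one average per decay rate.

    A decay rate near 0 tracks only the latest statistic; a rate near 1
    changes slowly and therefore retains information from far in the past.
    """
    return [a * m + (1.0 - a) * stat for m, a in zip(averages, alphas)]


def run_averages(stats: Sequence[float], alphas: Sequence[float]) -> List[float]:
    """Fold a sequence of scalar statistics into multi-scale moving averages."""
    averages = [0.0] * len(alphas)  # assumed zero initial state
    for stat in stats:
        averages = update_averages(averages, stat, alphas)
    return averages


if __name__ == "__main__":
    # A single early spike followed by 20 zeros: the slow average (alpha = 0.99)
    # still carries a trace of the spike, while the fast one (alpha = 0.5) has
    # almost entirely forgotten it.
    spike = [1.0] + [0.0] * 20
    fast, slow = run_averages(spike, alphas=[0.5, 0.99])
    print(f"fast average: {fast:.8f}, slow average: {slow:.8f}")
```

Keeping several decay rates side by side gives a view of the past at multiple time scales without any gating, which is one plausible reading of how an un-gated unit can retain long-term information.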