Source Separation as a By-Product of Regularization
Hochreiter, Sepp, Schmidhuber, Jürgen
This paper reveals a previously ignored connection between two important fields: regularization and independent component analysis (ICA). We show that at least one representative of a broad class of algorithms (regularizers that reduce network complexity) extracts independent features as a by-product. This algorithm is Flat Minimum Search (FMS), a recent general method for finding low-complexity networks with high generalization capability. FMS works by minimizing both training error and required weight precision. According to our theoretical analysis, the hidden layer of an FMS-trained autoassociator attempts to code each input by a sparse code using as few simple features as possible.
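A minimal numpy sketch of this kind of setup, assuming a one-hidden-layer autoassociator and using a plain weight-magnitude penalty as a stand-in for the actual FMS complexity term (which is more involved); the last line merely checks how sparse the learned hidden code is:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 8))                 # toy input patterns
    n_in, n_hid = X.shape[1], 4
    W1 = 0.1 * rng.standard_normal((n_in, n_hid))     # encoder weights
    W2 = 0.1 * rng.standard_normal((n_hid, n_in))     # decoder weights
    lam, lr = 1e-3, 1e-2                              # penalty strength, learning rate

    for _ in range(2000):
        H = np.tanh(X @ W1)                           # hidden code
        Y = H @ W2                                    # reconstruction
        err = Y - X
        # gradients of reconstruction error plus the surrogate complexity penalty
        gW2 = H.T @ err / len(X) + lam * W2
        gH = (err @ W2.T) * (1.0 - H**2)
        gW1 = X.T @ gH / len(X) + lam * W1
        W1 -= lr * gW1
        W2 -= lr * gW2

    # sparseness check: average number of hidden units responding strongly per input
    H = np.tanh(X @ W1)
    print((np.abs(H) > 0.5).sum(axis=1).mean())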
LSTM can Solve Hard Long Time Lag Problems
Hochreiter, Sepp, Schmidhuber, Jürgen
Standard recurrent nets cannot deal with long minimal time lags between relevant signals. Several recent NIPS papers propose alternative methods. We first show: problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the proposed algorithms. We then use LSTM, our own recent algorithm, to solve a hard problem that can neither be quickly solved by random search nor by any other recurrent net algorithm we are aware of.

1 TRIVIAL PREVIOUS LONG TIME LAG PROBLEMS

Traditional recurrent nets fail in case of long minimal time lags between input signals and corresponding error signals [7, 3]. Many recent papers propose alternative methods, e.g., [16, 12, 1, 5, 9]. For instance, Bengio et al. investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation [3].
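A minimal numpy sketch of the random weight guessing baseline mentioned above, applied to a toy latch task of our own devising (an assumption, not one of the cited benchmarks): only the first input bit is informative, and it must be reproduced as the target after T steps:

    import numpy as np

    rng = np.random.default_rng(0)
    T, n_hid = 50, 4   # sequence length (the "time lag") and hidden units

    def run_rnn(Wx, Wh, Wo, first_bit):
        # only the first input is nonzero; the target equals that first bit
        h = np.zeros(n_hid)
        x = np.zeros(T)
        x[0] = first_bit
        for t in range(T):
            h = np.tanh(Wx * x[t] + Wh @ h)
        return float(Wo @ h)

    def guess_weights(max_guesses=100000, tol=0.25):
        for k in range(1, max_guesses + 1):
            # sample one complete weight setting at random ("weight guessing")
            Wx = rng.uniform(-2, 2, n_hid)
            Wh = rng.uniform(-2, 2, (n_hid, n_hid))
            Wo = rng.uniform(-2, 2, n_hid)
            if all(abs(run_rnn(Wx, Wh, Wo, b) - b) < tol for b in (0.0, 1.0)):
                return k   # number of guesses until the task is solved
        return None

    print(guess_weights())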
Predictive Coding with Neural Nets: Application to Text Compression
Schmidhuber, Jürgen, Heil, Stefan
To compress text files, a neural predictor network P is used to approximate the conditional probability distribution of possible "next characters", given n previous characters. P's outputs are fed into standard coding algorithms that generate short codes for characters with high predicted probability and long codes for highly unpredictable characters. Tested on short German newspaper articles, our method outperforms widely used Lempel-Ziv algorithms (used in UNIX functions such as "compress" and "gzip").
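A minimal sketch of the pipeline, with a simple order-n character-frequency model standing in for the neural predictor P; it reports the ideal code length (-log2 of the predicted probability per character) that an arithmetic-style coder driven by such predictions would approach:

    import math
    from collections import Counter, defaultdict

    def train_predictor(text, n=3):
        """Count next-character frequencies for each n-character context."""
        counts = defaultdict(Counter)
        for i in range(n, len(text)):
            counts[text[i - n:i]][text[i]] += 1
        return counts

    def predicted_bits(text, counts, n=3, alphabet_size=256):
        """Ideal total code length when each character costs
        -log2 P(char | previous n chars), as an arithmetic coder would realize."""
        bits = 0.0
        for i in range(n, len(text)):
            ctx, ch = text[i - n:i], text[i]
            c = counts[ctx]
            # Laplace smoothing keeps unseen characters at nonzero probability
            p = (c[ch] + 1) / (sum(c.values()) + alphabet_size)
            bits += -math.log2(p)
        return bits

    train = "the quick brown fox jumps over the lazy dog " * 40
    test = "the lazy brown dog"
    model = train_predictor(train)
    print(predicted_bits(test, model), "bits vs", 8 * len(test), "bits uncoded")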
Simplifying Neural Nets by Discovering Flat Minima
Hochreiter, Sepp, Schmidhuber, Jürgen
We present a new algorithm for finding low-complexity networks with high generalization capability. The algorithm searches for large connected regions of so-called "flat" minima of the error function. In the weight-space environment of a "flat" minimum, the error remains approximately constant. Using an MDL-based argument, flat minima can be shown to correspond to low expected overfitting. Although our algorithm requires the computation of second order derivatives, it has backprop's order of complexity. Experiments with feedforward and recurrent nets are described. In an application to stock market prediction, the method outperforms conventional backprop, weight decay, and "optimal brain surgeon".
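A minimal numpy sketch of the flatness notion itself rather than of the search algorithm: for a toy two-weight model with two zero-error minima, it estimates how much the training error rises under small random weight perturbations, so the flatter minimum shows the smaller rise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    y = 1.0 * x                                      # target function: y = x

    def err(a, b):
        return float(np.mean((a * b * x - y) ** 2))  # two-weight model y_hat = a*b*x

    def error_rise(a, b, radius=0.1, samples=500):
        """Average rise in training error under small random weight
        perturbations (a smaller rise means a flatter minimum)."""
        base = err(a, b)
        rises = [err(a + rng.uniform(-radius, radius),
                     b + rng.uniform(-radius, radius)) - base
                 for _ in range(samples)]
        return float(np.mean(rises))

    # two zero-error minima of the same model: a balanced one and a badly
    # scaled one; the balanced one lies in a much flatter region of weight space
    print(error_rise(1.0, 1.0), error_rise(0.01, 100.0))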
Learning Unambiguous Reduced Sequence Descriptions
Schmidhuber, Jürgen
Do you want your neural net algorithm to learn sequences? Do not limit yourself to conventional gradient descent (or approximations thereof). Instead, use your sequence learning algorithm (any will do) to implement the following method for history compression. No matter what your final goals are, train a network to predict its next input from the previous ones. Since only unpredictable inputs convey new information, ignore all predictable inputs but let all unexpected inputs (plus information about the time step at which they occurred) become inputs to a higher-level network of the same kind (working on a slower, self-adjusting time scale). Go on building a hierarchy of such networks.
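A minimal sketch of one level of this history compression, with a simple frequency-based successor predictor standing in for the sequence learner (the text above allows any such learner); only mispredicted inputs, together with their time steps, are passed up to the next level:

    from collections import Counter, defaultdict

    def compress_level(seq):
        """Return the reduced sequence for the next level: the unexpected
        symbols plus the time steps at which they occurred."""
        counts = defaultdict(Counter)    # online predictor: most frequent successor so far
        reduced, prev = [], None
        for t, sym in enumerate(seq):
            predicted = counts[prev].most_common(1)[0][0] if counts[prev] else None
            if predicted != sym:         # unpredictable input -> new information
                reduced.append((t, sym))
            counts[prev][sym] += 1       # update the predictor online
            prev = sym
        return reduced

    seq = list("abababababcabababab")
    level1 = compress_level(seq)
    print(level1)                        # only the surprising events survive
    # a higher-level network of the same kind would now be trained on level1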