Notes on Hierarchical Multiscale Recurrent Neural Networks

#artificialintelligence 

Lots of prior work on hierarchy (hierarchical RNN / stacked RNN) and multiple timescales (LSTM, clockwork RNN), but it all relies on pre-defined boundaries, pre-defined scales, or soft non-hierarchical boundaries. The HM-RNN avoids "soft" gating, which leads to the "curse of updating every timestep", and instead makes discrete (binary) boundary decisions. Discrete decisions are difficult to optimize because the step function is non-smooth: its gradient is zero almost everywhere and undefined at the threshold. The model uses the straight-through estimator (as an alternative to REINFORCE) to learn the discrete variables. The simplest variant uses a step function on the forward pass and the gradient of a hard sigmoid on the backward pass for gradient estimation.
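
To make the simplest straight-through variant concrete, here is a minimal scalar sketch in plain Python. It is not the authors' code. The function names, the `slope` parameter, and the specific hard sigmoid form `max(0, min(1, (slope * x + 1) / 2))` are assumptions chosen for illustration; in a real model this would be a custom gradient inside an autodiff framework.

```python
def hard_sigmoid(x, slope=1.0):
    # Assumed form: a piecewise-linear approximation of the sigmoid,
    # clipped to the range [0, 1].
    return max(0.0, min(1.0, (slope * x + 1.0) / 2.0))


def hard_sigmoid_grad(x, slope=1.0):
    # Derivative of hard_sigmoid: slope / 2 on the linear segment,
    # 0 where the output is clipped.
    return slope / 2.0 if abs(slope * x) < 1.0 else 0.0


def boundary_forward(x, slope=1.0):
    # Forward pass: a hard binary decision (step function).
    return 1.0 if hard_sigmoid(x, slope) > 0.5 else 0.0


def boundary_backward(x, upstream_grad, slope=1.0):
    # Backward pass: pretend the forward pass was the hard sigmoid
    # and pass the gradient "straight through" the step function.
    return upstream_grad * hard_sigmoid_grad(x, slope)
```

For a pre-activation of `0.3`, `boundary_forward` returns a hard `1.0`, while `boundary_backward` still yields a non-zero gradient, so the parameters producing the pre-activation can be trained. Once the pre-activation falls in the clipped region (for example `5.0` with `slope=1.0`), the estimated gradient is zero.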