Hamiltonian prior to Disentangle Content and Motion in Image Sequences

Asif Khan, Amos Storkey

arXiv.org Artificial Intelligence 

The ability to learn to generate artificial image sequences has diverse uses, from animation, key-frame generation, and summarisation to restoration, and has been explored in previous work over many decades (Hogg, 1983; Hurri and Hyvärinen, 2003; Cremers and Yuille, 2003; Storkey and Williams, 2003; Kannan et al., 2005). However, learning to generate arbitrary sequences is not enough; to provide real value, the user must be able to control aspects of the sequence generation, such as the motion being enacted or the characteristics of the agent performing an action. To enable this, we must learn to decompose image sequences into content and motion characteristics, so that we can apply learnt motions to new objects or vary the types of motion being applied.

Deep generative models (DGMs) such as variational autoencoders (VAEs) (Kingma and Welling, 2013) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) use neural networks (NNs) to transform samples from a prior distribution over lower-dimensional latent factors into samples from the data distribution itself. Recent developments (Chung et al., 2015; Srivastava et al., 2015; Hsu et al., 2017; Yingzhen and Mandt, 2018) extend VAEs to sequences by applying Recurrent Neural Networks (RNNs) to the representations of temporal frames.
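As a concrete illustration of this latent-variable view, the sketch below implements a minimal VAE in PyTorch: an encoder maps a data point to a Gaussian posterior over low-dimensional latents, the reparameterisation trick lets gradients flow through the sampling step, and the training objective is the negative evidence lower bound (ELBO). The module names and dimensions are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder maps data x to a Gaussian posterior over
    low-dimensional latents z; the decoder maps z back to data space."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # posterior mean
        self.logvar = nn.Linear(h_dim, z_dim)   # posterior log-variance
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(z)
        # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I));
        # binary cross-entropy assumes x is scaled to [0, 1].
        rec = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl
```

Sampling from the model then amounts to drawing z from the standard Gaussian prior and passing it through the decoder.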
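The sequential extensions cited above instead place the temporal structure in the latent space. The sketch below is a simplified illustration in the spirit of these models (cf. Chung et al., 2015) rather than any specific architecture from the citations: an LSTM parameterises an autoregressive Gaussian prior p(z_t | z_{<t}) over per-frame latents, and each sampled z_t would be decoded into an image frame by a decoder like the one above.

```python
import torch
import torch.nn as nn

class SequentialPrior(nn.Module):
    """Sketch of an RNN prior over per-frame latents z_1..z_T: each
    p(z_t | z_<t) is a Gaussian whose parameters come from an LSTM
    run over the previously sampled latents."""
    def __init__(self, z_dim=16, h_dim=64):
        super().__init__()
        self.rnn = nn.LSTMCell(z_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def sample(self, T, batch=1):
        # Start from a zero latent and zero recurrent state.
        z = torch.zeros(batch, self.mu.out_features)
        h = torch.zeros(batch, self.rnn.hidden_size)
        c = torch.zeros(batch, self.rnn.hidden_size)
        zs = []
        for _ in range(T):
            h, c = self.rnn(z, (h, c))
            mu, logvar = self.mu(h), self.logvar(h)
            # Draw z_t ~ N(mu_t, diag(exp(logvar_t))).
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            zs.append(z)
        return torch.stack(zs, dim=1)  # (batch, T, z_dim) latent trajectory
```

In such models the RNN state carries the motion dynamics forward in time, which is exactly the mechanism the content/motion decomposition must tease apart from the static appearance of the object.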