An Empirical Study of Autoregressive Pre-training from Videos

Rajasegaran, Jathushan, Radosavovic, Ilija, Ravishankar, Rahul, Gandelsman, Yossi, Feichtenhofer, Christoph, Malik, Jitendra

arXiv.org Artificial Intelligence 

In a paper published in 1951, Shannon, having just published the foundational papers of information theory, proposed a "guessing game" of next word prediction to estimate the entropy of English (Shannon, 1951). Nearly 70 years later, training a high-capacity transformer network (Vaswani et al., 2017) on this task, provided the generative pre-training backbone for Large Language Models (Radford et al., 2018; Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). Less well known is the fact that in 1954, Fred Attneave (Attneave, 1954) proposed an analog of Shannon's task for images. To quote "We may divide the picture into arbitrarily small elements which we "transmit" to a subject (S) in a cumulative sequence, having them guess at the color of each successive element until they are correct. This method of analysis resembles the scanning process used in television and facsimile systems and accomplishes the like purpose of transforming two spatial dimensions into a single sequence in time".