M2T: Masking Transformers Twice for Faster Decoding
Fabian Mentzer, Eirikur Agustsson, Michael Tschannen
In MaskGIT [11], the authors use a VQ-GAN [16] to map images to vector-quantized tokens, and learn a transformer to predict the distribution of these tokens. The key novelty of the approach was to use BERT-like [13] random masks during training, and then to predict tokens in groups during inference, sampling tokens in the same group in parallel at each inference step. Thereby, each inference step is conditioned on the tokens generated in previous steps. A big advantage of BERT-like training with grouped inference versus prior state-of-the-art is that considerably fewer steps are required to produce realistic images (typically 10-20, rather than one per token).

Motivated by this, we aim to employ masked transformers for neural image compression (see Figure 1). Previous work has used masked and unmasked transformers in the entropy model for video compression [37, 25] and image compression [29, 22, 15]. However, these models are often either prohibitively slow [22] or lag in rate-distortion performance [29, 15]. In this paper, we show a conceptually simple transformer-based approach that is state-of-the-art in neural image compression, at practical runtimes. The model uses off-the-shelf transformers and does not rely on …
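To make the grouped-inference procedure concrete, here is a minimal Python sketch (not from the paper) of MaskGIT-style decoding: all positions start masked, and at each step one group of positions is sampled in parallel, conditioned on everything sampled so far. The function `predict_logits` is a hypothetical stand-in for the masked transformer, and the fixed random grouping is a simplification; MaskGIT actually chooses which tokens to reveal by prediction confidence under a masking schedule.

```python
import numpy as np

MASK = -1  # sentinel for "not yet sampled" positions

def grouped_masked_decode(predict_logits, num_tokens, vocab_size,
                          num_steps=12, seed=0):
    """Sample all tokens in `num_steps` grouped inference steps.

    predict_logits: callable mapping the current (partially masked)
      token array of shape (num_tokens,) to logits of shape
      (num_tokens, vocab_size). Stand-in for the masked transformer.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)

    # Fixed random split of positions into groups (a simplification;
    # MaskGIT picks which tokens to keep by confidence under a schedule).
    groups = np.array_split(rng.permutation(num_tokens), num_steps)

    for group in groups:
        logits = predict_logits(tokens)  # conditioned on tokens so far
        # Softmax over the vocabulary at every position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Sample this group's positions "in parallel": none of them
        # sees the others' new values until the next step.
        for pos in group:
            tokens[pos] = rng.choice(vocab_size, p=probs[pos])
    return tokens

# Toy check with a context-free "model" that returns uniform logits.
uniform_model = lambda toks: np.zeros((toks.shape[0], 16))
print(grouped_masked_decode(uniform_model, num_tokens=64, vocab_size=16))
```

In this toy run, 64 tokens are produced in 12 forward passes rather than 64 sequential ones, which is the "typically 10-20 steps, rather than one per token" advantage described above.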
arXiv.org Artificial Intelligence
Apr-14-2023