

this is an extension of fig


MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation A Discussion on Masked Image Reconstruction

Neural Information Processing Systems

In the other columns, we randomly mask some tokens (first row) and, in the second stage, sample the invisible tokens conditioned on the visible tokens. We show top-1 results in 1 step (second row) and random results in 8 steps (third row). Interestingly, our model with 95% of tokens masked (i.e., 12 of 256 tokens are visible in each channel) is able to generate pluralistic images in only one step by selecting the top-1 token. More importantly, the corresponding results reflect identity attributes of the original unmasked inputs. When the tokens are fully masked (i.e., 100% mask ratio), the model generates plausible and diverse results by randomly sampling tokens over multiple steps.
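The two decoding modes described above can be illustrated with a minimal sketch. It is not the authors' implementation: `MASK_ID`, `random_mask`, `toy_logits`, `fill_top1`, and `fill_iterative` are hypothetical names, the second-stage model is replaced by random logits that ignore the visible tokens, and the equal-share reveal schedule in the multi-step loop is an assumption made only for illustration.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel marking a masked (invisible) position


def random_mask(tokens, n_visible, rng):
    """Keep n_visible randomly chosen tokens and mask all the others."""
    masked = np.full_like(tokens, MASK_ID)
    keep = rng.choice(tokens.size, size=n_visible, replace=False)
    masked[keep] = tokens[keep]
    return masked


def toy_logits(tokens, vocab_size, rng):
    """Stand-in for the second-stage model (assumption): random logits.

    A real model would condition its predictions on the visible tokens.
    """
    return rng.normal(size=(tokens.size, vocab_size))


def fill_top1(tokens, vocab_size, rng):
    """One step: every masked position takes its highest-scoring token."""
    logits = toy_logits(tokens, vocab_size, rng)
    out = tokens.copy()
    hidden = out == MASK_ID
    out[hidden] = logits.argmax(axis=1)[hidden]
    return out


def fill_iterative(tokens, vocab_size, rng, steps=8):
    """Multi-step: randomly sample tokens, filling a share of positions per step."""
    out = tokens.copy()
    for step in range(steps):
        hidden = np.flatnonzero(out == MASK_ID)
        if hidden.size == 0:
            break
        logits = toy_logits(out, vocab_size, rng)
        # Softmax over the vocabulary for each position.
        z = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(z)
        probs /= probs.sum(axis=1, keepdims=True)
        # Assumed schedule: an equal share of the remaining masked
        # positions per step, so the last step fills everything left.
        n_reveal = int(np.ceil(hidden.size / (steps - step)))
        chosen = rng.choice(hidden, size=n_reveal, replace=False)
        for pos in chosen:
            out[pos] = rng.choice(vocab_size, p=probs[pos])
    return out
```

With 256 tokens and `n_visible=12`, the mask ratio is 1 - 12/256, or about 95%, matching the setting in the text; passing `n_visible=0` gives the fully masked case, which `fill_iterative` then fills over 8 steps.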