this is an extension of Fig
MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
A Discussion on Masked Image Reconstruction
In the other columns, we randomly mask some tokens (first row) and, in the second stage, sample the invisible tokens conditioned on the visible tokens. We show top-1 results obtained in 1 step (second row) and randomly sampled results obtained in 8 steps (third row), respectively. Interestingly, with 95% of the tokens masked (i.e., only 12 of the 256 tokens in each channel are visible), our model is able to generate pluralistic images in a single step by selecting the top-1 token. More importantly, the corresponding results reflect the identity attributes of the original unmasked inputs. When the tokens are fully masked (i.e., a 100% mask ratio), the model generates plausible and diverse results by randomly sampling tokens over multiple steps.
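The two sampling settings described above (top-1 in 1 step, random sampling in 8 steps) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_logits` is a hypothetical stand-in for the second-stage predictor, the codebook size of 64 is arbitrary, only a single channel of 256 tokens is handled, and the even split of masked tokens across steps is an assumed schedule chosen for simplicity.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked (invisible) token


def toy_logits(tokens, vocab_size, rng):
    # Hypothetical stand-in for the second-stage predictor: random logits.
    # A real model would condition these logits on the visible tokens.
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in tokens]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fill_masked(tokens, vocab_size, steps, top1, seed=0):
    """Fill MASK positions over `steps` rounds, keeping visible tokens fixed."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = toy_logits(tokens, vocab_size, rng)
        # Assumed schedule: split the remaining masked tokens evenly
        # over the remaining steps, so the last step fills everything.
        n_reveal = math.ceil(len(masked) / (steps - step))
        proposals = []
        for i in masked:
            probs = softmax(logits[i])
            if top1:
                tok = max(range(vocab_size), key=probs.__getitem__)
            else:
                tok = rng.choices(range(vocab_size), weights=probs, k=1)[0]
            proposals.append((probs[tok], i, tok))
        # Commit the most confident proposals; the rest stay masked.
        proposals.sort(reverse=True)
        for _, i, tok in proposals[:n_reveal]:
            tokens[i] = tok
    return tokens


# 12 visible tokens out of 256, as in the 95% setting described above.
num_tokens, num_visible, vocab_size = 256, 12, 64
mask_ratio = 1 - num_visible / num_tokens  # 0.953125, i.e. about 95%

rng = random.Random(0)
original = [rng.randrange(vocab_size) for _ in range(num_tokens)]
visible = set(rng.sample(range(num_tokens), num_visible))
masked_input = [t if i in visible else MASK for i, t in enumerate(original)]

one_step = fill_masked(masked_input, vocab_size, steps=1, top1=True)
eight_step = fill_masked([MASK] * num_tokens, vocab_size, steps=8, top1=False)
```

With `steps=1` every masked position is filled at once from the arg-max token, whereas with `steps=8` only the most confident sampled tokens are committed in each round, which is one common way to trade speed for diversity in masked-token generation.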