ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis