Pushing the Boundaries of State Space Models for Image and Video Generation