$\epsilon$-VAE: Denoising as Visual Decoding