Modeling Spatio-temporal Extremes via Conditional Variational Autoencoders