Everything is a Video: Unifying Modalities through Next-Frame Prediction