Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction