A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos