Class-attention Video Transformer for Engagement Intensity Prediction