Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
