TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Open in new window