One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
