Dense Video Understanding with Gated Residual Tokenization

Open in new window