Dense Video Understanding with Gated Residual Tokenization