Appendix for Video-based Human-Object Interaction Detection from Tubelet Tokens

Danyang Tu¹, Wei Sun