Supplementary Material for "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens"