Two-Stream Transformer Architecture for Long Video Understanding