Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification