Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning