Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding