Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics