An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture