Multimodal and Multilingual Embeddings for Large-Scale Speech Mining