Tri-Modal Motion Retrieval by Learning a Joint Embedding Space